Forgot one important one: set "generate.max.per.host" to something reasonable so you won't end up fetching URLs from only a small number of hosts, which is very slow with the default settings.
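
As a rough sketch of what that could look like: the property below would go inside the root element of conf/nutch-site.xml, which overrides nutch-default.xml. The value 100 is only an illustrative assumption, not a recommendation; a sensible number depends on the size of your fetchlist and how many distinct hosts it contains.

```xml
<!-- Sketch only: place inside the root element of conf/nutch-site.xml -->
<property>
  <name>generate.max.per.host</name>
  <!-- Assumed example value: caps the number of URLs from any one host
       in a single generated fetchlist; -1 means no limit -->
  <value>100</value>
</property>
```

The idea is that a fetchlist spread over many hosts keeps more fetcher threads busy, since requests to the same host are spaced out for politeness.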

--
Sami Siren

Sami Siren wrote:
> Some simple rules for generally speeding things up
>
> 1. Crawl only the content you are going to handle (do not fetch
> for example pdf-files if you don't need them, also disable all unneeded
> parsers)
>
> 2. If using regex-urlfilter: If you don't need the rule
> "-.*(/.+?)/.*?\1/.*?\1/" remove it (also keep the number of rules as
> small as possible still remembering #1 and #3)
>
> 3. Check your parser configuration (SEE NUTCH-362) so your CPU won't end
> up parsing all kinds of binary content with the text parser.
>
> You might also check the variables like "fetcher.server.delay" and
> "fetcher.threads.per.host". (and remember to keep your fetcher polite!)
>
> I am using something like 300 for "fetcher.threads" for fetching with
> 0.8.1 single athlon 64, 1 GB of memory.
>
> I am also in the process of fixing some IO related bottlenecks and will get
> back to that hopefully sooner than later.
>
> --
> Sami Siren
>
>
> Marco Vanossi wrote:
>> Hi,
>>
>> Do you have some hints that would improve speed for the following nutch
>> commands?
>>
>> ./nutch generate db segments -topN 10000000
>> s=`ls -d segments/2* | tail -1`
>> ./nutch fetch $s
>> ./nutch updatedb db $s
>> ./nutch index $s
>> ./nutch dedup segments tmpfile
>>
>> I mean, do you have some hints for the numbers set in
>> nutch-default.xml for, for example:
>> fetcher.threads (I'm using 10.000), etc....
>> Let's say it is running on a machine with 12GB RAM, and 2.000GB HD.
>>
>> Thank you very much for any help.
>>
>> Marco
>>

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
