Hi, I am using Nutch 0.9 and trying to crawl our intranet site (~60,000 pages, ~28,000 of them HTML). I've seen other posts where people mention getting their crawler to 20 pages/sec, but the best I've seen so far is only 8 pages/sec.
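To put those two rates in perspective for a site of this size, here is a rough back-of-the-envelope sketch. It assumes a steady fetch rate over the whole crawl, which a real crawl will not have; `crawl_minutes` and `TOTAL_PAGES` are illustrative names of my own, not anything from Nutch.

```python
def crawl_minutes(pages: int, pages_per_sec: float) -> float:
    """Minutes needed to fetch `pages` at a steady `pages_per_sec` (an idealization)."""
    if pages_per_sec <= 0:
        raise ValueError("pages_per_sec must be positive")
    return pages / pages_per_sec / 60.0


TOTAL_PAGES = 60_000  # approximate site size quoted above

# Compare the rate observed so far (8) with the rate reported by others (20).
for rate in (8, 20):
    print(f"{rate} pages/sec -> {crawl_minutes(TOTAL_PAGES, rate):.0f} minutes")
```

Under that assumption, the gap between 8 and 20 pages/sec is the difference between roughly two hours and under one hour for a full pass.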
I've also read that fetcher threads tend to block when they try to fetch pages from the same host. So I'm wondering which settings I should use to get the best performance. My current configuration in nutch-site.xml is as follows:

<property>
  <name>fetcher.threads.fetch</name>
  <value>200</value>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>50</value>
</property>
<property>
  <name>http.max.delays</name>
  <value>1</value>
</property>

Any pointers are greatly appreciated! Thanks in advance.

AL
--
View this message in context: http://www.nabble.com/tweaking-config-files-for-better-performance-tf4119552.html#a11715927
Sent from the Nutch - User mailing list archive at Nabble.com.
