I'd also be very much interested in knowing these! On 11/18/2012 07:32 PM, kiran chitturi wrote:
Hi!

I have been running crawls with Nutch over 13,000 documents (protocol http) on a single machine, and it takes 2-3 days to finish. I am using the 2.x version of Nutch, with a depth of 20 and a topN of 1000 (2000), started with 'sh bin/nutch crawl -depth 20 -topN 1000'. I keep running into exceptions after about one day. Sometimes it is a memory exception (heap space, after the parsing of the documents); sometimes a MySQL connection error, because the crawler went on to fetch 10,000 documents after I ran 'sh bin/nutch crawl -continue -depth 10 -topN 700' (the first crawl had failed, so I increased the heap space and the timeout and continued).

I am wondering what the best practices are for running Nutch crawls. Is a full crawl a good thing to do, or should I do it in steps (generate, fetch, parse, updatedb)? Also, how do I choose the values of the parameters? Even if I give a topN of 700, the fetcher goes on to fetch 3000 documents. Which parameters have a high impact on the running time of the crawl? All these options may be system-dependent and may not have general values that work for everyone, but I am wondering what Nutch users and developers do here when running big crawls.

Some of the exceptions only appear after 1 or 2 days of running the crawler, so it is hard to know how to fix them beforehand. Are there any common exceptions that Nutch runs into frequently? Is there any documentation of Nutch best practices? I have also seen crawls run for a long time, sometimes because of filtering.

Sorry for the long email.

Thank you,
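For reference, the stepwise cycle the question mentions (generate, fetch, parse, updatedb) looks roughly like this with the Nutch 2.x command-line tools. This is a minimal sketch, not a tested recipe: the exact flags and the batch-id handling can differ between 2.x releases, and the seed directory name `urls/` is an assumption.

```shell
# One-time: inject seed URLs from a local directory (assumed to be urls/)
bin/nutch inject urls/

# One crawl round, repeated as many times as you want "depth":
# Select up to 1000 top-scoring URLs into a new batch
bin/nutch generate -topN 1000

# Fetch, parse, and update the db for the generated batch
# (-all processes all unfetched/unparsed batches; a specific
#  batch id printed by generate can be used instead)
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb
```

Running the steps individually like this, rather than via the all-in-one crawl command, makes it easier to restart from the failing step (e.g. re-parse after a heap-space error) instead of redoing the whole crawl.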

