I'd also be very much interested in knowing these! On 11/18/2012 07:32 PM, kiran chitturi wrote:
Hi!

I have been running crawls with Nutch over 13,000 documents (protocol http) on a single machine, and it takes 2-3 days to finish. I am using the 2.x version of Nutch, with a depth of 20 and a topN of 1000 (2000), started with 'sh bin/nutch crawl -depth 20 -topN 1000'. I keep running into exceptions after about one day. Sometimes it is a memory exception (heap space, after the parsing of the documents); sometimes a MySQL connection error, because the crawler went on to fetch 10,000 documents after I ran 'sh bin/nutch crawl -continue -depth 10 -topN 700' (the first crawl had failed, so I increased the heap space and the timeout and continued).

I am wondering what the best practices are for running Nutch crawls. Is a full crawl a good thing to do, or should I do it in steps (generate, fetch, parse, updatedb)? Also, how do I choose the values of the parameters? Even if I give a topN of 700, the fetcher goes on to fetch 3000 documents. Which parameters have a high impact on the running time of the crawl? All these options may be system-dependent and may not have general values that work for everyone, but I am wondering what Nutch users and developers do here when running big crawls.

Some of the exceptions only appear after 1 or 2 days of running the crawler, so it is hard to know how to fix them beforehand. Are there any common exceptions that Nutch runs into frequently? Is there any documentation of Nutch best practices? I have also seen crawls run for a long time, sometimes because of filtering.

Sorry for the long email.

Thank you,
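For reference, the stepwise cycle the question mentions (generate, fetch, parse, updatedb) looks roughly like this with the Nutch 2.x command-line tools. This is a minimal sketch, not a tested recipe: the exact flags and the batch-id handling can differ between 2.x releases, and the seed directory name `urls/` is an assumption.

```shell
# One-time: inject seed URLs from a local directory (assumed to be urls/)
bin/nutch inject urls/

# One crawl round, repeated as many times as you want "depth":
# Select up to 1000 top-scoring URLs into a new batch
bin/nutch generate -topN 1000

# Fetch, parse, and update the db for the generated batch
# (-all processes all unfetched/unparsed batches; a specific
#  batch id printed by generate can be used instead)
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb
```

Running the steps individually like this, rather than via the all-in-one crawl command, makes it easier to restart from the failing step (e.g. re-parse after a heap-space error) instead of redoing the whole crawl.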

