> > I use a depth of 20 and a topN of 1000 (2000) when I initiate
> > 'sh bin/nutch crawl -depth 20 -topN 1000'.
> >
> > I keep running into exceptions after one day. Sometimes it's
> >
> > - Memory Exception: Heap Space (after the parsing of the documents)
>
> After parsing the documents? That should be during updatedb, but are you
> sure? That job hardly ever runs out of memory.
The crawl class is deprecated; you should use the crawl script instead, or
write your own script to call the commands individually. The crawl class can
indeed have memory issues with runaway parse threads (i.e. timeouts). This is
mentioned on this list on a regular basis.

> > - Mysql Connection Error (because the crawler went on to fetch 10,000
> > documents after the command 'sh bin/nutch crawl -continue -depth 10 -topN
> > 700' as the crawl failed because

We have numerous bugs filed in JIRA for the SQL backend. My advice would be
to use a more stable one like HBase.

> > Also, how do I choose the values of the parameters? Even if I give topN
> > as 700, the fetcher goes on to fetch 3000 documents. Which parameters
> > have a high impact on the running time of the crawl?
>
> Are you sure? The generator (at least in trunk) honors the topN parameter
> and will not generate more than specified. Keep in mind that when using the
> crawl script with the depth parameter you're multiplying topN by depth:
> depth 10 with topN 700 allows up to 7,000 fetches over the whole crawl.

See my comment above. The SQL backend should be considered broken.

Julien

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
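The per-round command sequence behind "call the commands individually" could be sketched roughly as below. This is a dry-run sketch, not a tested crawl script: the paths (`urls/`, `crawl/`), the `DEPTH`/`TOPN` values, and the segment-selection shortcut are illustrative, and `run` only prints each command instead of executing it.

```shell
#!/bin/sh
# Sketch of the Nutch 1.x crawl cycle, one round per "depth", instead of the
# deprecated crawl class. Dry-run: 'run' echoes each command; drop the echo
# (and pick the real newest segment) to execute for real.
run() { echo "+ $*"; }

CRAWLDB=crawl/crawldb
SEGMENTS=crawl/segments
DEPTH=3        # illustrative; the original poster used 20
TOPN=1000      # per-round cap, so total fetches <= DEPTH * TOPN

# seed the crawldb once from a directory of seed-URL files
run bin/nutch inject $CRAWLDB urls

i=1
while [ "$i" -le "$DEPTH" ]; do
  run bin/nutch generate $CRAWLDB $SEGMENTS -topN $TOPN
  # in a real run the newest segment would be picked up, e.g.:
  #   SEGMENT=$(ls -d $SEGMENTS/2* | tail -1)
  SEGMENT=$SEGMENTS/latest
  run bin/nutch fetch $SEGMENT
  run bin/nutch parse $SEGMENT
  run bin/nutch updatedb $CRAWLDB $SEGMENT
  i=$((i + 1))
done
```

Splitting the steps this way also means a failed round (e.g. an out-of-memory parse) can be retried on its own segment rather than restarting the whole crawl.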

