On 2010-08-02 10:17, Scott Gonyea wrote:
The big problem that I am facing, thus far, occurs on the 4th fetch.
All but 1 or 2 maps complete. All of the running reduces stall (0.00
MB/s), presumably because they are waiting on that map to finish? I
really don't know and it's frustrating.

Yes, all map tasks need to finish before the reduce tasks can proceed. The reason is that each reduce task receives a portion of the key space (and the associated values) according to the Partitioner, and in order to present a complete <key, list(value)> to your reducer it must first collect all the values under that key, whichever map task produced them, and then sort them.
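To make this concrete, here is a minimal sketch (plain Java, not Nutch code) of how Hadoop's default hash partitioning routes keys to reducers - the class and method names below just mirror the idea, they are not the actual Hadoop API:

```java
// Sketch of hash partitioning: every map task routes each (key, value) pair
// to a reduce task based only on the key, so a given key's values always land
// on the same reducer regardless of which map task emitted them. This is why
// a reducer cannot start its final sort/merge until every map has finished.
public class PartitionSketch {
    // Mirrors the usual scheme: non-negative hash modulo the number of reducers.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        // The same key always maps to the same reducer:
        System.out.println(getPartition("http://example.com/", reducers)
                == getPartition("http://example.com/", reducers)); // true
    }
}
```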

The failing tasks probably fail for some other reason, and very likely (based on my experience) the failure is related to particular URLs. E.g. regex URL filtering can choke on pathological URLs, such as URLs 20kB long or containing '\0'. In my experience it's best to keep regex filtering to a minimum if you can, and use the other urlfilters (prefix, domain, suffix, custom) to limit your crawling frontier - there are simply too many ways a regex engine can lock up.
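One cheap defensive measure is to reject pathological URLs before any regex ever sees them. The sketch below is illustrative only (the class name, method and length limit are my own assumptions, not Nutch's API):

```java
// Sketch: a cheap sanity guard placed in front of regex URL filtering.
// Rejects very long URLs and URLs containing control characters ('\0',
// newlines, etc.) so the regex engine never has a chance to lock up on them.
public class UrlGuard {
    static final int MAX_URL_LEN = 2048; // assumption: a sane upper bound

    static boolean looksSane(String url) {
        if (url == null || url.length() > MAX_URL_LEN) return false;
        for (int i = 0; i < url.length(); i++) {
            if (url.charAt(i) < 0x20) return false; // control chars, incl. '\0'
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(looksSane("http://example.com/page")); // true
        System.out.println(looksSane("http://bad.example/\u0000oops")); // false
        StringBuilder sb = new StringBuilder("http://long.example/");
        for (int i = 0; i < 20000; i++) sb.append('a'); // a 20kB monster URL
        System.out.println(looksSane(sb.toString())); // false
    }
}
```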

Please check the logs of the failing tasks. If you see that a task is stalled, you can also log in to the node and generate a thread dump a few times in a row (kill -SIGQUIT <pid>) - if every thread dump shows threads busy in regex processing, then that's likely your problem.

My scenario:
  # Sites: 10,000-30,000 per crawl
  Depth: ~5
  Content: Text is all that I care for. (HTML/RSS/XML)
  Nodes: Amazon EC2 (ugh)
  Storage: I've performed crawls with HDFS and with Amazon S3. I thought
  S3 would be more performant, yet it doesn't appear to affect matters.
  Cost vs Speed: I don't mind throwing EC2 instances at this to get it
  done quickly... But I can't imagine I need much more than 10-20
  mid-size instances for this.

That's correct - with this number of unique sites, the maximum throughput of your crawl will ultimately be limited by the politeness limits (# of requests/site/sec).
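A quick back-of-envelope illustration of that ceiling (the 5-second per-host delay below is an assumption - check your fetcher.server.delay setting, the figures are not from the original message):

```java
// Back-of-envelope politeness ceiling: if each site may be hit at most once
// every `delaySeconds`, then the whole crawl can never exceed
// (#unique sites / delay) pages per second, no matter how many nodes you add.
public class Throughput {
    public static void main(String[] args) {
        int uniqueSites = 10_000;   // lower end of the scenario above
        double delaySeconds = 5.0;  // assumed per-host politeness delay
        double maxPagesPerSec = uniqueSites / delaySeconds;
        System.out.println(maxPagesPerSec); // 2000.0 pages/s, theoretical ceiling
    }
}
```

So adding more EC2 instances past the point where you can sustain that rate buys you nothing.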


Can anyone share their own experiences in the performance they've
seen?

There is a very simple benchmark in trunk/ that you could use to measure the raw performance (data processing throughput) of your EC2 cluster. The real-life performance, though, will depend on many other factors, such as the number of unique sites, their individual speed, and (rarely) the total bandwidth at your end.


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
