On 2010-08-02 10:17, Scott Gonyea wrote:
The big problem that I am facing, thus far, occurs on the 4th fetch.
All but 1 or 2 maps complete. All of the running reduces stall (0.00
MB/s), presumably because they are waiting on that map to finish? I
really don't know and it's frustrating.

Yes, all map tasks need to finish before the reduce tasks can proceed. The reason is that each reduce task receives a portion of the key space (and the associated values) according to the Partitioner, and in order to present a complete <key, list(value)> to your reducer it must first collect all the values under that key, whichever map task produced them, and then sort them.
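To make this concrete, here is a minimal sketch (plain Java, not Nutch code) of how Hadoop's default hash partitioning routes keys to reducers - the class and method names below just mirror the idea, they are not the actual Hadoop API:

```java
// Sketch of hash partitioning: every map task routes each (key, value) pair
// to a reduce task based only on the key, so a given key's values always land
// on the same reducer regardless of which map task emitted them. This is why
// a reducer cannot start its final sort/merge until every map has finished.
public class PartitionSketch {
    // Mirrors the usual scheme: non-negative hash modulo the number of reducers.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        // The same key always maps to the same reducer:
        System.out.println(getPartition("http://example.com/", reducers)
                == getPartition("http://example.com/", reducers)); // true
    }
}
```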

The failing tasks probably fail for some other reason, and very likely (based on my experience) the failure is related to particular URLs. E.g. regex URL filtering can choke on pathological URLs, such as URLs 20kB long or containing '\0'. In my experience it's best to keep regex filtering to a minimum if you can, and use the other urlfilters (prefix, domain, suffix, custom) to limit your crawling frontier - there are simply too many ways a regex engine can lock up.
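One cheap defensive measure is to reject pathological URLs before any regex ever sees them. The sketch below is illustrative only (the class name, method and length limit are my own assumptions, not Nutch's API):

```java
// Sketch: a cheap sanity guard placed in front of regex URL filtering.
// Rejects very long URLs and URLs containing control characters ('\0',
// newlines, etc.) so the regex engine never has a chance to lock up on them.
public class UrlGuard {
    static final int MAX_URL_LEN = 2048; // assumption: a sane upper bound

    static boolean looksSane(String url) {
        if (url == null || url.length() > MAX_URL_LEN) return false;
        for (int i = 0; i < url.length(); i++) {
            if (url.charAt(i) < 0x20) return false; // control chars, incl. '\0'
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(looksSane("http://example.com/page")); // true
        System.out.println(looksSane("http://bad.example/\u0000oops")); // false
        StringBuilder sb = new StringBuilder("http://long.example/");
        for (int i = 0; i < 20000; i++) sb.append('a'); // a 20kB monster URL
        System.out.println(looksSane(sb.toString())); // false
    }
}
```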

Please check the logs of the failing tasks. If you see that a task is stalled, you can also log in to the node and generate a thread dump a few times in a row (kill -SIGQUIT <pid>) - if every thread dump shows threads busy in regex processing, then that's likely your problem.

My scenario:
  # Sites: 10,000-30,000 per crawl
  Depth: ~5
  Content: Text is all that I care for. (HTML/RSS/XML)
  Nodes: Amazon EC2 (ugh)
  Storage: I've performed crawls with HDFS and with Amazon S3. I thought
  S3 would be more performant, yet it doesn't appear to affect matters.
  Cost vs Speed: I don't mind throwing EC2 instances at this to get it
  done quickly... But I can't imagine I need much more than 10-20
  mid-size instances for this.

That's correct - with this number of unique sites, the maximum throughput of your crawl will ultimately be limited by the politeness limits (# of requests/site/sec).
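A quick back-of-envelope illustration of that ceiling (the 5-second per-host delay below is an assumption - check your fetcher.server.delay setting, the figures are not from the original message):

```java
// Back-of-envelope politeness ceiling: if each site may be hit at most once
// every `delaySeconds`, then the whole crawl can never exceed
// (#unique sites / delay) pages per second, no matter how many nodes you add.
public class Throughput {
    public static void main(String[] args) {
        int uniqueSites = 10_000;   // lower end of the scenario above
        double delaySeconds = 5.0;  // assumed per-host politeness delay
        double maxPagesPerSec = uniqueSites / delaySeconds;
        System.out.println(maxPagesPerSec); // 2000.0 pages/s, theoretical ceiling
    }
}
```

So adding more EC2 instances past the point where you can sustain that rate buys you nothing.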


Can anyone share their own experiences in the performance they've
seen?

There is a very simple benchmark in trunk/ that you could use to measure the raw performance (data processing throughput) of your EC2 cluster. The real-life performance, though, will depend on many other factors, such as the number of unique sites, their individual speed, and (rarely) the total bandwidth at your end.


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
