Dennis Kubes wrote:
> So we moved 50 machines to a data center for a beta cluster of a new 
> search engine based on Nutch and Hadoop.  We fired all of the machines 
> up and started fetching and almost immediately started experiencing JVM 
> crashes and checksum/IO errors which would cause jobs to fail, tasks to 
> fail, and random data corruption.  After digging through and fixing the 
> problems we have come up with some observations that may seem obvious 
> but may also help someone else avoid the same problems.

[..]

Thanks Dennis for sharing this - it's very useful.

I could also add the following from my experience: for medium- to 
large-scale crawling, i.e. on the order of 20-100 million pages, be 
prepared to address the following issues:

* take a crash course in advanced DNS setup ;) I found that the 
bottleneck often lies in DNS and not just in raw bandwidth limits. If 
your fetchlist consists of many unique hosts, then Nutch will fire 
thousands of DNS requests per second. Using an ordinary setup, i.e. 
without caching, is pointless (most of the time the lookups will time 
out) and harmful to the target DNS servers. You have to use a caching 
DNS resolver - I have had good experience with djbdns (its dnscache 
component), but it also requires careful tuning of the maximum number 
of simultaneous requests, the cache size, ignoring too-short TTLs, 
etc. The JVM keeps its own lookup cache as well - see the first 
sketch below the list.

* check your network infrastructure. I have had a few cases of 
clusters giving sub-standard performance where it turned out that 
e.g. the cables were flaky. In most cases, though, the culprit is the 
network equipment such as switches and routers - check their CPU 
usage and the number of dropped packets. On some entry-level switches 
and routers, even though the interfaces nominally support gigabit 
speeds, the switching fabric and/or CPU cannot sustain high packet 
rates - they peg at 100% CPU, and even if they don't report any lost 
packets, a 'ping -f' shows they can't handle the load (a crude 
JVM-based probe is sketched below the list).

* check OS-level resource limits (ulimit -a on POSIX systems). In one 
installation we were experiencing weird crashes and finally 
discovered that datanodes and tasktrackers were hitting the OS-wide 
limit on open file handles. In another installation the OS-wide 
limits were fine, but the limits on the particular account running 
the daemons were insufficient (the last sketch below shows a quick 
in-process check).
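
On the DNS point: besides the external caching resolver, the JVM 
itself caches InetAddress lookups, and the fetcher threads resolve 
hosts through InetAddress. Here is a minimal sketch of tuning and 
checking that cache - the property names are the standard java.net 
ones, but the TTL values are made-up examples, not recommendations:

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;

/**
 * Minimal sketch: tune the JVM-side DNS cache that sits in front of
 * the external caching resolver, then time a few lookups.  The TTL
 * values below are arbitrary examples.
 */
public class DnsCacheCheck {

    public static void main(String[] args) throws Exception {
        // Cache successful lookups for 5 minutes inside the JVM
        // (the default behaviour varies between JVM versions).
        Security.setProperty("networkaddress.cache.ttl", "300");
        // Don't re-query the resolver for hosts that just failed.
        Security.setProperty("networkaddress.cache.negative.ttl", "60");

        for (String host : args) {
            long start = System.nanoTime();
            try {
                InetAddress addr = InetAddress.getByName(host);
                long micros = (System.nanoTime() - start) / 1000;
                System.out.println(host + " -> " + addr.getHostAddress()
                        + " (" + micros + " us)");
            } catch (UnknownHostException e) {
                System.out.println(host + " -> lookup failed");
            }
        }
    }
}

Repeating a host on the command line shows the effect: a second 
lookup within the TTL comes back in a few microseconds without 
touching the resolver at all.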
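
On the network point, 'ping -f' needs root; if all you have on a node 
is a JVM, a very crude substitute is to fire a burst of reachability 
probes and watch the loss rate and latency. Treat the numbers as a 
rough indicator only - without raw-socket privileges isReachable() 
may fall back to a TCP probe - and the host, probe count and timeout 
below are arbitrary examples:

import java.net.InetAddress;

/**
 * Crude burst probe: send a number of reachability checks to a peer
 * node and report the loss rate and average latency.  Without raw
 * socket privileges isReachable() may use a TCP probe instead of
 * ICMP, so this is only a rough indicator.
 */
public class BurstProbe {

    public static void main(String[] args) throws Exception {
        // Hypothetical peer address - pass a real node as the argument.
        String host = args.length > 0 ? args[0] : "10.0.0.2";
        int probes = 1000;     // arbitrary example
        int timeoutMs = 200;   // arbitrary example

        InetAddress addr = InetAddress.getByName(host);
        int lost = 0;
        long totalMicros = 0;

        for (int i = 0; i < probes; i++) {
            long start = System.nanoTime();
            if (addr.isReachable(timeoutMs)) {
                totalMicros += (System.nanoTime() - start) / 1000;
            } else {
                lost++;
            }
        }

        int answered = probes - lost;
        System.out.println(host + ": " + lost + "/" + probes
                + " probes lost"
                + (answered > 0
                        ? ", avg " + (totalMicros / answered) + " us"
                        : ""));
    }
}

Running it from several nodes at once can be telling: if the loss 
rate climbs as soon as a few nodes probe through the same switch, 
that is consistent with the fabric or switch CPU being the limit 
rather than the cables.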
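
And on the resource limits: it can help to log, from inside the 
datanode or tasktracker JVM, how close the process is to its 
file-handle limit. The sketch below relies on the com.sun.management 
extension, so it only works on a Sun JVM on Unix, and the 80% warning 
threshold is just an arbitrary example:

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

import com.sun.management.UnixOperatingSystemMXBean;

/**
 * Sketch: report how many file descriptors the current JVM has open
 * versus its per-process limit (the one 'ulimit -n' controls).
 */
public class FdLimitCheck {

    public static void main(String[] args) {
        OperatingSystemMXBean os =
                ManagementFactory.getOperatingSystemMXBean();
        if (!(os instanceof UnixOperatingSystemMXBean)) {
            System.out.println("file descriptor counts not available"
                    + " on this JVM");
            return;
        }
        UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
        long open = unix.getOpenFileDescriptorCount();
        long max = unix.getMaxFileDescriptorCount();
        System.out.println("open file descriptors: " + open + " / " + max);
        if (open > max * 0.8) {   // arbitrary example threshold
            System.out.println("WARNING: above 80% of the per-process"
                    + " limit - raise it with ulimit -n / limits.conf");
        }
    }
}

The same two calls can be dropped into a periodic log statement in 
the daemon itself if you suspect the limits are being approached.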


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

