Dennis Kubes wrote:
> So we moved 50 machines to a data center for a beta cluster of a new
> search engine based on Nutch and Hadoop. We fired all of the machines
> up and started fetching and almost immediately started experiencing JVM
> crashes and checksum/IO errors which would cause jobs to fail, tasks to
> fail, and random data corruption. After digging through and fixing the
> problems we have come up with some observations that may seem obvious
> but may also help someone else avoid the same problems.
[..]

Thanks Dennis for sharing this - it's very useful. I could also add the
following from my experience: for medium-to-large scale crawling, i.e. on
the order of 20-100 million pages, be prepared to address the following
issues:

* Take a crash course in advanced DNS setup ;) I found that the bottleneck
often lies in DNS and not in the raw bandwidth limits. If your fetchlist
consists of many unique hosts, Nutch will fire thousands of DNS requests
per second. Using just an ordinary setup, i.e. without caching, is
pointless (most of the time the lookups will time out) and harmful to the
target DNS servers. You have to use a caching DNS server - I have good
experience with djbdns / tinydns, but it also requires careful tuning of
the maximum number of requests, the cache size, the handling of too-short
TTLs, and so on. A quick lookup-latency probe is sketched after this list.

* Check your network infrastructure. I had a few cases of clusters giving
sub-standard performance, only to find that e.g. the cables were flaky. In
most cases, though, the culprit is the network equipment such as switches
and routers - check their CPU usage and the number of dropped packets. On
some entry-level switches and routers the interfaces nominally support
gigabit speeds, but the switching fabric and/or CPU cannot sustain high
packet rates - so they peg at 100% CPU, and even if they don't report any
lost packets, a 'ping -f' shows they can't handle the load.

* Check OS-level resource limits ('ulimit -a' on POSIX systems). In one
installation we were experiencing weird crashes and finally discovered
that the datanodes and tasktrackers were hitting the OS-wide limit on open
file handles. In another installation the OS-wide limits were fine, but
the limits for that particular account were insufficient. A small check of
how close a daemon is to its file descriptor limit is sketched below.
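To get a feel for whether DNS is the bottleneck, it can help to measure raw
lookup latency from a fetcher node before and after putting a caching
resolver in place. Below is a minimal sketch in Java - the class name and
host list are made up for illustration; the 'networkaddress.cache.ttl'
security property it sets controls the JVM's own positive-lookup cache,
which sits on top of whatever caching DNS server you run:

  import java.net.InetAddress;
  import java.net.UnknownHostException;
  import java.security.Security;

  public class DnsProbe {
      public static void main(String[] args) throws Exception {
          // Keep successful lookups in the JVM's own cache for an hour,
          // in addition to any caching resolver the OS points at.
          Security.setProperty("networkaddress.cache.ttl", "3600");

          // Hypothetical sample of hosts taken from a fetchlist.
          String[] hosts = { "example.org", "example.net", "example.com" };

          for (String host : hosts) {
              long start = System.nanoTime();
              try {
                  InetAddress addr = InetAddress.getByName(host);
                  long ms = (System.nanoTime() - start) / 1000000L;
                  System.out.println(host + " -> " + addr.getHostAddress()
                          + " in " + ms + " ms");
              } catch (UnknownHostException e) {
                  long ms = (System.nanoTime() - start) / 1000000L;
                  System.out.println(host + " FAILED after " + ms + " ms");
              }
          }
      }
  }

If uncached lookups routinely take hundreds of milliseconds or time out,
fix the resolver first; a second run against the same hosts should then
show near-zero latencies for the cached entries.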
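For the file handle problem, it also helps to log how close a long-running
datanode or tasktracker JVM gets to its per-process ceiling. A minimal
sketch, assuming a Sun/Unix JVM that exposes
com.sun.management.UnixOperatingSystemMXBean (the class name and the 90%
warning threshold below are my own choices, not anything Nutch or Hadoop
ship with):

  import java.lang.management.ManagementFactory;
  import java.lang.management.OperatingSystemMXBean;

  public class FdWatch {
      public static void main(String[] args) {
          OperatingSystemMXBean os =
                  ManagementFactory.getOperatingSystemMXBean();
          if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
              com.sun.management.UnixOperatingSystemMXBean unixOs =
                      (com.sun.management.UnixOperatingSystemMXBean) os;
              long open = unixOs.getOpenFileDescriptorCount();
              long max = unixOs.getMaxFileDescriptorCount();
              System.out.println("open file descriptors: " + open + " / " + max);
              if (open > 0.9 * max) {
                  System.out.println("WARNING: within 10% of the per-process"
                          + " fd limit - raise 'ulimit -n' for this account");
              }
          } else {
              System.out.println("fd counts not available on this JVM/OS");
          }
      }
  }

The maximum reported here is the per-process limit ('ulimit -n') for the
account the daemon runs under; the OS-wide ceiling (e.g. fs.file-max on
Linux) has to be checked separately.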
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web / Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com