Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>> So we moved 50 machines to a data center for a beta cluster of a new
>> search engine based on Nutch and Hadoop. We fired all of the machines
>> up and started fetching, and almost immediately started experiencing
>> JVM crashes and checksum/IO errors which would cause jobs to fail,
>> tasks to fail, and random data corruption. After digging through and
>> fixing the problems we have come up with some observations that may
>> seem obvious but may also help someone else avoid the same problems.
>
> [..]
>
> Thanks Dennis for sharing this - it's very useful.
>
> I could also add the following from my experience: for medium-to-large
> scale crawling, i.e. on the order of 20-100 mln pages, be prepared to
> address the following issues:
>
> * Take a crash course in advanced DNS setup ;) I found that often the
> bottleneck lies in DNS and not just the raw bandwidth limits. If your
> fetchlist consists of many unique hosts, then Nutch will fire thousands
> of DNS requests per second. Using just an ordinary setup, i.e. without
> caching, is pointless (most of the time the lookups will time out) and
> harmful to the target DNS servers. You have to use a caching DNS - I
> have good experiences with djbdns / tinydns, but they also require
> careful tuning of the max. number of requests, cache size, ignoring
> too-short TTLs, etc.
I completely agree, although we use BIND. DNS issues were one of the
first things that came up when we started using Nutch and Hadoop over a
year ago. I remember that you pointed us toward caching DNS servers on
the local machines at that time, and that has made all of the
difference. Originally we were using a single DNS server for the
domain, and by running large fetches (many fetchers at the same time)
we were effectively mounting a DoS attack on our own server. The memory
on the server couldn't handle it, so the entire fetch was slowing down
and erroring out.

I will add one point here: while we run caching servers on each
machine, we also use large public DNS caches as our upstream
nameservers, such as OpenDNS and Verizon. The idea is that if we don't
have a record cached locally, one of the large caches probably will,
and it is better to check them before going directly to the global
nameservers. A large cache costs one hop while the global nameservers
cost two. Here is what our resolv.conf looks like. The 208.67.x servers
are OpenDNS and the 4.x servers are Verizon. Note that both of these
caches are open to requests from anywhere, so anybody should be able to
use them. (There is also a rough sketch of the local caching-server
config at the bottom of this message.)

nameserver 127.0.0.1
nameserver 208.67.222.222
nameserver 208.67.220.220
nameserver 4.2.2.1
nameserver 4.2.2.2
nameserver 4.2.2.3
nameserver 4.2.2.4
nameserver 4.2.2.5

> * check your network infrastructure. I had a few cases of clusters
> that were giving sub-standard performance, only to find that e.g.
> cables were flaky. In most cases though it's the network equipment
> such as switches and routers - check their CPU usage, and the number
> of dropped packets. On some entry-level switches and routers, even
> though the interfaces nominally support gigabit speeds, the switching
> fabric and/or CPU don't support high packet rates - so they will peg
> at 100% CPU, and even if they don't show any lost packets, a
> 'ping -f' shows they can't handle the load.

Cables - what can I say about cables. We bought Cat6 cables that, when
you wiggle them (or they get moved around), decide to reset the network
card. I would never have believed that was possible. Changing the
cables to Cat5e fixed the problem. Weird.

There is a company called TRENDnet that sells 24-port gigabit switches
for around US$300, so we recently switched to gigabit switches, as all
of our network cards are gigabit. There was actually a problem with the
EEPROM on Intel e1000 network cards that was causing connections to
just drop at gigabit speeds but not at 100 Mb/s; there is a script to
fix that, and since we applied it the gigabit connection rate has been
excellent. We are able to sustain over 50 MB/s on direct file
transfers, which I think is pretty much the hard disk limit. While it
may take some time to get going, I fully recommend gigabit
infrastructure. (A couple of the quick link checks we use are sketched
at the bottom of this message as well.)

> * check OS level resource limits (ulimit -a on POSIX systems). In one
> installation we were experiencing weird crashes and finally discovered
> that datanodes and tasktrackers were hitting OS-wide limits on open
> file handles. In another installation the OS-wide limits were ok, but
> the limits on this particular account were insufficient.
>
> /var/log/messages is your friend ;).

I think many people don't realize when getting into search engines that
it is as much about hardware and systems knowledge as it is about
software.

Dennis Kubes
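P.S. For the local caching piece, here is roughly what a minimal
caching-only BIND setup looks like. This is a sketch, not our exact
production config - the ACL range, forwarder addresses and cache size
below are just example values, so adjust them for your own network and
distro:

// /etc/named.conf (caching-only resolver on each node)
options {
        directory "/var/named";
        recursion yes;
        // only answer queries from this box and the cluster subnet
        allow-query { 127.0.0.1; 192.168.0.0/16; };
        // on a cache miss, try the big public caches before the roots
        forward first;
        forwarders { 208.67.222.222; 208.67.220.220; };
        // keep plenty of records around between fetch cycles
        max-cache-size 256M;
};

With something like that running on every node, the 127.0.0.1 entry at
the top of resolv.conf handles most lookups locally and the upstream
caches only ever see the misses.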
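For the network side, these are the kind of quick checks Andrzej is
describing. The hostname and interface (node02, eth0) are placeholders,
ping -f needs root, and the exact counter names ethtool reports vary by
driver, so treat this as a rough guide rather than a recipe:

# flood-ping another node through the switch; rising packet loss or a
# visibly slowing dot display means the switch or NIC can't keep up
ping -f -c 10000 node02

# look for errors and drops on the NIC itself
ifconfig eth0 | grep -i errors
ethtool -S eth0 | grep -iE 'err|drop'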
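And for the resource limits, a few things worth checking on each node
(here "hadoop" is just a stand-in for whichever account runs your
datanodes and tasktrackers):

# per-account limits, run as the account that runs the daemons
ulimit -a

# file handles that account actually has open right now
lsof -u hadoop | wc -l

# system-wide ceiling and current usage
cat /proc/sys/fs/file-max
cat /proc/sys/fs/file-nr

# to raise the per-account limit, add something like this to
# /etc/security/limits.conf and log in again:
#   hadoop  soft  nofile  32768
#   hadoop  hard  nofile  65536

If /var/log/messages is filling up with "Too many open files", one of
those two limits is almost always the culprit.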
