Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>> So we moved 50 machines to a data center for a beta cluster of a new 
>> search engine based on Nutch and Hadoop.  We fired all of the machines 
>> up and started fetching and almost immediately started experiencing 
>> JVM crashes and checksum/IO errors which would cause jobs to fail, 
>> tasks to fail, and random data corruption.  After digging through and 
>> fixing the problems we have come up with some observations that may 
>> seem obvious but may also help someone else avoid the same problems.
> 
> [..]
> 
> Thanks Dennis for sharing this - it's very useful.
> 
> I could add also the following from my experience: for medium-large 
> scale crawling, i.e. in the order of 20-100 mln pages, be prepared to 
> address the following issues:
> 
> * take a crash course in advanced DNS setup ;) I found that often the 
> bottleneck lies in DNS and not just the raw bandwidth limits. If your 
> fetchlist consists of many unique hosts, then Nutch will fire thousands 
> of DNS requests per second. Using just an ordinary setup, i.e. without 
> caching, is pointless (most of the time the lookups will time out) and 
> harmful to the target DNS servers. You have to use a caching DNS - I 
> have good experiences with djbdns / tinydns, but they also require 
> careful tuning of max. number of requests, cache size, ignoring too 
> short TTLs, etc.

I completely agree, although we use BIND.  DNS issues were one of the 
first things that came up when we started using Nutch and Hadoop over a 
year ago.  I remember that you pointed us toward caching DNS servers on 
the local machines at that time, and that has made all of the 
difference.  Originally we were using a single DNS server in the domain, 
and by running large fetches (many fetchers at the same time) we were 
effectively launching a DoS attack on our own server.  The server's 
memory couldn't handle the load, so the entire fetch slowed down and 
started throwing errors.
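
For anyone setting this up, a caching-only BIND config is only a handful 
of lines.  The sketch below is just an illustration, not our exact 
config - the directory path is distro-dependent and the forwarder 
addresses are simply the public caches mentioned further down:

// caching-only named.conf sketch (BIND 9); paths and addresses are examples
options {
        directory "/var/named";
        listen-on { 127.0.0.1; };       // serve only the local fetcher node
        allow-query { localhost; };
        recursion yes;                  // resolve and cache for local clients
        forwarders {                    // optionally ask a large public cache first
                208.67.222.222;
                208.67.220.220;
        };
        forward first;                  // fall back to full recursion if they fail
};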

I will add one point here: while we run caching servers on each machine, 
we also use large DNS caches, such as OpenDNS and Verizon, as our 
upstream lookup nameservers.  The idea is that if we don't have a record 
cached locally, one of the large caches probably will, and it is better 
to check them before going directly to the global nameservers.  A large 
cache takes one hop while the global nameservers take two.  Here is what 
our resolv.conf looks like.  The 208.x servers are OpenDNS and the 4.x 
servers are Verizon.  Note that both of these caches are open to 
requests from anywhere, so anybody should be able to use them.

nameserver 127.0.0.1
nameserver 208.67.222.222
nameserver 208.67.220.220
nameserver 4.2.2.1
nameserver 4.2.2.2
nameserver 4.2.2.3
nameserver 4.2.2.4
nameserver 4.2.2.5
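
A quick way to confirm the local cache is actually doing the work is to 
query it directly with dig and watch the query time collapse on a 
repeated lookup (the output below is only illustrative, and the exact 
format varies by dig version):

dig @127.0.0.1 www.example.com | grep "Query time"
;; Query time: 143 msec    <- first lookup, cache miss
dig @127.0.0.1 www.example.com | grep "Query time"
;; Query time: 0 msec      <- answered from the local cache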

> 
> * check your network infrastructure. I had a few cases of clusters that 
> were giving sub-standard performance, only to find that e.g. cables were 
> flaky. In most cases though it's the network equipment such as switches 
> and routers - check their CPU usage, and the number of dropped packets. 
> Some entry-level switches and routers, even though their interfaces 
> nominally support gigabit speeds, have switching fabrics and/or CPUs 
> that can't sustain high packet rates - so they peg at 100% CPU, and 
> even if they don't show any lost packets, a 'ping -f' shows they can't 
> handle the load.
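
On the dropped packets point, a rough way to check from the hosts 
themselves (assuming a typical Linux box with ethtool installed; the 
node name below is made up) is to flood-ping a neighbor and then look at 
the interface error counters on both ends:

# flood ping a neighboring node (needs root); watch for missing dots / high loss
ping -f -c 10000 node22

# per-interface error and drop counters; steadily rising numbers point at a
# bad cable, NIC, or switch port (counter names vary by driver)
ethtool -S eth0 | grep -i -E 'err|drop'
cat /proc/net/dev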

Cables, what can I say about cables.  We bought Cat6 cables that, when 
you wiggle them (or they get moved around), reset the network card.  I 
would never have believed that was possible.  Changing the cables to 
Cat5e fixed the problem.  Weird.

There is a company called TRENDnet that sells 24-port gigabit switches 
for around US$300, so we recently moved to gigabit switches since all of 
our network cards are gigabit.  There was actually a problem with the 
EEPROM on Intel e1000 network cards that was causing connections to drop 
at gigabit speeds but not at 100Mb speeds.  There is a script to fix 
that, and since we applied it the transfer rate at gigabit has been 
excellent.  We are able to sustain over 50MB/s on direct file transfers, 
which I think is pretty much the hard disk limit.  While it may take 
some time to get going, I fully recommend gigabit infrastructure.
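
If you want to sanity-check the link itself rather than disk-bound file 
copies, something like iperf takes the disks out of the equation 
(hostnames are placeholders; a healthy gigabit link should report 
somewhere in the 900 Mbits/sec range):

# on one node
iperf -s

# on another node, run a 30 second test against it
iperf -c node01 -t 30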

> 
> * check OS level resource limits (ulimit -a on POSIX systems). In one 
> installation we were experiencing weird crashes and finally discovered 
> that datanodes and tasktrackers were hitting OS-wide limits of open file 
> handles. In another installation the OS-wide limits were ok, but the 
> limits on this particular account were insufficient.
> 
> 
/var/log/messages is your friend ;).  I think many people getting into 
search engines don't realize that it is as much about hardware and 
systems knowledge as it is about software.
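
On the ulimit point above, checking and raising the open file limit is 
quick.  The account name and numbers below are only examples - the right 
values depend on how many blocks and connections your datanodes and 
tasktrackers handle:

# per-process limit for the account running hadoop
ulimit -n

# system-wide ceiling
cat /proc/sys/fs/file-max

# raise the per-user limits in /etc/security/limits.conf (log back in afterwards)
hadoop   soft   nofile   16384
hadoop   hard   nofile   32768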

Dennis Kubes
