On 06/19/12 13:38, Vinod Kumar Vavilapalli wrote:

Replies/more questions inline.


I'm using Hadoop 0.23 on 50 machines, each connected via gigabit Ethernet and
each with a single hard disk.  I am getting the following error repeatably for
the TeraSort benchmark.  TeraGen runs without error, but TeraSort runs
predictably until this error pops up between 64% and 70% completion.  This
doesn't occur on every execution of the benchmark; about one run in four
completes successfully (TeraValidate included).


How many containers are you running per node?

Per my attached config files, I specify yarn.nodemanager.resource.memory-mb = 3072, and the default /seems/ to be 1024 MB for maps and reducers, so I have 3 containers running per node. I have verified in the web UI that this is indeed the case. Three of these 1 GB "slots" in the cluster appear to be occupied by something else during the TeraSort run, so I have TeraGen create 0.5 TB using 441 maps (3 waves * (50 nodes * 3 container slots - 3 occupied slots)), and TeraSort use 147 reducers. This seems to give me the same guarantee I had with Hadoop 1.0 that each node gets an equal number of reducers, so the job doesn't drag on due to straggler reducers.
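For reference, my invocations look roughly like the following; the jar name and HDFS paths here are approximations rather than copied verbatim, and 0.5 TB corresponds to 5,000,000,000 of TeraGen's 100-byte rows:

    # TeraGen: 441 maps writing 5 billion 100-byte rows (~0.5 TB)
    hadoop jar hadoop-mapreduce-examples-0.23.jar teragen \
        -Dmapreduce.job.maps=441 \
        5000000000 /user/ellis/terasort-in

    # TeraSort: 147 reducers, i.e. one reducer per usable container slot
    hadoop jar hadoop-mapreduce-examples-0.23.jar terasort \
        -Dmapreduce.job.reduces=147 \
        /user/ellis/terasort-in /user/ellis/terasort-out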

Clearly maps are getting killed because of fetch failures. Can you look at the
logs of the NodeManager where this particular map task ran? They may have logs
related to why reducers are not able to fetch map outputs. It is possible that,
because you have only one disk per node, some of these nodes have bad or
non-functional disks, thereby causing fetch failures.

I will rerun and report the exact error messages from the NodeManagers. Can you give me more specific advice on collecting logs of this sort? As I mentioned, I'm new to doing this with the new version of Hadoop. I have been looking in /tmp/logs and hadoop/logs, but perhaps there is somewhere else to look as well?
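In case it helps to be concrete, this is roughly what I have been doing on the NodeManager hosts; the directory names below reflect my own layout ("hadoop/logs" for the daemon logs) and are placeholders rather than anything authoritative:

    # 1) NodeManager daemon log on a node that ran one of the flagged maps
    #    (adjust the path to your YARN/Hadoop log dir; mine is hadoop/logs)
    grep -i "fetch\|shuffle" hadoop/logs/yarn-*-nodemanager-*.log

    # 2) Per-container logs (stdout/stderr/syslog) for the application,
    #    under whatever yarn.nodemanager.log-dirs points at in yarn-site.xml
    NM_LOG_DIR=/path/from/yarn.nodemanager.log-dirs   # placeholder
    ls $NM_LOG_DIR/application_*/container_*/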

Last, I am certain this is not related to failing disks, as this exact error occurs much more frequently when I run Hadoop on a NAS box, which is the core of my research at the moment. Nevertheless, I posted to this list instead of Dev because this was on vanilla CentOS 5.5 machines using just the HDDs in each node, which should be a highly typical setup. In particular, I see these errors coming from numerous nodes all at once, and the subset of nodes causing the problems is not the same from one run to the next, though the resulting error is.

If that is the case, you can either take these nodes offline or bump up
mapreduce.reduce.shuffle.maxfetchfailures to tolerate these failures; the
default is 10. There are some other tweaks I can suggest once you find more
details in your logs.

I'd prefer not to bump up maxfetchfailures, and would rather simply fix the issue that is causing the fetches to fail in the first place. This isn't a large cluster (only 50 nodes), and neither the links (1 GbE) nor the storage (a single SATA drive per node) is unusual relative to a normal installation. I have to assume here that I've mis-configured something :(.
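If I do end up trying it, I assume it would just be a per-job override along the lines of the sketch below (the jar name, paths, and the value 30 are placeholders of mine, and I haven't verified that 0.23 honors this as a -D job property):

    # Sketch: tolerate more fetch failures per map before it is re-run
    # (default is 10 per the note above; 30 is just an example value)
    hadoop jar hadoop-mapreduce-examples-0.23.jar terasort \
        -Dmapreduce.reduce.shuffle.maxfetchfailures=30 \
        /user/ellis/terasort-in /user/ellis/terasort-out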

Best,

ellis
