Replies/more questions inline.

> I'm using Hadoop 0.23 on 50 machines, each connected with gigabit ethernet 
> and each having solely a single hard disk.  I am getting the following error 
> repeatably for the TeraSort benchmark.  TeraGen runs without error, but 
> TeraSort runs predictably until this error pops up between 64% and 70% 
> completion.  This doesn't occur for every execution of the benchmark, as 
> about one out of four times that I run the benchmark it does run to 
> completion (TeraValidate included).


How many containers are you running per node?


> Error at the CLI:
> "12/06/10 11:17:50 INFO mapreduce.Job:  map 100% reduce 64%
> 12/06/10 11:20:45 INFO mapreduce.Job: Task Id : 
> attempt_1339331790635_0002_m_004337_0, Status : FAILED
> Container killed by the ApplicationMaster.
> 
> Too Many fetch failures.Failing the attempt


Clearly the maps are getting killed because of fetch failures. Can you look at the 
logs of the NodeManager where this particular map task ran? They may contain clues 
about why reducers are unable to fetch the map outputs. Since you have only one 
disk per node, it is possible that some nodes have bad or failing disks, which 
would cause fetch failures.

If that is the case, you can either take those nodes offline or bump up 
mapreduce.reduce.shuffle.maxfetchfailures to tolerate the failures; the default 
is 10. There are some other tweaks I can suggest if you find more details in 
your logs.
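For reference, a sketch of how that property would be raised in mapred-site.xml; the value 30 here is just an illustrative choice, not a recommendation:

```xml
<!-- mapred-site.xml: tolerate more shuffle fetch failures per map
     before the map is declared failed (default is 10). -->
<property>
  <name>mapreduce.reduce.shuffle.maxfetchfailures</name>
  <value>30</value>
</property>
```

Note this only masks the symptom; if a disk is actually failing, decommissioning the node is the better fix.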

HTH,
+Vinod
