Replies/more questions inline.
> I'm using Hadoop 0.23 on 50 machines, each connected with gigabit ethernet
> and each having solely a single hard disk. I am getting the following error
> repeatably for the TeraSort benchmark. TeraGen runs without error, but
> TeraSort runs predictably until this error pops up between 64% and 70%
> completion. This doesn't occur for every execution of the benchmark, as
> about one out of four times that I run the benchmark it does run to
> completion (TeraValidate included).

How many containers are you running per node?

> Error at the CLI:
> "12/06/10 11:17:50 INFO mapreduce.Job: map 100% reduce 64%
> 12/06/10 11:20:45 INFO mapreduce.Job: Task Id :
> attempt_1339331790635_0002_m_004337_0, Status : FAILED
> Container killed by the ApplicationMaster.
>
> Too Many fetch failures.Failing the attempt

Clearly, maps are getting killed because of fetch failures. Can you look at the logs of the NodeManager where this particular map task ran? They may contain information about why reducers are not able to fetch map-outputs.

It is possible that, because you have only one disk per node, some of these nodes have bad or malfunctioning disks, thereby causing the fetch failures. If that is the case, you can either take those nodes offline or bump up mapreduce.reduce.shuffle.maxfetchfailures to tolerate the failures; the default is 10.

There are some other tweaks I can suggest if you can find more details in your logs.

HTH,
+Vinod
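If you go the configuration route, the property would be raised in mapred-site.xml along these lines (a minimal sketch; the value 50 is just an illustrative number, not a recommendation, so tune it for your cluster):

```xml
<!-- mapred-site.xml -->
<property>
  <!-- Number of fetch failures tolerated for a map's output before the
       map task is declared failed and re-run elsewhere (default: 10). -->
  <name>mapreduce.reduce.shuffle.maxfetchfailures</name>
  <value>50</value>
</property>
```

Note this only masks the symptom: if a disk really is going bad, decommissioning the node is the cleaner fix.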