My cluster has 27 nodes with a total reduce task capacity of 54. The job had 31 reducers. I actually had a task today that showed the behavior you're describing: 3 tries on one machine, and then the 4th on a different one.
As for the particular job I was talking about before, here are the stats:

Kind     Total Tasks (succ+failed+killed)  Successful  Failed  Killed  Start Time           Finish Time
Setup    1                                 1           0       0       4-Apr-2009 00:30:16  4-Apr-2009 00:30:33 (17sec)
Map      64                                49          12      3       4-Apr-2009 00:30:33  4-Apr-2009 01:11:15 (40mins, 41sec)
Reduce   34                                30          4       0       4-Apr-2009 00:30:44  4-Apr-2009 04:31:36 (4hrs, 52sec)
Cleanup  4                                 0           4       0       4-Apr-2009 04:31:36  4-Apr-2009 06:32:00 (2hrs, 24sec)

I'm not sure what to look for in the jobtracker log. All it shows for that particular failed task is that it was assigned to the same machine 4 times and then eventually failed. Perhaps something to note is that all 4 failures were due to timeouts:

"Task attempt_200904031942_0002_r_000013_3 failed to report status for 1802 seconds. Killing!"

Also, looking at the logs, there was a map task that was likewise retried on that particular box 4 times without ever going to a different one. Perhaps it had something to do with the way this machine failed: the jobtracker still considered it live, while all actual tasks assigned to it timed out.

-- Stefan

> From: Amar Kamat <ama...@yahoo-inc.com>
> Reply-To: <core-user@hadoop.apache.org>
> Date: Tue, 07 Apr 2009 10:05:16 +0530
> To: <core-user@hadoop.apache.org>
> Subject: Re: Reduce task attempt retry strategy
>
> Stefan Will wrote:
>> Hi,
>>
>> I had a flaky machine the other day that was still accepting jobs and
>> sending heartbeats, but caused all reduce task attempts to fail. This in
>> turn caused the whole job to fail because the same reduce task was
>> retried 3 times on that particular machine.
>>
> What is your cluster size? If a task fails on a machine, then it is
> re-tried on some other machine (based on the number of good machines left
> in the cluster). After a certain number of failures, the machine will be
> blacklisted (again based on the number of machines left in the cluster).
> 3 different reducers might be scheduled on that machine, but that should
> not lead to job failure. Can you explain in detail what exactly happened?
> Find out where the attempts got scheduled from the jobtracker's log.
> Amar
>
>> Perhaps I'm confusing this with the block placement strategy in HDFS,
>> but I always thought that the framework would retry tasks on a different
>> machine if retries on the original machine keep failing. E.g. I would
>> have expected it to retry once or twice on the same machine, but then
>> switch to a different one to minimize the likelihood of getting stuck on
>> a bad machine.
>>
>> What is the expected behavior in 0.19.1 (which I'm running)? Any plans
>> for improving on this in the future?
>>
>> Thanks,
>> Stefan
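On the timeouts themselves: the "failed to report status for 1802 seconds" message corresponds to the mapred.task.timeout property (milliseconds), which controls how long a task may go without reporting progress before the framework kills the attempt. A sketch, assuming the cluster config (hadoop-site.xml in 0.19) raises it to 30 minutes, which would line up with the ~1800 seconds seen in the log:

```
<!-- hadoop-site.xml: kill a task attempt that reports no status for
     this many milliseconds; 1800000 ms = 30 minutes -->
<property>
  <name>mapred.task.timeout</name>
  <value>1800000</value>
</property>
```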
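As a footnote, the retry/blacklist policy Amar describes can be illustrated with a small sketch. This is not Hadoop's actual scheduler code; the threshold names and values below are hypothetical, simplified stand-ins for the framework's per-task attempt limit and per-node failure threshold:

```python
# Simplified illustration (NOT Hadoop's actual scheduler) of the policy
# described above: prefer a node the task has not already failed on, and
# stop scheduling onto a node once it accumulates too many task failures.

BLACKLIST_THRESHOLD = 3  # hypothetical per-job failure threshold per node

def pick_node(task_failed_on, node_failures, nodes):
    """Choose a node for the next attempt of a task.

    task_failed_on: set of nodes this task has already failed on
    node_failures:  dict node -> total task failures on that node (this job)
    nodes:          all nodes currently sending heartbeats
    """
    blacklisted = {n for n, c in node_failures.items()
                   if c >= BLACKLIST_THRESHOLD}
    candidates = [n for n in nodes if n not in blacklisted]
    # Prefer a node the task has not failed on; fall back to any live node.
    fresh = [n for n in candidates if n not in task_failed_on]
    pool = fresh or candidates
    return pool[0] if pool else None

# node1 has hit the blacklist threshold, and the task already failed there,
# so the next attempt goes elsewhere.
nodes = ["node1", "node2", "node3"]
print(pick_node({"node1"}, {"node1": 3}, nodes))  # -> node2
```

The failure mode in this thread would correspond to a scheduler that skips the "prefer a fresh node" step while the flaky node keeps heartbeating, so it never gets blacklisted and keeps receiving the same attempt.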