My cluster has 27 nodes with a total reduce task capacity of 54. The job had
31 reducers. I actually had a task today that showed the behavior you're
describing: 3 tries on one machine, and then the 4th on a different one.
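For what it's worth, that "3 tries, then a 4th" pattern matches the per-task
attempt ceiling, which defaults to 4. A minimal sketch of how it can be tuned
per job, assuming the 0.19-era JobConf API (MyJob is just a placeholder
driver class):

    import org.apache.hadoop.mapred.JobConf;

    public class MyJob {
      public static void main(String[] args) {
        JobConf conf = new JobConf(MyJob.class);
        // Attempts per task, counted across all machines, before the task
        // (and hence the job) is declared failed. Default is 4.
        conf.setMaxMapAttempts(4);      // mapred.map.max.attempts
        conf.setMaxReduceAttempts(4);   // mapred.reduce.max.attempts
      }
    }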

As for the particular job I was talking about before, here are its stats:

Kind     Total*  Successful  Failed  Killed  Start Time           Finish Time
Setup         1           1       0       0  4-Apr-2009 00:30:16  4-Apr-2009 00:30:33 (17sec)
Map          64          49      12       3  4-Apr-2009 00:30:33  4-Apr-2009 01:11:15 (40mins, 41sec)
Reduce       34          30       4       0  4-Apr-2009 00:30:44  4-Apr-2009 04:31:36 (4hrs, 52sec)
Cleanup       4           0       4       0  4-Apr-2009 04:31:36  4-Apr-2009 06:32:00 (2hrs, 24sec)

*Total = successful + failed + killed


I'm not sure what to look for in the jobtracker log. All it shows for that
particular failed task is that it was assigned to the same machine 4 times
and then eventually failed. One thing worth noting is that all 4 failures
were due to timeouts:

"Task attempt_200904031942_0002_r_000013_3 failed to report status for 1802
seconds. Killing!"

Also, looking at the logs, there was a map task that was likewise retried on
that particular box 4 times without ever moving to a different one. Perhaps it
had something to do with the way this machine failed: the jobtracker still
considered it live, while every task actually assigned to it timed out.
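
If that diagnosis is right, the per-job blacklisting knob would be the
relevant defense. A hedged sketch, again assuming the 0.19 JobConf API (the
timeout value is only a guess lining up with the ~1802-second kills above):

    import org.apache.hadoop.mapred.JobConf;

    public class BlacklistTuning {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // After this many failures of this job's tasks on a single
        // tasktracker, the job stops scheduling attempts there
        // (mapred.max.tracker.failures; default 4).
        conf.setMaxTaskFailuresPerTracker(2);
        // Liveness timeout as a plain property; 1800000 ms would match the
        // ~1802-second kills quoted above (the default is 600000 ms).
        conf.set("mapred.task.timeout", "1800000");
      }
    }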

-- Stefan



> From: Amar Kamat <ama...@yahoo-inc.com>
> Reply-To: <core-user@hadoop.apache.org>
> Date: Tue, 07 Apr 2009 10:05:16 +0530
> To: <core-user@hadoop.apache.org>
> Subject: Re: Reduce task attempt retry strategy
> 
> Stefan Will wrote:
>> Hi,
>> 
>> I had a flaky machine the other day that was still accepting jobs and
>> sending heartbeats, but caused all reduce task attempts to fail. This in
>> turn caused the whole job to fail because the same reduce task was retried 3
>> times on that particular machine.
>>   
> What is your cluster size? If a task fails on a machine, it is
> retried on some other machine (depending on the number of good machines
> left in the cluster). After a certain number of failures, the machine will
> be blacklisted (again depending on the number of machines left in the
> cluster). 3 different reducers might be scheduled on that machine, but that
> should not lead to job failure. Can you explain in detail what exactly
> happened? Find out where the attempts got scheduled from the
> jobtracker's log.
> Amar
>> Perhaps I'm confusing this with the block placement strategy in HDFS, but I
>> always thought that the framework would retry jobs on a different machine if
>> retries on the original machine keep failing. E.g. I would have expected to
>> retry once or twice on the same machine, but then switch to a different one
>> to minimize the likelihood of getting stuck on a bad machine.
>> 
>> What is the expected behavior in 0.19.1 (which I'm running)? Any plans for
>> improving on this in the future?
>> 
>> Thanks,
>> Stefan
>> 
>>   

