Currently, only failed tasks are attempted on a node other than the one
where they failed. For killed tasks, there is no such retry policy.
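
If you want retries to move off a bad node sooner, the job-level knobs look
roughly like this (a minimal sketch against the 0.19 mapred API; the class
name MyJob is made up and the values are only illustrative):

    import org.apache.hadoop.mapred.JobConf;

    JobConf conf = new JobConf(MyJob.class);

    // Attempts allowed before a map/reduce task is declared failed
    // (mapred.map.max.attempts / mapred.reduce.max.attempts, default 4).
    conf.setMaxMapAttempts(4);
    conf.setMaxReduceAttempts(4);

    // Task failures from this job on a single tracker before that tracker
    // is blacklisted for the job (mapred.max.tracker.failures, default 4).
    // Lowering it pushes later attempts onto other nodes sooner.
    conf.setMaxTaskFailuresPerTracker(2);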

"failed to report status" usually indicates that the task did not report
sufficient progress. However, it is possible that the task itself was not
progressing fast enough because the machine where it ran had problems.
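
If the reduce work is legitimately slow (large merges, external lookups, and
so on), the usual fix is to call progress on the Reporter so the tracker
knows the attempt is alive; any report resets the task timeout
(mapred.task.timeout, 10 minutes by default; the "1802 seconds" in the log
quoted below suggests that cluster was running with roughly a 30-minute
timeout). Here is a minimal sketch against the 0.19 mapred API; the class
and key/value types are made up for illustration:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class SlowReducer extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {

      public void reduce(Text key, Iterator<LongWritable> values,
                         OutputCollector<Text, LongWritable> output,
                         Reporter reporter) throws IOException {
        long sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
          // Tell the framework we are still alive; this resets the
          // mapred.task.timeout clock for the attempt.
          reporter.progress();
        }
        reporter.setStatus("finished key " + key);
        output.collect(key, new LongWritable(sum));
      }
    }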


On 4/8/09 12:33 AM, "Stefan Will" <stefan.w...@gmx.net> wrote:

> My cluster has 27 nodes with a total reduce task capacity of 54. The job had
> 31 reducers. I actually had a task today that showed the behavior you're
> describing: 3 tries on one machine, and then the 4th on a different one.
> 
> As for the particular job I was talking about before:
> 
> Here are the stats for the job:
> 
> Kind      Total   Successful   Failed   Killed   Start Time            Finish Time
> Setup        1        1            0        0    4-Apr-2009 00:30:16   4-Apr-2009 00:30:33 (17sec)
> Map         64       49           12        3    4-Apr-2009 00:30:33   4-Apr-2009 01:11:15 (40mins, 41sec)
> Reduce      34       30            4        0    4-Apr-2009 00:30:44   4-Apr-2009 04:31:36 (4hrs, 52sec)
> Cleanup      4        0            4        0    4-Apr-2009 04:31:36   4-Apr-2009 06:32:00 (2hrs, 24sec)
> (Total = successful + failed + killed)
> 
> 
> Not sure what to look for in the jobtracker log. All it shows for that
> particular failed task is that it was assigned to the same machine 4 times
> and then eventually failed. Perhaps something worth noting is that the 4
> failures were all due to timeouts:
> 
> "Task attempt_200904031942_0002_r_000013_3 failed to report status for 1802
> seconds. Killing!"
> 
> Also, looking at the logs, there was a map task too that was retried on that
> particular box 4 times without going to a different one. Perhaps it had
> something to do with the way this machine failed: The jobtracker still
> considered it live, while all actual tasks assigned to it timed out.
> 
> -- Stefan
> 
> 
> 
>> From: Amar Kamat <ama...@yahoo-inc.com>
>> Reply-To: <core-user@hadoop.apache.org>
>> Date: Tue, 07 Apr 2009 10:05:16 +0530
>> To: <core-user@hadoop.apache.org>
>> Subject: Re: Reduce task attempt retry strategy
>> 
>> Stefan Will wrote:
>>> Hi,
>>> 
>>> I had a flaky machine the other day that was still accepting jobs and
>>> sending heartbeats, but caused all reduce task attempts to fail. This in
>>> turn caused the whole job to fail because the same reduce task was retried 3
>>> times on that particular machine.
>>>   
>> What is your cluster size? If a task fails on a machine, then it is
>> re-tried on some other machine (based on the number of good machines left
>> in the cluster). After a certain number of failures, the machine will be
>> blacklisted (again based on the number of machines left in the cluster). 3
>> different reducers might be scheduled on that machine, but that should
>> not lead to job failure. Can you explain in detail what exactly
>> happened? Find out where the attempts got scheduled from the
>> jobtracker's log.
>> Amar
>>> Perhaps I'm confusing this with the block placement strategy in HDFS, but I
>>> always thought that the framework would retry tasks on a different machine if
>>> retries on the original machine kept failing. E.g. I would have expected it to
>>> retry once or twice on the same machine, but then switch to a different one
>>> to minimize the likelihood of getting stuck on a bad machine.
>>> 
>>> What is the expected behavior in 0.19.1 (which I'm running)? Any plans for
>>> improving on this in the future ?
>>> 
>>> Thanks,
>>> Stefan
>>> 
>>>   
> 
> 
