[ 
https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542552
 ] 

Devaraj Das commented on HADOOP-1984:
-------------------------------------

After some corridor discussion between me, Arun and Amar, here are some 
thoughts:
1) The task completion event has an additional field that says how much time a 
given map took to complete.

2) The longer a map took to complete, the longer we delay the feedback about it 
(in case reducers fail to fetch its output) to the JobTracker. That is, the 
killing of maps is based not on the number of failed fetch attempts, but on 
the time the map would take to run if re-executed.

3) For example, if a map takes 30 minutes to run, and the fetch for the 
corresponding output fails, a reduce postpones giving the feedback to the 
JobTracker until it has tried for 15 minutes or so (with exponential backoff 
within this time interval). In other words, we increase the number of retries 
for long-running maps to improve the odds that a fetch eventually succeeds. 
The time a reduce spends retrying is directly proportional to the time the map 
took to complete.
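
To make points 1) to 3) concrete, here is a rough sketch of how a reduce might 
derive its retry window from the map's run time. It is only an illustration, 
not the code in HADOOP-1984.patch: the class and method names are hypothetical, 
the 0.5 fraction is taken from the "30 minutes -> 15 minutes or so" example 
above, and the 1 second initial delay and the doubling factor are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; names and constants are assumptions, not Hadoop APIs. */
class FetchRetryPolicy {
    // Assumed: retry for half of the map's run time (30 min map -> 15 min of retries).
    static final double RETRY_FRACTION = 0.5;
    // Assumed: first wait between fetch attempts, doubled after each failure.
    static final long INITIAL_BACKOFF_MS = 1_000L;

    /** Total time a reduce keeps retrying before reporting the failed fetch. */
    static long retryBudgetMs(long mapRunTimeMs) {
        return (long) (mapRunTimeMs * RETRY_FRACTION);
    }

    /**
     * Waits between successive fetch attempts. Each wait doubles, and the
     * last one is truncated so that the waits sum exactly to the budget.
     */
    static List<Long> backoffSchedule(long mapRunTimeMs) {
        long budget = retryBudgetMs(mapRunTimeMs);
        List<Long> delays = new ArrayList<>();
        long elapsed = 0;
        long delay = INITIAL_BACKOFF_MS;
        while (elapsed < budget) {
            long wait = Math.min(delay, budget - elapsed);
            delays.add(wait);
            elapsed += wait;
            delay *= 2;
        }
        return delays;
    }
}
```

Under these assumptions, a map that ran for 30 minutes gives the reduce a 
15 minute retry window before it reports the failure to the JobTracker, and a 
map that ran twice as long gets twice the window.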

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984.patch
>
>
> In many cases, some reducers get stuck in the copy phase, progressing 
> extremely slowly.
> The entire cluster seems to be doing nothing. This causes a very long tail 
> for otherwise well-tuned map/reduce jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
