[ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539081 ]

Amar Kamat commented on HADOOP-1984:
------------------------------------

Ant test passed on my system. This error is due to the hbase test 
{{TestRegionServerExit}}. 
----
The change here is the backoff function used for retrying when a map output 
fetch fails. Currently we use {{60 + random(0,300)}} sec as the backoff 
interval. With exponential backoff the penalty for the first few retries is 
small, but it grows quickly for later ones. The initial backoff is 2 sec and 
the function is
{code}
backoff(n) = init_value * base^(n-1)
n = number of retries
base is set to 2
init_value is set to 2 sec
{code}
Any suggestions on the formulation of the backoff algorithm and the initial 
values ?
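For illustration, here is a minimal sketch of that formula in Java. The class and method names are hypothetical (not from the attached patch), and the cap on the maximum backoff is my own assumption, added because an uncapped exponential grows without bound:

{code}
// Hypothetical sketch of backoff(n) = init_value * base^(n-1);
// names and the MAX_BACKOFF_MS cap are illustrative, not from the patch.
public class ExponentialBackoff {
    private static final long INIT_VALUE_MS = 2000;     // 2 sec initial backoff
    private static final int BASE = 2;                  // doubling per retry
    private static final long MAX_BACKOFF_MS = 300_000; // assumed 5 min cap

    /** Backoff in milliseconds before the n-th retry (n >= 1). */
    public static long backoffMs(int n) {
        long delay = INIT_VALUE_MS * (1L << (n - 1));   // init_value * 2^(n-1)
        return Math.min(delay, MAX_BACKOFF_MS);         // avoid unbounded waits
    }
}
{code}

With these values the waits are 2, 4, 8, 16, 32, ... sec, so a fetch that fails once or twice retries almost immediately, while a persistently failing host backs off quickly.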

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984.patch
>
>
> In many cases, some reducers get stuck in the copy phase, progressing 
> extremely slowly.
> The entire cluster seems to be doing nothing. This causes very bad long 
> tails for otherwise well-tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
