[ https://issues.apache.org/jira/browse/HADOOP-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635720#action_12635720 ]

Devaraj Das commented on HADOOP-4246:
-------------------------------------

{code}
if ((fetchFailedMaps.size() >= maxFailedUniqueFetches)
    && !reducerHealthy
    && (!reducerProgressedEnough || reducerStalled)) {
  LOG.fatal("Shuffle failed with too many fetch failures " +
{code}

The expression above should also accept (fetchFailedMaps.size() == numPendingFetches) as an alternative to the first clause, to take care of cases where a reducer node becomes faulty towards the end of the shuffle, when fewer than maxFailedUniqueFetches map outputs remain to be fetched.
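
A minimal sketch of the amended check, written as a standalone helper (illustrative only, not the committed patch; numPendingFetches is assumed to be the number of map outputs the reducer still has to copy):

{code}
// Sketch of the suggested condition. Parameter names mirror the snippet
// quoted above; numPendingFetches is an assumption, not the committed patch.
static boolean shouldFailShuffle(int failedUniqueFetches,
                                 int maxFailedUniqueFetches,
                                 int numPendingFetches,
                                 boolean reducerHealthy,
                                 boolean reducerProgressedEnough,
                                 boolean reducerStalled) {
  boolean tooManyFailures =
      failedUniqueFetches >= maxFailedUniqueFetches
      // also trip when every remaining fetch has failed, e.g. the reducer
      // node went bad near the end of the shuffle
      || failedUniqueFetches == numPendingFetches;
  return tooManyFailures
      && !reducerHealthy
      && (!reducerProgressedEnough || reducerStalled);
}
{code}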

> Reduce task copy errors may not kill it eventually
> --------------------------------------------------
>
>                 Key: HADOOP-4246
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4246
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Amareshwari Sriramadasu
>            Assignee: Amareshwari Sriramadasu
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: patch-4246.txt, patch-4246.txt, patch-4246.txt
>
>
> maxFetchRetriesPerMap in the reduce task can sometimes be zero (when
> maxMapRunTime is less than 4 seconds or mapred.reduce.copy.backoff is less
> than 4). In that case reduce task copy errors are not counted, so repeated
> fetch failures never kill the task.
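
For context, a minimal illustration of how such a retry budget can end up at zero (illustrative values and arithmetic only; this is not the actual ReduceTask.java computation):

{code}
public class RetryBudgetSketch {
  public static void main(String[] args) {
    // Illustration only: assumes the per-map retry budget is derived from
    // maxMapRunTime via integer division against a ~4-second initial backoff,
    // matching the thresholds named in the description above.
    int backoffInitMillis = 4000;                  // assumed initial backoff
    int maxMapRunTimeMillis = 3000;                // a map that ran for < 4 seconds
    int maxFetchRetriesPerMap = maxMapRunTimeMillis / backoffInitMillis;
    System.out.println(maxFetchRetriesPerMap);     // 0: fetch failures are never charged
    // A floor keeps the failure accounting meaningful:
    maxFetchRetriesPerMap = Math.max(1, maxFetchRetriesPerMap);
    System.out.println(maxFetchRetriesPerMap);     // 1
  }
}
{code}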
