[
https://issues.apache.org/jira/browse/HADOOP-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635720#action_12635720
]
Devaraj Das commented on HADOOP-4246:
-------------------------------------
{code}
+ if ((fetchFailedMaps.size() >= maxFailedUniqueFetches)
      && !reducerHealthy
      && (!reducerProgressedEnough || reducerStalled)) {
    LOG.fatal("Shuffle failed with too many fetch failures " +
{code}
The expression above should also include (fetchFailedMaps.size() ==
numPendingFetches), so that the check still fires when a reducer node becomes
faulty towards the end of the shuffle, when fewer than maxFailedUniqueFetches
map outputs remain to be fetched and the existing threshold can never be
reached.
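For illustration, a minimal sketch of how the amended check might read, with
the new clause OR'ed into the existing threshold test. Here numPendingFetches
is assumed to be a local variable holding the number of map outputs still to
be fetched (the name comes from this comment and is not verified against the
actual ReduceTask code); the remaining variables are as in the quoted snippet.
{code}
// Sketch only: assumes numPendingFetches = count of map outputs not yet copied.
// Declares the shuffle failed when every remaining fetch is failing, even if
// fewer than maxFailedUniqueFetches maps are left near the end of the shuffle.
if ((fetchFailedMaps.size() >= maxFailedUniqueFetches
     || fetchFailedMaps.size() == numPendingFetches)
    && !reducerHealthy
    && (!reducerProgressedEnough || reducerStalled)) {
  // log the fatal error and fail the reduce task, as in the snippet above
}
{code}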
> Reduce task copy errors may not kill it eventually
> --------------------------------------------------
>
> Key: HADOOP-4246
> URL: https://issues.apache.org/jira/browse/HADOOP-4246
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.19.0
> Reporter: Amareshwari Sriramadasu
> Assignee: Amareshwari Sriramadasu
> Priority: Blocker
> Fix For: 0.19.0
>
> Attachments: patch-4246.txt, patch-4246.txt, patch-4246.txt
>
>
> maxFetchRetriesPerMap in the reduce task can sometimes be zero (when
> maxMapRunTime is less than 4 seconds or mapred.reduce.copy.backoff is less
> than 4). As a result, reduce task copy errors are never counted towards
> eventually killing the task.