[ https://issues.apache.org/jira/browse/HADOOP-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544601 ]
Srikanth Kakani commented on HADOOP-2220:
-----------------------------------------
The map-side counterpart of this problem is HADOOP-2247; the same formula
mentioned there should work in this case as well.
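
For illustration only: the actual formula is whatever HADOOP-2247 settles on, and it is not reproduced here. The sketch below only shows the general idea of scaling the unique-fetch-failure limit with the number of map tasks instead of hard-coding it to 5; the class name, the 5% fraction, and the floor of 5 are assumptions invented for this example.

{code:java}
// Illustrative sketch only; this is NOT the HADOOP-2247 formula.
// Assumption: the number of distinct map outputs a reducer may fail to fetch
// before giving up should grow with the number of map tasks, instead of being
// the hard-coded MAX_FAILED_UNIQUE_FETCHES = 5.
public class FetchFailureThreshold {

    // Floor kept for small jobs (matches the current hard-coded constant).
    private static final int MIN_FAILED_UNIQUE_FETCHES = 5;

    // Hypothetical fraction of mappers a reducer may fail to fetch from
    // before it declares itself failed; 5% is an assumed value.
    private static final double FAILURE_FRACTION = 0.05;

    /**
     * Maximum number of distinct map outputs a reduce task may repeatedly
     * fail to fetch before the task is marked as failed.
     */
    public static int maxFailedUniqueFetches(int numMapTasks) {
        return Math.max(MIN_FAILED_UNIQUE_FETCHES,
                        (int) Math.ceil(numMapTasks * FAILURE_FRACTION));
    }

    public static void main(String[] args) {
        // With 10,000 mappers the reducer would tolerate 500 unique fetch
        // failures instead of 5, so transient resource contention is far less
        // likely to kill an otherwise healthy reduce task.
        System.out.println(maxFailedUniqueFetches(100));    // 5
        System.out.println(maxFailedUniqueFetches(10000));  // 500
    }
}
{code}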
> Reduce tasks fail too easily because of repeated fetch failures
> ---------------------------------------------------------------
>
> Key: HADOOP-2220
> URL: https://issues.apache.org/jira/browse/HADOOP-2220
> Project: Hadoop
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.16.0
> Reporter: Christian Kunz
>
> Currently, a reduce task fails once it accumulates more than
> MAX_FAILED_UNIQUE_FETCHES (hard-coded to 5) failures to fetch output from
> distinct mappers (I believe this was introduced in HADOOP-1158).
> This causes problems for longer-running jobs with a large number of mappers
> executing in multiple waves:
> Otherwise healthy reduce tasks fail because of too many fetch failures caused
> by resource contention, and their replacements have to re-fetch all data from
> the mappers that already completed successfully, introducing a lot of
> additional I/O overhead. Moreover, the job fails outright when the same
> reducer exhausts its maximum number of attempts.
> The limit should be a function of the number of mappers and/or the number of
> waves of mappers, and it should be more conservative (e.g. there is no need to
> let reducers fail when speculative execution is enabled and there are enough
> slots to start speculatively executed reducers). We might also consider not
> counting such a restart against the number of attempts.