[ https://issues.apache.org/jira/browse/HADOOP-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arun C Murthy resolved HADOOP-2220.
-----------------------------------
    Resolution: Fixed

Fixed as a part of HADOOP-2247.

> Reduce tasks fail too easily because of repeated fetch failures
> ---------------------------------------------------------------
>
>                 Key: HADOOP-2220
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2220
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Christian Kunz
>            Assignee: Amar Kamat
>            Priority: Blocker
>             Fix For: 0.15.2
>
>
> Currently, a reduce task fails once it accumulates more than
> MAX_FAILED_UNIQUE_FETCHES (hard-coded to 5) failed fetches of output from
> different mappers (introduced, I believe, in HADOOP-1158).
> This causes problems for longer-running jobs with a large number of mappers
> executing in multiple waves:
> Otherwise problem-free reduce tasks fail because of too many fetch failures
> caused by resource contention, and the replacement reduce tasks have to
> re-fetch all data from the already successfully executed mappers,
> introducing a lot of additional IO overhead. Moreover, the job fails once
> the same reducer exhausts its maximum number of attempts.
> The limit should be a function of the number of mappers and/or the number
> of waves of mappers, and should be more conservative (e.g. there is no need
> to fail reducers when speculative execution is enabled and there are enough
> slots to start speculatively executed reducers). We might also consider not
> counting such a restart against the number of attempts.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
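For illustration only, below is a minimal sketch of how the limit proposed in the description could be derived from the number of mappers and waves instead of a hard-coded constant. This is not the actual HADOOP-2247 patch; the class, method, and parameter names (FetchFailureLimit, maxFailedUniqueFetches, failureFraction) are hypothetical.

// Illustrative sketch only: scale the fetch-failure threshold with the size
// of the map phase rather than using a fixed constant of 5.
public class FetchFailureLimit {

  // Lower bound matching the historical hard-coded value.
  private static final int MIN_FAILED_UNIQUE_FETCHES = 5;

  /**
   * How many unique failed fetches a reduce task may accumulate before it is
   * declared failed, as a function of the map phase size.
   *
   * @param numMaps         total number of map tasks in the job
   * @param mapWaves        number of waves the maps run in (numMaps / map slots)
   * @param failureFraction fraction of maps whose fetches may fail, e.g. 0.1
   */
  public static int maxFailedUniqueFetches(int numMaps, int mapWaves,
                                           double failureFraction) {
    int scaled = (int) Math.ceil(numMaps * failureFraction) * Math.max(1, mapWaves);
    return Math.max(MIN_FAILED_UNIQUE_FETCHES, scaled);
  }

  public static void main(String[] args) {
    // A job with 2000 maps in 4 waves tolerates far more unique fetch
    // failures than the old hard-coded limit of 5.
    System.out.println(maxFailedUniqueFetches(2000, 4, 0.1)); // 800
    // A tiny job falls back to the historical minimum.
    System.out.println(maxFailedUniqueFetches(3, 1, 0.1));    // 5
  }
}

The Math.max clamp keeps the old behavior for small jobs, while large multi-wave jobs are allowed proportionally more unique fetch failures before the reducer is killed.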