[ https://issues.apache.org/jira/browse/MAPREDUCE-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matt Foley closed MAPREDUCE-3184.
---------------------------------

Closed upon release of 1.0.1.

> Improve handling of fetch failures when a tasktracker is not responding on HTTP
> -------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3184
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3184
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: jobtracker
>    Affects Versions: 0.20.205.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>             Fix For: 1.0.1
>
>         Attachments: mr-3184.txt
>
>
> On a 100-node cluster, we had an issue where one of the TaskTrackers was hit
> by MAPREDUCE-2386 and stopped responding to fetches. The behavior observed
> was the following:
> - Every reducer would try to fetch the same map task, and fail after ~13
>   minutes.
> - At that point, all reducers would report this failed fetch to the JT for
>   the same task, and the task would be re-run.
> - Meanwhile, the reducers would move on to the next map task that ran on the
>   TT, and hang for another 13 minutes.
> The job essentially made no progress for hours, as each map task that ran on
> the bad node was serially marked failed.
> To combat this issue, we should introduce a second type of failed fetch
> notification, used when the TT does not respond at all (i.e.,
> SocketTimeoutException, etc). These fetch failure notifications should count
> against the TT at large, rather than a single task. If more than half of the
> reducers report such an issue for a given TT, then all of the tasks from that
> TT should be re-run.
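
The proposal in the description (host-level fetch failure reports, with a "more than half of the reducers" threshold) could be sketched roughly as follows. This is only an illustration of the rule as stated in the issue, not the attached patch: the class HostFetchFailureTracker, its methods, and the use of plain reducer IDs are all hypothetical, and treating ConnectException as a host-level failure is an assumed reading of the "etc" in the description.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these names come from the Hadoop code base or mr-3184.txt.
final class HostFetchFailureTracker {
    private final int totalReducers;
    // TaskTracker host -> IDs of the distinct reducers that reported it as unresponsive.
    private final Map<String, Set<Integer>> reportersByHost = new HashMap<>();

    HostFetchFailureTracker(int totalReducers) {
        this.totalReducers = totalReducers;
    }

    // True when the failure suggests the host did not respond at all.
    // SocketTimeoutException is named in the issue; ConnectException is an assumption.
    static boolean isHostLevelFailure(IOException e) {
        return e instanceof SocketTimeoutException || e instanceof ConnectException;
    }

    // Records one report and returns true once more than half of the reducers
    // have reported the same host, i.e. when all tasks from that host should be re-run.
    boolean reportUnresponsiveHost(String trackerHost, int reducerId) {
        Set<Integer> reporters =
            reportersByHost.computeIfAbsent(trackerHost, h -> new HashSet<>());
        reporters.add(reducerId); // a repeated report from the same reducer counts once
        return 2 * reporters.size() > totalReducers;
    }
}
```

Counting distinct reducers per host, rather than raw reports, keeps a single retrying reducer from tripping the threshold on its own. With 4 reducers, for example, the third distinct report for a host is the first one that crosses "more than half".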