Improve handling of fetch failures when a tasktracker is not responding on HTTP
-------------------------------------------------------------------------------

                 Key: MAPREDUCE-3184
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3184
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: jobtracker
    Affects Versions: 0.20.205.0
            Reporter: Todd Lipcon


On a 100-node cluster, we had an issue where one of the TaskTrackers was hit by 
MAPREDUCE-2386 and stopped responding to fetches. We observed the following 
behavior:
- Every reducer would try to fetch the output of the same map task, and fail 
after ~13 minutes.
- At that point, all reducers would report this failed fetch to the JT for the 
same task, and the task would be re-run.
- Meanwhile, the reducers would move on to the next map task that had run on 
the TT, and hang for another ~13 minutes.
The job essentially made no progress for hours, as the map tasks that had run 
on the bad node were marked failed one at a time.

To combat this issue, we should introduce a second type of failed fetch 
notification, used when the TT does not respond at all (e.g. 
SocketTimeoutException). These fetch failure notifications should count 
against the TT as a whole, rather than against a single task. If more than 
half of the reducers report such an issue for a given TT, then all of the 
tasks from that TT should be re-run.
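
A rough sketch of the proposed accounting, to make the threshold concrete. 
This is not existing Hadoop code: the class TrackerFetchFailureTally and its 
methods are hypothetical names used for illustration only. It assumes that 
each reducer is counted at most once per TT, and that a connection refusal 
counts as "not responding" alongside SocketTimeoutException (the description 
above only names the timeout case).

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical per-TT tally of "tracker not responding" reports. */
class TrackerFetchFailureTally {
    private final int totalReducers;
    // TT name -> IDs of the reducers that reported it as unresponsive.
    private final Map<String, Set<String>> reportersByTracker =
            new HashMap<String, Set<String>>();

    TrackerFetchFailureTally(int totalReducers) {
        if (totalReducers <= 0) {
            throw new IllegalArgumentException("totalReducers must be positive");
        }
        this.totalReducers = totalReducers;
    }

    /**
     * Decides whether a fetch error should count against the TT as a whole.
     * Treating ConnectException this way is an assumption of this sketch.
     */
    static boolean isTrackerLevelFailure(IOException e) {
        return e instanceof SocketTimeoutException
                || e instanceof ConnectException;
    }

    /**
     * Records one report from a reducer. Returns true once strictly more
     * than half of the reducers have reported the given TT, which is the
     * point at which all of its map tasks would be re-run.
     */
    boolean reportUnresponsive(String tracker, String reducerId) {
        Set<String> reporters = reportersByTracker.get(tracker);
        if (reporters == null) {
            reporters = new HashSet<String>();
            reportersByTracker.put(tracker, reporters);
        }
        reporters.add(reducerId); // repeat reports from one reducer count once
        return reporters.size() * 2 > totalReducers;
    }
}
```

With such a tally, the JT would give up on the bad TT after the first round 
of timeouts, instead of waiting one timeout per map task that ran there.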

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
