Robert Joseph Evans created MAPREDUCE-4772:
----------------------------------------------
Summary: Fetch failures can take way too long for a map to be restarted
Key: MAPREDUCE-4772
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4772
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mrv2
Affects Versions: 0.23.4
Reporter: Robert Joseph Evans
Assignee: Robert Joseph Evans
Priority: Critical
In one particular case we saw an NM go down at just the right time, so that
most of the reducers got the output of the map tasks, but not all of them.
The ones that failed to get the output reported to the AM rather quickly that
they could not fetch from the NM. But because the other reducers were still
running, the AM would not relaunch the map task: no more than 50% of the
running reducers had reported fetch failures. Then, because of the exponential
back-off for fetches on the reducers, it took until the 1 hour 45 minute mark
for the reduce tasks to hit another 10 fetch failures and report in again.
At that point the other reducers had finished and the job relaunched the map
task. If the reducers had still been running at 1:45, I have no idea how long
it would have taken for each of the tasks to get to 30 fetch failures.
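To make the threshold problem concrete, here is a minimal sketch of the kind of
check described above. It is not the actual AM code: the class, method, and
constant names are hypothetical, the reducer counts are made up for
illustration, and the strict "more than 50%" comparison is taken from the
description rather than from the source tree. The only thing that changes
between the two calls is the denominator.

```java
public class FetchFailureThreshold {
    // Hypothetical constant mirroring the 50% rule described in this issue.
    static final double MAX_ALLOWED_FAILURE_FRACTION = 0.5;

    /**
     * Decide whether to relaunch a map task.
     *
     * @param reducersReportingFailure reducers that reported fetch failures for the map
     * @param denominator              the reducer population the fraction is measured against
     */
    static boolean shouldRelaunchMap(int reducersReportingFailure, int denominator) {
        if (denominator <= 0) {
            return false; // nothing to compare against
        }
        double fraction = (double) reducersReportingFailure / denominator;
        return fraction > MAX_ALLOWED_FAILURE_FRACTION;
    }

    public static void main(String[] args) {
        // Illustrative numbers only: 100 reducers running, 90 of them already
        // past the shuffle, 10 still shuffling and all 10 failing to fetch.
        int running = 100;
        int stillShuffling = 10;
        int reportingFailure = 10;

        // Measured against running reducers: 10/100 = 0.1, no relaunch.
        System.out.println(shouldRelaunchMap(reportingFailure, running));        // false
        // Measured against shuffling reducers: 10/10 = 1.0, relaunch.
        System.out.println(shouldRelaunchMap(reportingFailure, stillShuffling)); // true
    }
}
```

With the running-reducer denominator, the reducers that already have the map
output dilute the failure fraction, which is why the relaunch was held back
until they finished.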
We need to trigger the map relaunch based on the percentage of reducers
shuffling, not the percentage of reducers running. We also need a maximum
limit on the back-off, so that a reducer never ends up waiting for days to
try to fetch map output.
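A capped back-off could look roughly like the sketch below. This is only an
illustration of the idea, not a patch: the class and method names are
hypothetical, and the doubling formula and the example base and cap values are
assumptions, not the formula the reducer fetcher actually uses.

```java
public class CappedBackoff {
    /**
     * Delay before the next fetch attempt: base * 2^(failures - 1),
     * never exceeding maxMillis. Assumes baseMillis > 0 and maxMillis > 0.
     */
    static long delayMillis(int failures, long baseMillis, long maxMillis) {
        if (failures <= 0) {
            return 0L; // no failures yet, retry immediately
        }
        // Clamp the shift so the left shift itself stays well defined.
        int shift = Math.min(failures - 1, 62);
        // If the doubling would overflow a long, the cap applies anyway.
        if (baseMillis > (Long.MAX_VALUE >> shift)) {
            return maxMillis;
        }
        return Math.min(baseMillis << shift, maxMillis);
    }

    public static void main(String[] args) {
        long base = 1_000L;  // assumed 1 second base delay
        long cap = 60_000L;  // assumed 1 minute maximum delay
        for (int failures = 1; failures <= 10; failures++) {
            System.out.println(failures + " failures -> "
                    + delayMillis(failures, base, cap) + " ms");
        }
    }
}
```

With a cap in place, the time to accumulate another N fetch failures is
bounded by roughly N times the maximum delay, instead of growing without
limit.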