[ https://issues.apache.org/jira/browse/MAPREDUCE-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492458#comment-13492458 ]
Robert Joseph Evans commented on MAPREDUCE-4775:
------------------------------------------------

OK, so I missed some of the code in shuffleScheduler.checkReducerHealth(). The stall check is in there, but the previous check for a single map attempt is completely useless at this point. Dropping the severity accordingly.

I am also confused about why a reducer could be stalled for over an hour (MAPREDUCE-4772) and not be killed. I will look into that here too.

> Reducer will "never" commit suicide
> -----------------------------------
>
>                 Key: MAPREDUCE-4775
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4775
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>
> In 1.0 there are a number of conditions that will cause a reducer to commit
> suicide and exit. These include the reducer being stalled and the error
> percentage of total fetches being too high. In the new code it will only
> commit suicide when the total number of failures for a single task attempt
> is >= max(30, totalMaps/10). In the best case, with the quadratic back-off,
> it would take 20.5 hours for a single map attempt to reach 30 failures. And
> unless there is only one reducer running, the map task would have been
> restarted before then.
> We should go back to including the same reducer suicide checks that are in 1.0.
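
For reference, the MRv2 condition quoted in the description (failures for a single task attempt >= max(30, totalMaps/10)) can be sketched roughly as below. This is only an illustration, not the actual Hadoop code: the class and method names are hypothetical, and integer division for totalMaps/10 is an assumption.

```java
// Illustrative sketch only; names are hypothetical and not taken from Hadoop.
public class ReducerSuicideThreshold {

    // Failures that a single map attempt must accumulate before the reducer
    // gives up, per the description: max(30, totalMaps / 10).
    // Integer division is assumed here.
    static int failureThreshold(int totalMaps) {
        return Math.max(30, totalMaps / 10);
    }

    // True when the failure count for one map attempt reaches the threshold.
    static boolean shouldAbort(int failuresForAttempt, int totalMaps) {
        return failuresForAttempt >= failureThreshold(totalMaps);
    }

    public static void main(String[] args) {
        // For jobs with up to 309 maps the floor of 30 applies.
        System.out.println(failureThreshold(100));
        // For larger jobs the threshold grows with the number of maps.
        System.out.println(failureThreshold(1000));
        System.out.println(shouldAbort(29, 100));
        System.out.println(shouldAbort(30, 100));
    }
}
```

The sketch shows that the threshold never drops below 30 for any job size, which is why the back-off delay before the 30th failure dominates how long a reducer can stay alive.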