[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492458#comment-13492458
 ] 

Robert Joseph Evans commented on MAPREDUCE-4775:
------------------------------------------------

OK so I missed some of the code in shuffleScheduler.checkReducerHealth(). The 
stall check is in there, but the previous check for a single map attempt is 
completely useless at this point. Dropping the severity accordingly.

I am also confused why a reducer could be stalled for over an hour 
(MAPREDUCE-4772) and not be killed. I will look into that here too.

                
> Reducer will "never" commit suicide
> -----------------------------------
>
>                 Key: MAPREDUCE-4775
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4775
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>
> In 1.0 there are a number of conditions that will cause a reducer to commit 
> suicide and exit.
> These include the reducer being stalled and the error percentage of total 
> fetches being too high.  In the new code it will only commit suicide when the 
> total number of failures for a single task attempt is >= max(30, totalMaps/10).  
> In the best case, with the quadratic back-off, it would take 20.5 hours for a 
> single map attempt to reach 30 failures.  And unless there is only one reducer 
> running, the map task would have been restarted before then.
> We should go back to including the same reducer suicide checks that are in 1.0.
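
For illustration only, the single-attempt threshold described above can be 
sketched as follows. This is a minimal sketch based solely on the formula in 
the description, not the actual Hadoop code; the class, method, and constant 
names are hypothetical.

```java
public class ReducerSuicideSketch {
    // Floor of 30 failures is taken from the issue description.
    // The constant name is hypothetical, not a Hadoop identifier.
    static final int MIN_FAILURES_PER_ATTEMPT = 30;

    /** Threshold as stated in the description: max(30, totalMaps / 10). */
    static int failureThreshold(int totalMaps) {
        return Math.max(MIN_FAILURES_PER_ATTEMPT, totalMaps / 10);
    }

    /** True once a single map attempt's fetch failures reach the threshold. */
    static boolean shouldAbort(int failuresForAttempt, int totalMaps) {
        return failuresForAttempt >= failureThreshold(totalMaps);
    }

    public static void main(String[] args) {
        // Small job: the floor of 30 applies.
        System.out.println(failureThreshold(100));
        // Large job: totalMaps / 10 exceeds the floor.
        System.out.println(failureThreshold(1000));
        // 29 failures on one attempt is still below the threshold.
        System.out.println(shouldAbort(29, 100));
    }
}
```

Because the threshold never drops below 30, even a small job needs 30 failed 
fetches against the same map attempt before this check alone would fire.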

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
