[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118290#comment-14118290
 ] 

Ming Ma commented on MAPREDUCE-5891:
------------------------------------

Thanks, Junping and Jason, for the useful patch.

When slowstart is set to a small value, the reducer will fetch some mapper 
output and then wait for the rest. Is it possible that Fetcher.retryStartTime 
is left at an old value from an earlier restart of NM host A, so that the 
fetcher retry is marked as timed out when it later tries to handle a restart 
of NM host B?
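
To make the concern concrete, here is a minimal sketch. It is not the code in 
the patch: RetryWindowSketch, shouldRetry and onFetchSuccess are hypothetical 
names, and the 30-second timeout is an assumed value. Only the retryStartTime 
field name is taken from the question above. It shows how a retry window that 
is never cleared after host A recovers would make the first failure on host B 
look like an expired retry.

```java
/** Hypothetical sketch of a fetch retry window; not the actual Hadoop Fetcher. */
public class RetryWindowSketch {
    private final long retryTimeoutMs;
    // -1 means no retry window is currently open.
    private long retryStartTime = -1;

    public RetryWindowSketch(long retryTimeoutMs) {
        this.retryTimeoutMs = retryTimeoutMs;
    }

    /** Called on a fetch failure; returns true while retrying is still allowed. */
    public boolean shouldRetry(long nowMs) {
        if (retryStartTime < 0) {
            retryStartTime = nowMs; // the first failure opens the window
        }
        return nowMs - retryStartTime <= retryTimeoutMs;
    }

    /** Called after a successful fetch so the next failure opens a fresh window. */
    public void onFetchSuccess() {
        retryStartTime = -1;
    }

    public static void main(String[] args) {
        long timeoutMs = 30_000; // assumed value, for illustration only

        // Case 1: the window opened by host A's restart is never cleared.
        RetryWindowSketch stale = new RetryWindowSketch(timeoutMs);
        stale.shouldRetry(0);                          // host A restarts at t=0
        boolean staleAllows = stale.shouldRetry(600_000); // host B restarts 10 min later

        // Case 2: the window is cleared once host A serves data again.
        RetryWindowSketch reset = new RetryWindowSketch(timeoutMs);
        reset.shouldRetry(0);
        reset.onFetchSuccess();
        boolean resetAllows = reset.shouldRetry(600_000);

        System.out.println("stale window allows retry: " + staleAllows);
        System.out.println("reset window allows retry: " + resetAllows);
    }
}
```

If the patch already clears the start time after a successful fetch, then the 
scenario in the first case should not occur.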

To make sure the fetcher doesn't retry unnecessarily in the decommission 
scenario, the assumption seems to be that we will have some sort of graceful 
decommission support, so that the fetcher can still get mapper output while 
the node is being decommissioned. Is that true?

If we get time to do YARN-1593, that will further reduce the chance of a 
shuffle handler restart. Any opinions on that?

> Improved shuffle error handling across NM restarts
> --------------------------------------------------
>
>                 Key: MAPREDUCE-5891
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Junping Du
>         Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
> MAPREDUCE-5891-v3.patch, MAPREDUCE-5891.patch
>
>
> To minimize the number of map fetch failures reported by reducers across an 
> NM restart, it would be nice if reducers only reported a fetch failure after 
> trying for a specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
