[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127212#comment-14127212
 ] 

Ming Ma commented on MAPREDUCE-5891:
------------------------------------

The patch looks good. I like Jason's idea to have 
mapreduce.reduce.shuffle.fetch.retry.enabled use 
${yarn.nodemanager.recovery.enabled} as its default value. As for the other 
approaches:

a) Dynamic MR-to-YARN query. Given that the NM recovery flag is a global, 
cluster-level setting (although it can be configured on a per-NM basis), can 
we derive the value of mapreduce.reduce.shuffle.fetch.retry.enabled at job 
submission time from some YARN API call to the RM?

b) Shuffle protocol change. It seems Fetcher and ShuffleHandler check HTTP 
headers via property key names. So if we add a new property to indicate 
whether recovery is supported and keep the same HTTP "version" property, a 
new version of the fetcher might be able to work with an old version of 
ShuffleHandler, and vice versa.
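
To illustrate approach (b), here is a minimal, self-contained Java sketch of 
key-based header checking. It is not taken from the patch or from the actual 
Fetcher/ShuffleHandler code: the class name, the header names and values, and 
the "recovery-supported" property are all hypothetical placeholders. It only 
shows why looking headers up by key lets old and new peers interoperate when 
the "version" value stays unchanged.

```java
import java.util.HashMap;
import java.util.Map;

public class ShuffleHeaderSketch {
    // Placeholder header names and values, assumed for illustration only.
    static final String VERSION_KEY = "version";
    static final String VERSION_VALUE = "1.0.0";
    // Hypothetical new property advertising recovery support.
    static final String RECOVERY_KEY = "recovery-supported";

    /** Old-style check: looks up only the version key and ignores unknown headers. */
    static boolean versionAccepted(Map<String, String> headers) {
        return VERSION_VALUE.equals(headers.get(VERSION_KEY));
    }

    /** New-style check: a missing header is read as "peer predates the property". */
    static boolean peerSupportsRecovery(Map<String, String> headers) {
        // Boolean.parseBoolean returns false for null, so an absent header is safe.
        return Boolean.parseBoolean(headers.get(RECOVERY_KEY));
    }

    /** Headers an old peer would send: version only. */
    static Map<String, String> oldPeerHeaders() {
        Map<String, String> headers = new HashMap<>();
        headers.put(VERSION_KEY, VERSION_VALUE);
        return headers;
    }

    /** Headers a new peer would send: same version plus the new property. */
    static Map<String, String> newPeerHeaders() {
        Map<String, String> headers = oldPeerHeaders();
        headers.put(RECOVERY_KEY, "true");
        return headers;
    }

    public static void main(String[] args) {
        // An old receiver still accepts a new sender because the version is unchanged.
        System.out.println("old accepts new: " + versionAccepted(newPeerHeaders()));
        // A new receiver detects that an old sender does not advertise recovery.
        System.out.println("old peer recovery: " + peerSupportsRecovery(oldPeerHeaders()));
        System.out.println("new peer recovery: " + peerSupportsRecovery(newPeerHeaders()));
    }
}
```

Under these assumptions, a fetcher that sees no recovery property would simply 
fall back to the existing behavior of reporting the fetch failure without 
retrying.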

> Improved shuffle error handling across NM restarts
> --------------------------------------------------
>
>                 Key: MAPREDUCE-5891
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Junping Du
>         Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
> MAPREDUCE-5891-v3.patch, MAPREDUCE-5891-v4.patch, MAPREDUCE-5891.patch
>
>
> To minimize the number of map fetch failures reported by reducers across an 
> NM restart it would be nice if reducers only reported a fetch failure after 
> trying for a specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
