[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127309#comment-14127309
 ] 

Jason Lowe commented on MAPREDUCE-5891:
---------------------------------------

bq. a) Dynamic MR-to-YARN query: given that the NM recovery flag is a global, 
cluster-level setting (although it is possible to configure it on a per-NM 
basis), can we derive the value of 
mapreduce.reduce.shuffle.fetch.retry.enabled at job submission time from some 
YARN API call to the RM?

The RM is unaware of whether the NM supports work-preserving restart, and I'd 
rather not add that coupling just for this.

bq. b) Shuffle protocol change: it seems Fetcher and ShuffleHandler check HTTP 
headers via property key names. So if we add a new property to indicate whether 
recovery is supported and keep the same HTTP "version" property, a new 
version of the Fetcher might be able to work with an old version of the 
ShuffleHandler, and vice versa.

True, we could add a new HTTP header that new Fetchers could query.
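
As a rough sketch of what that check might look like on the Fetcher side 
(the header name X-Shuffle-Recovery-Supported and the class below are 
hypothetical, made up for illustration, and are not existing Hadoop code): a 
missing header is treated as "recovery not supported", so a new Fetcher 
talking to an old ShuffleHandler would simply fall back to the current 
behavior.

```java
import java.util.List;
import java.util.Map;

public class ShuffleRecoveryHeaderSketch {
    // Hypothetical header name, assumed for this sketch only.
    static final String RECOVERY_HEADER = "X-Shuffle-Recovery-Supported";

    /**
     * Returns true only if the response headers explicitly advertise
     * recovery support. The map has the same shape as the one returned by
     * HttpURLConnection.getHeaderFields(), which may contain a null key
     * for the status line.
     */
    static boolean retryEnabled(Map<String, List<String>> headers) {
        for (Map.Entry<String, List<String>> e : headers.entrySet()) {
            // HTTP header names are case-insensitive; a null key never matches.
            if (RECOVERY_HEADER.equalsIgnoreCase(e.getKey())) {
                List<String> values = e.getValue();
                return values != null && !values.isEmpty()
                        && Boolean.parseBoolean(values.get(0));
            }
        }
        // Header absent: assume an old ShuffleHandler without recovery.
        return false;
    }
}
```

With something along these lines, the existing "version" header would stay 
untouched, so old Fetchers would ignore the extra header entirely.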

> Improved shuffle error handling across NM restarts
> --------------------------------------------------
>
>                 Key: MAPREDUCE-5891
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Junping Du
>         Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
> MAPREDUCE-5891-v3.patch, MAPREDUCE-5891-v4.patch, MAPREDUCE-5891.patch
>
>
> To minimize the number of map fetch failures reported by reducers across an 
> NM restart it would be nice if reducers only reported a fetch failure after 
> trying for a specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
