[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127043#comment-14127043
 ] 

Jason Lowe commented on MAPREDUCE-5891:
---------------------------------------

Thanks for updating the patch, Junping, and sorry for the delay in re-review.
The fixes all look fine.

I agree with Ming that we should be consistent about the default state of this 
feature and NM restart, although I'm not a fan of adding a YARN API to query NM 
restart.  Task containers currently don't talk with the NM, and IMHO this is 
not a good enough reason to change that.  I'm OK with adding it to the shuffle 
protocol if we can do it in a backwards-compatible way, although I don't know 
offhand how that would be accomplished.  Another approach is to try to tie the 
two properties together and have the default value of 
mapreduce.reduce.shuffle.fetch.retry.enabled in mapred-default.xml be 
${yarn.nodemanager.recovery.enabled}, so they could still be set 
independently but by default the NM restart setting drives the fetch retry 
setting.
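
As a rough sketch of that second approach, the mapred-default.xml entry might
look something like the fragment below. It relies on Hadoop's Configuration
expanding ${...} references to other properties when a value is read; the
comments and exact layout here are illustrative assumptions, not text taken
from the patch.

```xml
<!-- Hypothetical mapred-default.xml entry tying fetch retry to NM recovery. -->
<property>
  <name>mapreduce.reduce.shuffle.fetch.retry.enabled</name>
  <!-- Expanded from the NM recovery setting unless this property is
       overridden explicitly in mapred-site.xml or the job configuration. -->
  <value>${yarn.nodemanager.recovery.enabled}</value>
</property>
```

A site or job could still set mapreduce.reduce.shuffle.fetch.retry.enabled to
true or false directly to decouple the two. This assumes
yarn.nodemanager.recovery.enabled is visible in the configuration the reducer
loads; otherwise the reference would have nothing to resolve against.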

> Improved shuffle error handling across NM restarts
> --------------------------------------------------
>
>                 Key: MAPREDUCE-5891
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Junping Du
>         Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
> MAPREDUCE-5891-v3.patch, MAPREDUCE-5891-v4.patch, MAPREDUCE-5891.patch
>
>
> To minimize the number of map fetch failures reported by reducers across an 
> NM restart it would be nice if reducers only reported a fetch failure after 
> trying for a specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
