[ https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127043#comment-14127043 ]
Jason Lowe commented on MAPREDUCE-5891:
---------------------------------------

Thanks for updating the patch, Junping, and sorry for the delay in re-review. The fixes all look fine.

I agree with Ming that we should be consistent about the default state of this feature and NM restart, although I'm not a fan of adding a YARN API to query NM restart. Task containers currently don't talk with the NM, and IMHO this is not a good enough reason to change that. I'm OK with adding it to the shuffle protocol if we can do it in a backwards-compatible way, although I don't know offhand how that would be accomplished.

Another approach is to tie the two properties together and have the default value of mapreduce.reduce.shuffle.fetch.retry.enabled in mapred-default.xml be ${yarn.nodemanager.recovery.enabled}, so they could still be set independently but by default the NM restart setting drives the fetch retry setting.

> Improved shuffle error handling across NM restarts
> --------------------------------------------------
>
>                 Key: MAPREDUCE-5891
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Junping Du
>         Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch,
>                      MAPREDUCE-5891-v3.patch, MAPREDUCE-5891-v4.patch, MAPREDUCE-5891.patch
>
>
> To minimize the number of map fetch failures reported by reducers across an
> NM restart it would be nice if reducers only reported a fetch failure after
> trying for a specified period of time to retrieve the data.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
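
For illustration only, the property-coupling approach suggested in the comment might look roughly like the following entry in mapred-default.xml. The two property names and the default value are taken from the comment. The description text is made up for this sketch and is not from any patch on this issue. The sketch also assumes that yarn.nodemanager.recovery.enabled is defined in the configuration the job loads, so that the ${...} reference has something to expand to.

```xml
<!-- Sketch only: description wording is hypothetical, not taken from the patch. -->
<property>
  <name>mapreduce.reduce.shuffle.fetch.retry.enabled</name>
  <!-- By default, follow whatever the NM recovery setting resolves to. -->
  <value>${yarn.nodemanager.recovery.enabled}</value>
  <description>
    Whether reducers retry a failed fetch for a period of time before
    reporting a fetch failure. Defaults to the NM recovery setting.
  </description>
</property>
```

With a default like this, a site could still decouple the two settings by giving mapreduce.reduce.shuffle.fetch.retry.enabled an explicit true or false in mapred-site.xml, which is the "set independently" case the comment mentions.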