[ https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Junping Du updated MAPREDUCE-5891:
----------------------------------
    Attachment: MAPREDUCE-5891-v2.patch

Thanks [~jlowe] for the review and comments! The v2 patch addresses all of them.

bq. We are retrying one more time when we're past the retry timeout which could result in a significantly longer time to discover fetch failures that aren't NM restart-related. This is also inconsistent with how openConnectionWithRetry behaves.

Nice catch. I moved the timeout check inside copyMapOutput, which now decides whether to throw an exception so the fetch is retried (before the timeout) or to report the fetch as failed (at or after the timeout).

> Improved shuffle error handling across NM restarts
> --------------------------------------------------
>
>                 Key: MAPREDUCE-5891
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Junping Du
>         Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch,
> MAPREDUCE-5891.patch
>
>
> To minimize the number of map fetch failures reported by reducers across an
> NM restart it would be nice if reducers only reported a fetch failure after
> trying for a specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
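
For illustration only, here is a minimal Java sketch of the timeout handling discussed in this update: the deadline is checked at the point where the failure is observed, so a fetch that fails at or after the timeout is reported as failed without one more retry. This is not the code in the patch. FetchAttempt, Clock, and fetchWithRetry are hypothetical names, and the real copyMapOutput and openConnectionWithRetry are not reproduced here. A real fetcher would presumably also wait between attempts, which the sketch leaves out.

```java
import java.io.IOException;

class FetchRetrySketch {
    /** Hypothetical stand-in for a single attempt to copy one map output. */
    public interface FetchAttempt {
        void run() throws IOException;
    }

    /** Hypothetical clock, so the sketch can be driven without real waiting. */
    public interface Clock {
        long nowMs();
    }

    public enum Result { SUCCESS, FAILED }

    /**
     * Retries the attempt until it succeeds or the retry window has elapsed.
     * The timeout is checked right after a failure, so no additional attempt
     * is made once the deadline has been reached.
     */
    public static Result fetchWithRetry(FetchAttempt attempt, Clock clock,
                                        long retryTimeoutMs) {
        long deadline = clock.nowMs() + retryTimeoutMs;
        while (true) {
            try {
                attempt.run();
                return Result.SUCCESS;
            } catch (IOException e) {
                if (clock.nowMs() >= deadline) {
                    // At or after the timeout: report the fetch failure.
                    return Result.FAILED;
                }
                // Before the timeout: fall through and retry.
            }
        }
    }

    public static void main(String[] args) {
        final long[] fakeNow = {0};
        final int[] attempts = {0};
        // Simulated node that is unreachable for the first two attempts.
        Result result = fetchWithRetry(() -> {
            attempts[0]++;
            fakeNow[0] += 10; // each attempt "takes" 10 ms of fake time
            if (attempts[0] < 3) {
                throw new IOException("connection refused");
            }
        }, () -> fakeNow[0], 1000);
        System.out.println(result + " after " + attempts[0] + " attempts");
    }
}
```

With a retry window shorter than the outage, the same call returns FAILED on the first failure observed at or past the deadline, which is the case the review comment was concerned about.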