[ https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14111538#comment-14111538 ]
Jason Lowe commented on MAPREDUCE-5891:
---------------------------------------

Thanks for updating the patch, Junping.  Comments:

DEFAULT_SHUFFLE_FETCH_* should be in MRJobConfig.

openConnectionWithRetry should not ignore InterruptedException.  Fetchers are shut down by being interrupted, so I think minimally we should check for stopped==true if one occurs and act accordingly.

We log a WARN when we can retry but only an INFO when we failed to read a map header and are not retrying.  That seems backwards.  Also the message logged when we can't retry is a lot more informative than the one when we can.

We are retrying one more time when we're past the retry timeout, which could result in a significantly longer time to discover fetch failures that aren't NM restart-related.  This is also inconsistent with how openConnectionWithRetry behaves.

Only the retry enabled property was added to mapred-default.xml.  We should also add the other two properties with their defaults and appropriate descriptions for documentation.

There should be a unit test to verify fetch errors can still be reported even with retry enabled, as it's important that we don't break the ability to recover from errors not related to NM restart.

Nit: mapreduce.reduce.shuffle.fetch.interval-ms should be mapreduce.reduce.shuffle.fetch.retry.interval-ms to clearly indicate this is an interval only applicable for fetch retry.  Similarly mapreduce.reduce.shuffle.fetch.timeout-ms should be mapreduce.reduce.shuffle.fetch.retry.timeout-ms.

Nit: "which means it haven't retried yet." should be "which means it hasn't retried yet."
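
To make the interrupt and timeout points concrete, here is a rough sketch of the retry behavior I have in mind.  This is not code from the patch: the class FetchRetrySketch, the Connector interface, the shutDown method and the retryIntervalMs/retryTimeoutMs fields are all hypothetical names for illustration, and the real openConnectionWithRetry signature may differ.  The sketch assumes a retry should never start once the next attempt would fall at or past the retry timeout, and that an interrupt always ends the loop.

```java
import java.io.IOException;

// Illustrative sketch only; every name here is hypothetical, not the patch's API.
final class FetchRetrySketch {

    /** Stand-in for whatever actually opens the shuffle connection. */
    interface Connector {
        void connect() throws IOException;
    }

    private final long retryIntervalMs;
    private final long retryTimeoutMs;
    private volatile boolean stopped = false;

    FetchRetrySketch(long retryIntervalMs, long retryTimeoutMs) {
        this.retryIntervalMs = retryIntervalMs;
        this.retryTimeoutMs = retryTimeoutMs;
    }

    /** Marks the fetcher as stopped; the real fetcher is also interrupted. */
    void shutDown() {
        stopped = true;
    }

    /**
     * Tries to connect until it succeeds or the retry window closes.
     * Returns the number of attempts made. The last IOException is rethrown
     * when no further retry fits inside the window, so the caller can still
     * report a fetch failure. InterruptedException is never swallowed.
     */
    int openConnectionWithRetry(Connector connector)
            throws IOException, InterruptedException {
        final long startNanos = System.nanoTime();
        int attempts = 0;
        while (true) {
            if (stopped) {
                // Fetcher was shut down: stop retrying immediately.
                throw new InterruptedException("fetcher stopped");
            }
            attempts++;
            try {
                connector.connect();
                return attempts;
            } catch (IOException e) {
                long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000L;
                // Do not retry if the next attempt would begin at or after
                // the timeout; surface the failure instead.
                if (elapsedMs + retryIntervalMs >= retryTimeoutMs) {
                    throw e;
                }
            }
            // An interrupt during the wait propagates to the caller.
            Thread.sleep(retryIntervalMs);
        }
    }
}
```

With a retry timeout of 0 the first failure is reported right away, which is the property the requested unit test should pin down: enabling retry must not hide errors that have nothing to do with an NM restart.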

> Improved shuffle error handling across NM restarts
> --------------------------------------------------
>
>                 Key: MAPREDUCE-5891
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Junping Du
>         Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891.patch
>
>
> To minimize the number of map fetch failures reported by reducers across an
> NM restart it would be nice if reducers only reported a fetch failure after
> trying for a specified period of time to retrieve the data.

--
This message was sent by Atlassian JIRA
(v6.2#6252)