[ https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14116244#comment-14116244 ]
Junping Du commented on MAPREDUCE-5891:
---------------------------------------

Thanks [~jlowe] for the comments!

bq. SHUFFLE_FETCH_TIMEOUT_MS should be "mapreduce.reduce.shuffle.fetch.retry.timeout-ms"
Nice catch, done.

bq. openConnectionWithRetry calls abortConnect if stopped, but the one caller of this function does the same thing when it returns. Maybe openConnectionWithRetry should just return if stopped?
Yes. The caller can even return directly, since its own caller in the upper layer already handles that case. Fixed.

bq. Nit: The code block in copyMapOutput's catch of IOException is getting really long. It would be good to refactor some of this code into methods. Minor nit: "get failed" should be "failed".
Done.

bq. openConnectionWithRetry is being called and retries even if fetch retry is disabled
Good point, fixed.

bq. Shouldn't we be setting retryStartTime back to zero instead of endTime below?
Also a good one, fixed.

bq. Also wondering if we should reset it after each successful transfer (e.g.: after a successful header parse and successful shuffle)?
That may not be necessary. If retryStartTime is not 0, the fetcher has not made any successful progress since the last getMapOutput failure, so it should keep retrying and accumulating wait time until the timeout is reached.

> Improved shuffle error handling across NM restarts
> --------------------------------------------------
>
>                 Key: MAPREDUCE-5891
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Junping Du
>         Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, MAPREDUCE-5891.patch
>
>
> To minimize the number of map fetch failures reported by reducers across an NM restart it would be nice if reducers only reported a fetch failure after trying for a specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
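
For illustration only, the retry-window behaviour discussed in the comment (retryStartTime is 0 when no retry is in progress, is set on the first failure, and keeps accumulating wait time until the timeout) can be sketched roughly as below. This is not the code from the attached patches: the class name FetchRetryWindow, its methods, and the point at which reset() is called are hypothetical, and timestamps are assumed to be positive millisecond values so that 0 can act as the "no window open" marker.

```java
/**
 * Hypothetical sketch of a fetch retry window. Not the MAPREDUCE-5891 patch.
 * retryStartTime == 0 means "no retry window is currently open".
 */
final class FetchRetryWindow {
    private final long timeoutMs;
    private long retryStartTime = 0L;

    FetchRetryWindow(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /**
     * Called after a failed fetch at time nowMs (assumed > 0).
     * Opens the window on the first failure, then reports whether the
     * accumulated wait time is still below the timeout.
     */
    boolean shouldRetry(long nowMs) {
        if (retryStartTime == 0L) {
            retryStartTime = nowMs; // first failure since the last reset
        }
        return nowMs - retryStartTime < timeoutMs;
    }

    /** Closes the window by setting retryStartTime back to zero. */
    void reset() {
        retryStartTime = 0L;
    }

    long getRetryStartTime() {
        return retryStartTime;
    }

    public static void main(String[] args) {
        FetchRetryWindow window = new FetchRetryWindow(1000L);
        System.out.println(window.shouldRetry(5000L)); // first failure opens the window
        System.out.println(window.shouldRetry(5999L)); // still inside the window
        System.out.println(window.shouldRetry(6000L)); // timeout reached
        window.reset();
        System.out.println(window.getRetryStartTime());
    }
}
```

In this sketch, later failures do not move retryStartTime, so the elapsed time is always measured from the first failure after the last reset. That matches the point made in the comment that the fetcher keeps aggregating wait time until the timeout.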