[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14116244#comment-14116244
 ] 

Junping Du commented on MAPREDUCE-5891:
---------------------------------------

Thanks [~jlowe] for the comments!
bq. SHUFFLE_FETCH_TIMEOUT_MS should be 
"mapreduce.reduce.shuffle.fetch.retry.timeout-ms"
Nice catch, done.

bq. openConnectionWithRetry calls abortConnect if stopped, but the one caller 
of this function does the same thing when it returns. Maybe 
openConnectionWithRetry should just return if stopped?
Yes. The caller can also return directly, since its own caller in the upper 
layer already handles the stopped case. Fixed.

bq. Nit: The code block in copyMapOutput's catch of IOException is getting 
really long. It would be good to refactor some of this code into methods. Minor 
nit: "get failed" should be "failed".
Done.

bq. openConnectionWithRetry is being called and retries even if fetch retry is 
disabled
Good point, fixed.
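
For illustration only, here is a minimal sketch of the two behaviours agreed on above: openConnectionWithRetry simply returns when the fetcher is stopped, and it makes a single attempt when fetch retry is disabled. This is not the code from the patch. The Connector interface, the parameter list, the wall-clock timing and the returned attempt count are all assumptions made so the example is self-contained.

```java
import java.io.IOException;
import java.util.function.BooleanSupplier;

// Hypothetical sketch; names and signatures are assumptions, not Hadoop's API.
final class ConnectRetrySketch {

    /** Stand-in for the real connection attempt (assumed for this example). */
    interface Connector {
        void connect() throws IOException;
    }

    /**
     * Tries to connect, retrying on IOException until timeoutMs has elapsed.
     * Returns the number of attempts made. If stopped, it just returns and
     * leaves any cleanup to the caller.
     */
    static int openConnectionWithRetry(Connector connector,
                                       BooleanSupplier stopped,
                                       boolean retryEnabled,
                                       long timeoutMs,
                                       long intervalMs)
            throws IOException, InterruptedException {
        long start = System.currentTimeMillis();
        int attempts = 0;
        while (true) {
            if (stopped.getAsBoolean()) {
                return attempts; // no abort here; the caller handles it
            }
            attempts++;
            try {
                connector.connect();
                return attempts;
            } catch (IOException e) {
                boolean timedOut =
                        System.currentTimeMillis() - start >= timeoutMs;
                if (!retryEnabled || timedOut) {
                    throw e; // one attempt only when retry is disabled
                }
                Thread.sleep(intervalMs); // back off before the next attempt
            }
        }
    }
}
```

Passing the stopped flag as a BooleanSupplier keeps the sketch testable without a real fetcher thread.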

bq. Shouldn't we be setting retryStartTime back to zero instead of endTime 
below?
Also a good one, fixed.

bq. Also wondering if we should reset it after each successful transfer (e.g.: 
after a successful header parse and successful shuffle)?
That may not be necessary. If retryStartTime is not 0, this fetcher has not 
made any successful progress since the last getMapOutput failure, so it 
should keep retrying and let the wait time accumulate until the timeout. 
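
As a rough sketch of the retryStartTime handling discussed above (not the actual patch code): the first failure opens a retry window, later failures keep retrying while the accumulated wait is under the timeout, and retryStartTime goes back to zero once the window is used up. Only the retryStartTime name comes from this discussion. The FetchRetryWindow class, the shouldRetry signature and the explicit nowMs argument are assumptions for the example.

```java
// Hypothetical sketch; not the Fetcher code from the patch.
final class FetchRetryWindow {
    private final boolean retryEnabled;
    private final long retryTimeoutMs;

    // 0 means no retry window is open. Timestamps are assumed to be positive.
    private long retryStartTime = 0;

    FetchRetryWindow(boolean retryEnabled, long retryTimeoutMs) {
        this.retryEnabled = retryEnabled;
        this.retryTimeoutMs = retryTimeoutMs;
    }

    /** Called after a fetch failure; returns true if the fetch should be retried. */
    boolean shouldRetry(long nowMs) {
        if (!retryEnabled) {
            return false; // retry disabled: report the failure right away
        }
        if (retryStartTime == 0) {
            retryStartTime = nowMs; // first failure opens the window
        }
        if (nowMs - retryStartTime < retryTimeoutMs) {
            return true; // still inside the window, keep trying
        }
        // Window used up: reset to zero (not to the end time) so that a
        // later failure opens a fresh window.
        retryStartTime = 0;
        return false;
    }

    long getRetryStartTime() {
        return retryStartTime;
    }
}
```

With a 1000 ms timeout, a failure at t=5000 opens the window, a failure at t=5999 is still retried, and a failure at t=6000 gives up and resets retryStartTime to 0.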

> Improved shuffle error handling across NM restarts
> --------------------------------------------------
>
>                 Key: MAPREDUCE-5891
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Junping Du
>         Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
> MAPREDUCE-5891.patch
>
>
> To minimize the number of map fetch failures reported by reducers across an 
> NM restart it would be nice if reducers only reported a fetch failure after 
> trying for a specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
