[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14115958#comment-14115958
 ] 

Jason Lowe commented on MAPREDUCE-5891:
---------------------------------------

Thanks for updating the patch!  Comments:

SHUFFLE_FETCH_TIMEOUT_MS is set to "mapreduce.reduce.shuffle.fetch.timeout-ms", 
but it should be "mapreduce.reduce.shuffle.fetch.retry.timeout-ms".

openConnectionWithRetry calls abortConnect if stopped, but the one caller of 
this function does the same thing when it returns.  Maybe 
openConnectionWithRetry should just return if stopped?

Nit: The code block in copyMapOutput's catch of IOException is getting really 
long.  It would be good to refactor some of this code into methods.

Minor nit: "get failed" should be "failed".

openConnectionWithRetry is called, and retries, even if fetch retry is 
disabled.
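
To make this point and the earlier "just return if stopped" suggestion 
concrete, here is a rough sketch of the control flow I have in mind.  It is not 
taken from the patch: the Connector interface, the stopped supplier, and the 
retryEnabled, timeoutMs and intervalMs parameters are all hypothetical 
stand-ins for whatever the Fetcher actually uses.

```java
import java.io.IOException;
import java.util.function.BooleanSupplier;

final class ConnectSketch {
    /** Hypothetical stand-in for the real connect call. */
    interface Connector {
        void connect() throws IOException;
    }

    /**
     * Tries to connect. Retries only when retryEnabled is true and the
     * retry timeout has not elapsed; returns quietly if stopped so the
     * caller can do the abort handling in one place.
     */
    static void openConnectionWithRetry(Connector connector,
            BooleanSupplier stopped, boolean retryEnabled,
            long timeoutMs, long intervalMs)
            throws IOException, InterruptedException {
        long start = System.currentTimeMillis();
        while (true) {
            if (stopped.getAsBoolean()) {
                return; // caller already handles the stopped case
            }
            try {
                connector.connect();
                return;
            } catch (IOException e) {
                // With retry disabled the first failure propagates at once.
                if (!retryEnabled
                        || System.currentTimeMillis() - start >= timeoutMs) {
                    throw e;
                }
                Thread.sleep(intervalMs);
            }
        }
    }
}
```

With retry disabled this makes exactly one connection attempt, which I believe 
matches the behavior from before the patch.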

Shouldn't we be setting retryStartTime back to zero instead of endTime below?  
Otherwise the next error could time out without any retry if the transfer 
before the error took longer than the timeout interval.
{code}
      // Refresh retryStartTime as map task make progress if retried before.
      if (retryStartTime != 0) {
        retryStartTime = endTime;
      }
{code}
I'm also wondering whether we should reset it after each successful transfer 
(e.g., after a successful header parse and successful shuffle).
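
As an illustration of the reset-to-zero behavior, here is a minimal sketch.  
The RetryWindow class and its method names are hypothetical and not part of 
the patch; it only assumes that zero means "no retry in progress", as in the 
snippet above.

```java
/** Hypothetical helper, not from the patch: tracks one retry window. */
final class RetryWindow {
    private final long timeoutMs;
    private long retryStartTime = 0; // 0 means no retry in progress

    RetryWindow(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Called on each fetch error; returns true if we should retry. */
    boolean shouldRetry(long nowMs) {
        if (retryStartTime == 0) {
            retryStartTime = nowMs; // the first error opens the window
        }
        return nowMs - retryStartTime <= timeoutMs;
    }

    /** Called after a successful header parse or a successful shuffle. */
    void onSuccess() {
        retryStartTime = 0; // the next error starts a fresh window
    }
}
```

Because onSuccess() clears the start time, an error that follows a long but 
successful transfer gets a full timeout interval of retries instead of timing 
out right away.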


> Improved shuffle error handling across NM restarts
> --------------------------------------------------
>
>                 Key: MAPREDUCE-5891
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Junping Du
>         Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
> MAPREDUCE-5891.patch
>
>
> To minimize the number of map fetch failures reported by reducers across an 
> NM restart, it would be nice if reducers only reported a fetch failure after 
> trying for a specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
