[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14111538#comment-14111538
 ] 

Jason Lowe commented on MAPREDUCE-5891:
---------------------------------------

Thanks for updating the patch, Junping.  Comments:

DEFAULT_SHUFFLE_FETCH_* should be in MRJobConfig

openConnectionWithRetry should not ignore InterruptedException.  Fetchers are 
shut down by being interrupted, so I think minimally we should check for 
stopped==true if one occurs and act accordingly.
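
A minimal sketch of one way to honor the interrupt, not the patch's actual 
code.  Only the names openConnectionWithRetry and stopped come from this 
comment; the class, the Connector interface and shutDown are hypothetical 
stand-ins.

```java
import java.io.IOException;

// Hypothetical stand-in for the fetcher; not the real Fetcher class.
final class InterruptAwareFetcher {
    /** Hypothetical hook that opens the shuffle connection. */
    interface Connector {
        void connect() throws IOException;
    }

    // Set before the fetcher thread is interrupted for shutdown.
    private volatile boolean stopped = false;

    void shutDown(Thread fetcherThread) {
        stopped = true;
        fetcherThread.interrupt();
    }

    /**
     * Retries connect() every intervalMs until it succeeds or timeoutMs has
     * elapsed. Returns true once connected, false if the fetcher was stopped.
     */
    boolean openConnectionWithRetry(Connector connector, long intervalMs,
            long timeoutMs) throws IOException {
        long deadlineNanos = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            try {
                connector.connect();
                return true;
            } catch (IOException e) {
                if (stopped) {
                    return false; // shutting down: not a fetch failure
                }
                if (System.nanoTime() - deadlineNanos >= 0) {
                    throw e; // retry window exhausted: report the failure
                }
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException ie) {
                if (stopped) {
                    return false; // interrupted as part of shutdown
                }
                // Not a shutdown: keep the interrupt visible to callers.
                Thread.currentThread().interrupt();
                throw new IOException("Interrupted while waiting to retry", ie);
            }
        }
    }
}
```

The point of the sketch is that an interrupt during the sleep is never 
swallowed: it either ends the loop quietly (shutdown) or surfaces as an error 
with the interrupt flag restored.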

We log a WARN when we can retry but only an INFO when we failed to read a map 
header and are not retrying. That seems backwards.  Also the message logged 
when we can't retry is a lot more informative than the one when we can.

We retry one more time even after the retry timeout has passed, which could 
result in a significantly longer time to discover fetch failures that aren't NM 
restart-related.  This is also inconsistent with how openConnectionWithRetry 
behaves.
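
To illustrate the behavior I have in mind, here is a rough sketch of a retry 
window that refuses any retry once the timeout has elapsed.  The class and 
method names are hypothetical and do not come from the patch.

```java
// Hypothetical helper; the names are illustrative, not from the patch.
final class FetchRetryWindow {
    private final long timeoutMs;
    // -1 means no retry has been attempted yet.
    private long retryStartTimeMs = -1;

    FetchRetryWindow(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Returns true if another retry is still allowed at time nowMs. */
    boolean shouldRetry(long nowMs) {
        if (retryStartTimeMs < 0) {
            retryStartTimeMs = nowMs; // first failure opens the window
        }
        // Strictly inside the window only: no extra attempt at or past it.
        return nowMs - retryStartTimeMs < timeoutMs;
    }

    /** Called after a successful fetch so the next failure starts fresh. */
    void reset() {
        retryStartTimeMs = -1;
    }
}
```

With a check like this, a failure seen at or after the deadline is reported 
right away instead of costing one more interval.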

Only the retry enabled property was added to mapred-default.xml.  We should 
also add the other two properties with their defaults and appropriate 
descriptions for documentation.

There should be a unit test to verify fetch errors can still be reported even 
with retry enabled, as it's important that we don't break the ability to 
recover from errors not related to NM restart.

Nit: mapreduce.reduce.shuffle.fetch.interval-ms should be 
mapreduce.reduce.shuffle.fetch.retry.interval-ms to clearly indicate this is an 
interval only applicable for fetch retry.  Similarly 
mapreduce.reduce.shuffle.fetch.timeout-ms should be 
mapreduce.reduce.shuffle.fetch.retry.timeout-ms.
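
Roughly what the renamed keys and their defaults could look like as constants. 
This is only a sketch: the constant names are my guess at the 
DEFAULT_SHUFFLE_FETCH_* naming, the default values are placeholders rather than 
the patch's values, and java.util.Properties stands in for the job 
configuration so the snippet is self-contained.

```java
import java.util.Properties;

// Hypothetical holder; in the patch these would live in MRJobConfig.
final class ShuffleFetchRetryKeys {
    static final String SHUFFLE_FETCH_RETRY_INTERVAL_MS =
        "mapreduce.reduce.shuffle.fetch.retry.interval-ms";
    // Placeholder default, not taken from the patch.
    static final long DEFAULT_SHUFFLE_FETCH_RETRY_INTERVAL_MS = 1000L;

    static final String SHUFFLE_FETCH_RETRY_TIMEOUT_MS =
        "mapreduce.reduce.shuffle.fetch.retry.timeout-ms";
    // Placeholder default, not taken from the patch.
    static final long DEFAULT_SHUFFLE_FETCH_RETRY_TIMEOUT_MS = 30000L;

    /** Reads a long-valued key, falling back to the given default. */
    static long getLong(Properties conf, String key, long defaultValue) {
        String value = conf.getProperty(key);
        return value == null ? defaultValue : Long.parseLong(value.trim());
    }

    private ShuffleFetchRetryKeys() {
    }
}
```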

Nit: "which means it haven't retried yet." should be "which means it hasn't 
retried yet."

> Improved shuffle error handling across NM restarts
> --------------------------------------------------
>
>                 Key: MAPREDUCE-5891
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Junping Du
>         Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891.patch
>
>
> To minimize the number of map fetch failures reported by reducers across an 
> NM restart, it would be nice if reducers only reported a fetch failure after 
> trying for a specified period of time to retrieve the data.



