[ https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118727#comment-14118727 ]

Jason Lowe commented on MAPREDUCE-5891:
---------------------------------------

Thanks for updating the patch, Junping.  Comments:

I'm not sure the following log message adds that much value given the exception 
is going to be logged just a bit later.  A different log message appears when 
retry is enabled, so I think this just adds to the log length without adding a 
lot of valuable information.
{code}
+        if (!fetchRetryEnabled) {
+           LOG.warn("Failed to connect to host: " + url + 
+                    " when retry is not enabled");
+           throw e;
+        }
{code}
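
In other words, that block could shrink to just the rethrow (a sketch, 
reusing the names from the patch):
{code}
        if (!fetchRetryEnabled) {
          // The exception is logged by the caller a bit later, so there
          // is no need for an extra warning here.
          throw e;
        }
{code}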

setupConnectionsWithRetry is now inconsistent when it comes to calling 
abortConnect() when {{stopped}} is true.
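
For example, I'd expect both early exits to take the same shape, something 
like this sketch (assuming the existing abortConnect(host, remaining) helper; 
the exact signature may differ):
{code}
      if (stopped) {
        // Return the pending map outputs to the scheduler and close the
        // connection, the same way the other exit path does.
        abortConnect(host, remaining);
        return;
      }
{code}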

Nit:  The following code should simply be {{retryStartTime = 0;}}
{code}
+      if (retryStartTime != 0) {
+        retryStartTime = 0;
+      }
{code}


And to address some of Ming's comments:

bq. Is it possible Fetcher.retryStartTime is set to some old value due to early 
NM host A restart, and thus mark fetcher retry timed out when it later tries to 
handle NM host B restart?

Yes, that is totally possible.  Nice catch, Ming!
@Junping, the code needs to set retryStartTime to zero at the beginning of 
copyFromHost as well, so that remnants from a previous failure don't leak 
into a fresh attempt.
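
A fragment sketch of what that could look like at the top of the method 
(assuming the existing copyFromHost(MapHost) signature):
{code}
  protected void copyFromHost(MapHost host) throws IOException {
    // Reset the retry clock so a stale start time left over from an
    // earlier NM restart on another host cannot make this fresh attempt
    // look like it has already timed out.
    retryStartTime = 0;

    // ... existing connection setup and fetch logic follows unchanged
  }
{code}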

bq. To make sure fetcher doesn't unnecessarily retry for the decommission 
scenario, it seems the assumption is we will have some sort of graceful 
decommission support so that during decommission process the fetcher will still 
be able to get mapper output.

Yes, see YARN-914.  If the node is decommissioned while the application is 
still running and map output still needs to be fetched from it, the RM will 
inform the AM that the node has been removed from the cluster.  The AM will 
then tell the reducers that the map outputs on that node are obsolete and 
will reschedule the corresponding map tasks on other nodes rather than have 
the reducers keep retrying against a decommissioned node.
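
For reference, the reducer side already handles obsolete outputs via map 
completion events; roughly this shape in EventFetcher (from memory, exact 
names may differ):
{code}
      case OBSOLETE:
        // Stop scheduling fetches for an output the AM has declared dead;
        // the map task will be rerun elsewhere.
        scheduler.obsoleteMapOutput(event.getTaskAttemptId());
        break;
{code}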

bq. If we get time to do YARN-1593, that will further reduce the chance of 
shuffle handler restart. Any opinion on that?

Yes, that would be quite nice, although it seems like a very significant 
feature to implement in the 2.6 timeframe.  It creates new security, 
reacquisition, and failure issues, and while it significantly reduces the 
scenarios where the ShuffleHandler needs to be restarted, it doesn't, by 
itself, preclude the need to do so.  For example, if the ShuffleHandler has a 
bugfix that needs to be deployed, we either need the ability for MapReduce to 
request different versions of an aux service and have multiple versions 
running simultaneously, or we need to restart the ShuffleHandler service.  
The latter requires workarounds like this JIRA to avoid potential fetch 
failure storms while the ShuffleHandler service is temporarily down.

> Improved shuffle error handling across NM restarts
> --------------------------------------------------
>
>                 Key: MAPREDUCE-5891
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Junping Du
>         Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
> MAPREDUCE-5891-v3.patch, MAPREDUCE-5891.patch
>
>
> To minimize the number of map fetch failures reported by reducers across an 
> NM restart it would be nice if reducers only reported a fetch failure after 
> trying for a specified period of time to retrieve the data.


