[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110947#comment-14110947
 ] 

Junping Du commented on MAPREDUCE-5891:
---------------------------------------

Thanks [~jlowe] for the review! I just updated the patch to address most of 
your previous comments and added a unit test. Please help review it again, 
thanks!

bq. I was under the impression that the copyMapOutput retry could cause a 
reconnect which itself would have retries. If that's not the case then there's 
no issue with nested retries.
If copyMapOutput throws an exception during an NM restart, it first goes to 
reconnect with retries, because in most cases the connect will fail unless we 
tolerate some time waiting for the NM to recover. We could also give up the 
retry in connect, but the logic would be more complex, something like the 
following, which may not be necessary?
{code}
while (...) {
  try {
    failedTasks = copyMapOutput(...);
  } catch (IOException e) {
    try {
      connect(...);
    } catch (IOException ce) {
      // do nothing, back to the loop.
    }
  }
}
{code}
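
For comparison, here is a rough, self-contained sketch of what a single flat 
retry loop with a deadline might look like. This is not the code in the patch: 
FetchRetrySketch, ShuffleConnection, fetchWithRetry, timeoutMs and 
retryIntervalMs are all hypothetical names used only for illustration, and the 
real copyMapOutput and connect take arguments that are omitted here.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch only; not the Fetcher code from the patch. */
class FetchRetrySketch {

    /** Stand-in for the shuffle connection; both calls may fail while the NM is down. */
    interface ShuffleConnection {
        void connect() throws IOException;
        void copyMapOutput() throws IOException;
    }

    /**
     * Retries the copy, reconnecting after each failure, until it succeeds or
     * timeoutMs has elapsed. Returns true on success and false when the caller
     * should report a fetch failure (timeout or interruption).
     */
    static boolean fetchWithRetry(ShuffleConnection conn, long timeoutMs, long retryIntervalMs) {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (true) {
            try {
                conn.copyMapOutput();
                return true;
            } catch (IOException copyFailure) {
                // Give up only once the tolerated waiting period is over.
                if (System.nanoTime() - deadline >= 0) {
                    return false;
                }
                try {
                    Thread.sleep(retryIntervalMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return false;
                }
                try {
                    conn.connect();
                } catch (IOException connectFailure) {
                    // Do nothing, back to the loop: the next copy attempt
                    // fails again and the deadline is re-checked.
                }
            }
        }
    }
}
```

In this sketch the only retry policy lives in the outer loop, so a failed 
connect does not start a second, nested round of retries.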

> Improved shuffle error handling across NM restarts
> --------------------------------------------------
>
>                 Key: MAPREDUCE-5891
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Junping Du
>         Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891.patch
>
>
> To minimize the number of map fetch failures reported by reducers across an 
> NM restart it would be nice if reducers only reported a fetch failure after 
> trying for a specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
