[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14539594#comment-14539594
 ] 

Junping Du commented on MAPREDUCE-6361:
---------------------------------------

There are basically two ways to fix the race condition here:
1. Extract the following code into a synchronized method, so that copySucceeded() 
is blocked until copyFailed() has finished.
{code}
scheduler.hostFailed(host.getHostName());
for(TaskAttemptID left: failedTasks) {
    scheduler.copyFailed(left, host, true, false);
}
{code}
This would likely hurt shuffle performance, because a failure to fetch map 
output on one thread would block copySucceeded() on other threads for longer.

2. Update copyFailed() to assume that the host's entry in hostFailures may have 
been cleaned up by another thread. In that case, add the host back to 
hostFailures, as if this were the first time the host failed.

I prefer the second option, which is more lightweight. I will deliver a quick 
patch soon.

> NPE issue in shuffle caused by concurrent issue between copySucceeded() in 
> one thread and copyFailed() in another thread on the same host
> -----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6361
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6361
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>
> The failure in log:
> 2015-05-08 21:00:00,513 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#25
>          at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
>          at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
>          at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
>          at java.security.AccessController.doPrivileged(Native Method)
>          at javax.security.auth.Subject.doAs(Subject.java:415)
>          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>          at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.NullPointerException
>          at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:267)
>          at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:308)
>          at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
