[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14538071#comment-14538071
 ] 

Junping Du commented on MAPREDUCE-6361:
---------------------------------------

An NPE is thrown in copyFailed() at ShuffleSchedulerImpl.java:267:
{code}
boolean hostFail = hostFailures.get(hostname).get() > getMaxHostFailures() ? true : false;
{code} 
This means hostFailures does not contain the hostname that just failed. That is 
unexpected, because we always call hostFailed() to put the host into 
hostFailures before calling copyFailed():
{code}
        scheduler.hostFailed(host.getHostName());
        for(TaskAttemptID left: failedTasks) {
          scheduler.copyFailed(left, host, true, false);
        }
{code}
Although hostFailed() and copyFailed() are both synchronized methods (as is 
copySucceeded()), each call acquires and releases the lock separately, so the 
two calls together are not atomic. The NPE is therefore still possible, and 
this looks like the only explanation: another thread calls copySucceeded() on 
the same host (for a different map output), clearing that host's entry from 
hostFailures, between this thread's calls to hostFailed() and copyFailed() 
while it is handling one map output failure.
We need to fix this concurrency issue to get rid of the NPE, which currently 
fails the map output copy immediately, without any retry.
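
The interleaving described above can be reproduced deterministically in a small sketch. HostFailureSketch and its method names are hypothetical stand-ins, not the real ShuffleSchedulerImpl. The sketch assumes that copySucceeded() removes the host from hostFailures, and it uses AtomicInteger for the counter in place of whatever type the real map holds. The null-checked lookup is only one possible way to avoid the NPE, not necessarily the fix that will be committed:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the scheduler; NOT the real ShuffleSchedulerImpl.
final class HostFailureSketch {
    private final Map<String, AtomicInteger> hostFailures = new HashMap<>();
    private final int maxHostFailures;

    HostFailureSketch(int maxHostFailures) {
        this.maxHostFailures = maxHostFailures;
    }

    // Records one more failure for the host.
    synchronized void hostFailed(String hostname) {
        hostFailures.computeIfAbsent(hostname, h -> new AtomicInteger())
                    .incrementAndGet();
    }

    // Assumption: a successful copy clears the host's failure record.
    synchronized void copySucceeded(String hostname) {
        hostFailures.remove(hostname);
    }

    // Mirrors the failing expression: throws NPE when the entry is gone.
    synchronized boolean copyFailedUnsafe(String hostname) {
        return hostFailures.get(hostname).get() > maxHostFailures;
    }

    // Null-safe variant: a missing entry is treated as "no host failure".
    synchronized boolean copyFailedSafe(String hostname) {
        AtomicInteger failures = hostFailures.get(hostname);
        return failures != null && failures.get() > maxHostFailures;
    }

    public static void main(String[] args) {
        HostFailureSketch scheduler = new HostFailureSketch(2);
        scheduler.hostFailed("host1");    // fetcher A records the failure
        scheduler.copySucceeded("host1"); // fetcher B succeeds in between
        try {
            scheduler.copyFailedUnsafe("host1"); // fetcher A continues
        } catch (NullPointerException e) {
            System.out.println("unsafe lookup threw NPE");
        }
        System.out.println("safe lookup: " + scheduler.copyFailedSafe("host1"));
    }
}
```

Each method holds the lock only for its own duration, so synchronizing the methods individually does not protect the hostFailed()/copyFailed() sequence as a whole. An alternative to the null check would be to perform both steps inside a single synchronized call.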

> NPE issue in shuffle caused by concurrent issue between copySucceeded() in 
> one thread and copyFailed() in another thread on the same host
> -----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6361
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6361
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Junping Du
>            Assignee: Junping Du
>
> The failure in log:
> 2015-05-08 21:00:00,513 WARN [main] org.apache.hadoop.mapred.YarnChild: 
> Exception running child : 
> org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in 
> shuffle in fetcher#25
>          at 
> org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
>          at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
>          at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
>          at java.security.AccessController.doPrivileged(Native Method)
>          at javax.security.auth.Subject.doAs(Subject.java:415)
>          at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>          at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.NullPointerException
>          at 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:267)
>          at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:308)
>          at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
