[ https://issues.apache.org/jira/browse/HADOOP-2486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-2486:
--------------------------------

    Attachment: 2486.patch

Thanks Koji for this excellent bug report! I found a problem. Here's the 
scenario:

Thread1: 
1) The ramfs merge thread is waiting to do InMemoryFileSystem.getFiles()

Thread2:
1) The ReduceTask locks itself.
2) It invokes rename in the ramfs for a file, F1, that it just shuffled.
3) The rename in ChecksumFileSystem gets called and finishes renaming the 
actual file, but before it could rename the checksum file, a thread switch 
happened. Thread1 gets control.

Thread1:
2) It now calls InMemoryFileSystem.getFiles() and initiates a merge. In the 
process of merging, it deletes the ramfs file F1. Note that the checksum file 
is optional; if it is not found, ChecksumFSInputChecker silently ignores it.

At some point later on, Thread2 regains control and tries to rename the 
checksum file. Since the real file is not there anymore, the second call to 
isDirectory in ChecksumFileSystem.rename results in an NPE (as in the reported 
stack trace). The sketch below illustrates the window.
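
For illustration only, here is a minimal sketch of the kind of two-step rename 
that opens the window. The class and helper names are hypothetical; this is 
not the actual ChecksumFileSystem source:

{noformat}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch (not the actual ChecksumFileSystem source): a
// checksummed rename done as two separate renames, with no lock held
// across them, is exactly the kind of window described above.
class RacyChecksummedRename {
  private final FileSystem fs;   // the underlying raw filesystem

  RacyChecksummedRename(FileSystem fs) { this.fs = fs; }

  // Hadoop-style checksum naming: ".<name>.crc" next to the file.
  private Path checksumFile(Path p) {
    return new Path(p.getParent(), "." + p.getName() + ".crc");
  }

  boolean rename(Path src, Path dst) throws IOException {
    boolean ok = fs.rename(src, dst);             // step 1: the data file
    // A thread switch here lets the merge thread delete the renamed
    // file. Step 2 then probes the filesystem for a file whose
    // attributes are gone, which is where the NPE surfaces.
    Path srcCrc = checksumFile(src);
    if (fs.exists(srcCrc) && !fs.isDirectory(dst)) {
      ok &= fs.rename(srcCrc, checksumFile(dst)); // step 2: the checksum file
    }
    return ok;
  }
}
{noformat}

Nothing in that sequence prevents another thread from deleting the data file 
between the two steps.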

Attached is a patch that addresses the above problem: it simply makes the 
merge thread's call to getFiles synchronized on the ReduceTask object, as in 
the sketch below.
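
Roughly, the idea is the following (a minimal sketch with hypothetical names, 
not the patch itself). The copier already holds the ReduceTask lock across the 
rename, so taking the same lock around getFiles() keeps a merge from starting 
mid-rename:

{noformat}
import java.io.IOException;
import org.apache.hadoop.fs.Path;

// Hedged sketch of the patch's idea, with hypothetical names: the
// copier and merge threads synchronize on the same ReduceTask object,
// so a merge can never observe a half-done rename.
class ShuffleSyncSketch {
  // Stand-in for the relevant bits of InMemoryFileSystem.
  interface RamFs {
    boolean rename(Path src, Path dst) throws IOException;
    Path[] getFiles();
  }

  private final Object reduceTask;  // the ReduceTask instance, used as the lock
  private final RamFs ramFs;

  ShuffleSyncSketch(Object reduceTask, RamFs ramFs) {
    this.reduceTask = reduceTask;
    this.ramFs = ramFs;
  }

  // Copier thread: the two-step rename runs entirely under the lock
  // (this is Thread2's step 1 in the scenario above).
  void copyOutput(Path tmp, Path dst) throws IOException {
    synchronized (reduceTask) {
      ramFs.rename(tmp, dst);
    }
  }

  // Merge thread: getFiles() now also takes the lock, so it cannot run
  // (and later delete files) between the two halves of a rename.
  Path[] filesToMerge() {
    synchronized (reduceTask) {
      return ramFs.getFiles();
    }
  }
}
{noformat}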

> Dropping records at reducer.  InMemoryFileSystem NPE.
> -----------------------------------------------------
>
>                 Key: HADOOP-2486
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2486
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.14.3
>            Reporter: Koji Noguchi
>         Attachments: 2486.patch
>
>
> Note: I'm really not sure if this is a bug in my code or in mapred. 
> With my mapreduce job without a combiner, I sometimes see # of total Map 
> output records != # of total Reduce input records. What's weird to me is, 
> when I rerun my code with the exact same input, I usually get the expected 
> #map output recs == #reduce input recs.
> Both jobs finish successfully. No failed tasks. No speculative execution. 
> I ran separate linecount mapred jobs on both the input and the output to see 
> if the counters are reporting the correct numbers. 
> When I looked at all 513 reducer counters, I found a single reducer with 
> different counts between the two runs. 
> The only error that stood out in that reducer's userlog is: 
> {noformat} 
> 2007-12-22 00:19:07,640 INFO org.apache.hadoop.mapred.ReduceTask: 
> task_200712220008_0003_r_000024_0 done copying 
> task_200712220008_0003_m_000288_0 output from qqq856.ppp.com.
> 2007-12-22 00:19:07,640 INFO org.apache.hadoop.mapred.ReduceTask: 
> task_200712220008_0003_r_000024_0 Copying task_200712220008_0003_m_000327_0 
> output from qqq887.ppp.com.
> 2007-12-22 00:19:07,640 ERROR org.apache.hadoop.mapred.ReduceTask: Map output 
> copy failure: java.lang.NullPointerException
>       at 
> org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem$FileAttributes.access$300(InMemoryFileSystem.java:366)
>       at 
> org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem$InMemoryFileStatus.<init>(InMemoryFileSystem.java:380)
>       at 
> org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.getFileStatus(InMemoryFileSystem.java:283)
>       at org.apache.hadoop.fs.FileSystem.isDirectory(FileSystem.java:423)
>       at 
> org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:386)
>       at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:716)
>       at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:637)
> 2007-12-22 00:19:07,641 INFO org.apache.hadoop.mapred.ReduceTask: 
> task_200712220008_0003_r_000024_0 done copying 
> task_200712220008_0003_m_000228_0 output from qqq801.ppp.com.
> 2007-12-22 00:19:07,641 INFO org.apache.hadoop.mapred.ReduceTask: 
> task_200712220008_0003_r_000024_0 Copying task_200712220008_0003_m_000337_0 
> output from qqq841.ppp.com.
> {noformat} 
> Could this error somehow be related to my getting different #s of records? 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
