[ https://issues.apache.org/jira/browse/HADOOP-9184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546890#comment-13546890 ]
Jeremy Karn commented on HADOOP-9184:
-------------------------------------

I'm not sure what's wrong with the patch. It seems to apply fine when I do it locally.

> Some reducers failing to write final output file to s3.
> -------------------------------------------------------
>
>                 Key: HADOOP-9184
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9184
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 0.20.2
>            Reporter: Jeremy Karn
>         Attachments: example.pig, HADOOP-9184-branch-0.20.patch,
>                      hadoop-9184.patch, task_log.txt
>
>
> We had a Hadoop job that was running 100 reducers, with most of the reducers
> expected to write out an empty file. When the final output went to an S3
> bucket, we sometimes found that a final part file was missing.
> This happened in approximately 1 job in 3 (so approximately 1 reducer out
> of 300 was failing to output its data properly). I've attached the pig script
> we were using to reproduce the bug.
>
> After an in-depth look and instrumenting the code, we traced the problem to
> moveTaskOutputs in FileOutputCommitter.
>
> The code there looked like:
> {code}
> if (fs.isFile(taskOutput)) {
>   … do stuff …
> } else if(fs.getFileStatus(taskOutput).isDir()) {
>   … do stuff …
> }
> {code}
>
> What we saw is that for the problem jobs neither branch was being
> exercised. I've attached the task log of our instrumented code. In this
> version we added an else statement and printed out the line "THIS SEEMS LIKE
> WE SHOULD NEVER GET HERE …".
>
> The root cause of this seems to be an eventual consistency issue with S3.
> You can see in the log that the first time moveTaskOutputs is called it finds
> that the taskOutput is a directory. It goes into the isDir() branch and
> successfully retrieves the list of files in that directory from S3 (in this
> case just one file). This triggers a recursive call to moveTaskOutputs for
> the file found in the directory. But in this pass through moveTaskOutputs the
> temporary output file can't be found, so both branches of the above
> if statement are skipped and the temporary file is never moved to the
> final output location.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
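
The silent fall-through described in the issue can be modeled without Hadoop. The sketch below is only an illustration, not the attached patch: `Store`, `FlakyStore`, `classifyOriginal`, and `classifyStrict` are hypothetical names, `Store.isDir` stands in for `fs.getFileStatus(taskOutput).isDir()`, and the "misses" counter is an assumed stand-in for a listing that is briefly ahead of what a lookup can see. It contrasts the quoted if / else-if with one possible defensive variant that retries a bounded number of times and then fails loudly instead of skipping the file.

```java
final class MoveTaskOutputSketch {

    /** Hypothetical stand-in for the two file system checks quoted in the report. */
    interface Store {
        boolean isFile(String path);
        boolean isDir(String path); // models fs.getFileStatus(path).isDir()
    }

    /**
     * Assumed model of an eventually consistent store: the path is a file,
     * but the first few isFile lookups fail to see it.
     */
    static final class FlakyStore implements Store {
        private int missesLeft;

        FlakyStore(int misses) {
            this.missesLeft = misses;
        }

        @Override
        public boolean isFile(String path) {
            if (missesLeft > 0) {
                missesLeft--;
                return false; // lookup misses the object
            }
            return true;
        }

        @Override
        public boolean isDir(String path) {
            return false; // the path is never a directory in this model
        }
    }

    /** Mirrors the quoted if / else-if and reports which branch ran. */
    static String classifyOriginal(Store fs, String path) {
        if (fs.isFile(path)) {
            return "file";
        } else if (fs.isDir(path)) {
            return "dir";
        }
        return "skipped"; // neither branch ran; the output is silently left behind
    }

    /**
     * Defensive variant (an assumption, not the actual fix): re-check up to
     * 'attempts' times, then throw so the task fails visibly. Real code would
     * probably wait between attempts and throw an IOException.
     */
    static String classifyStrict(Store fs, String path, int attempts) {
        for (int i = 0; i < attempts; i++) {
            if (fs.isFile(path)) {
                return "file";
            }
            if (fs.isDir(path)) {
                return "dir";
            }
        }
        throw new IllegalStateException(
            "Task output is neither a file nor a directory: " + path);
    }
}
```

With a store that misses once, `classifyOriginal(new FlakyStore(1), "part-00000")` returns "skipped", while `classifyStrict(new FlakyStore(1), "part-00000", 3)` returns "file" on its second attempt. If the store keeps missing for longer than the retry budget, the strict variant throws, which turns a missing part file into a visible task failure.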