[ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635736#action_12635736 ]
Devaraj Das commented on HADOOP-4163:
-------------------------------------
Today, if the copier thread (ReduceTask.ReduceCopier.MapOutputCopier.run())
throws a Throwable, it is logged and ignored. I am wondering whether it makes
sense to treat all exceptions except IOExceptions (mostly due to network
issues) as fatal. Here is one thought:
Rename mergeThrowable to shuffleThrowable. In the copier thread, we could set
shuffleThrowable when a Throwable is caught (IOException is already caught
separately). Every place that sets mergeThrowable today would set
shuffleThrowable instead. The loop inside fetchOutputs could then check whether
shuffleThrowable is non-null.
When fetchOutputs returns false, we could check whether shuffleThrowable is an
instance of Error and, if so, rethrow it. Otherwise, we could wrap it in an
IOException. Doing it this way would mean that we call umbilical.fsError in
exactly one place: Child.main().
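
To make the proposal concrete, here is a rough, standalone sketch of the
control flow described above. It is not the ReduceTask code or a patch: the
class ShuffleFailureSketch, the CopyAttempt interface, and runCopier are
hypothetical names invented for illustration. Only shuffleThrowable and
fetchOutputs come from the discussion, and an AtomicReference is assumed here
as one simple way to share the first failure across threads.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

class ShuffleFailureSketch {
    /** Hypothetical stand-in for one map-output copy attempt. */
    interface CopyAttempt { void copy() throws Exception; }

    // Plays the role of the proposed shuffleThrowable field; keeps the first failure.
    private final AtomicReference<Throwable> shuffleThrowable = new AtomicReference<>();

    /** Copier thread body: IOExceptions are tolerated, anything else is recorded as fatal. */
    void runCopier(CopyAttempt attempt) {
        try {
            attempt.copy();
        } catch (IOException ioe) {
            // Mostly network trouble: log it and leave the fetch to be retried.
            System.err.println("copy failed, will retry: " + ioe);
        } catch (Throwable t) {
            shuffleThrowable.compareAndSet(null, t);
        }
    }

    /** Returns false as soon as a fatal failure has been recorded. */
    boolean fetchOutputs(List<CopyAttempt> attempts) throws InterruptedException {
        for (CopyAttempt attempt : attempts) {
            Thread copier = new Thread(() -> runCopier(attempt));
            copier.start();
            copier.join();
            if (shuffleThrowable.get() != null) {
                return false;
            }
        }
        return true;
    }

    /** Caller side: rethrow an Error as-is, wrap anything else in an IOException. */
    void run(List<CopyAttempt> attempts) throws IOException, InterruptedException {
        if (!fetchOutputs(attempts)) {
            Throwable t = shuffleThrowable.get();
            if (t instanceof Error) {
                throw (Error) t;
            }
            throw new IOException("The reduce copier failed", t);
        }
    }

    public static void main(String[] args) throws Exception {
        CopyAttempt flaky = () -> { throw new IOException("connection reset"); };
        CopyAttempt broken = () -> { throw new IllegalStateException("bad state"); };
        new ShuffleFailureSketch().run(Arrays.asList(flaky)); // completes normally
        try {
            new ShuffleFailureSketch().run(Arrays.asList(flaky, broken));
        } catch (IOException e) {
            System.out.println("task fails: " + e.getCause());
        }
    }
}
```

In this sketch an Error such as FSError would propagate unchanged out of
run(), so a single top-level handler could be the only place that reports it.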
But I am slightly apprehensive about the implications of making this change
this late in the game. Thoughts?
> If a reducer failed at shuffling stage, the task should fail, not just
> logging an exception
> -------------------------------------------------------------------------------------------
>
> Key: HADOOP-4163
> URL: https://issues.apache.org/jira/browse/HADOOP-4163
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.17.1
> Reporter: Runping Qi
> Assignee: Sharad Agarwal
> Priority: Blocker
> Fix For: 0.19.0
>
> Attachments: 4163_v1.patch, 4163_v2.patch
>
>
> I saw a reducer stuck at the shuffling stage, with the following exception
> logged in the log file:
> 2008-08-30 00:16:23,265 ERROR org.apache.hadoop.mapred.ReduceTask: Map output copy failure: org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
>     at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:199)
>     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>     at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>     at java.io.FilterOutputStream.close(FilterOutputStream.java:140)
>     at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:59)
>     at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:79)
>     at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.close(ChecksumFileSystem.java:332)
>     at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:59)
>     at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:79)
>     at org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:185)
>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:815)
>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:764)
> Caused by: java.io.IOException: No space left on device
>     at java.io.FileOutputStream.writeBytes(Native Method)
>     at java.io.FileOutputStream.write(FileOutputStream.java:260)
>     at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:197)
>     ... 11 more
> 2008-08-30 00:16:23,320 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
> java.io.IOException: task_200808291851_0001_r_000023_0The reduce copier failed
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:329)
>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2122)
> The task should have died.