I'm encountering a sporadic error while running MapReduce jobs; it
shows up in the console output as follows:

12/08/21 14:56:05 INFO mapred.JobClient: Task Id :
attempt_201208211430_0001_m_003538_0, Status : FAILED
java.lang.Throwable: Child Error
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 126.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

12/08/21 14:56:05 WARN mapred.JobClient: Error reading task
outputhttp://<hostname_removed>:50060/tasklog?plaintext=true&attemptid=attempt_201208211430_0001_m_003538_0&filter=stdout
12/08/21 14:56:05 WARN mapred.JobClient: Error reading task
outputhttp://<hostname_removed>:50060/tasklog?plaintext=true&attemptid=attempt_201208211430_0001_m_003538_0&filter=stderr

The conditions look exactly like those described in:
https://issues.apache.org/jira/browse/MAPREDUCE-4003

Unfortunately, that issue is marked as closed for Apache Hadoop
version 1.0.3, which is exactly the version I'm running when I hit
this.

There does seem to be a correlation between the frequency of these
errors and the number of concurrent map tasks being executed; however,
the hardware resources on the cluster do not appear to be near their
limits. I'm assuming there is a maladjusted knob somewhere that is
causing this error, but I haven't found it.

I did find this discussion
(https://groups.google.com/a/cloudera.org/d/topic/cdh-user/NlhvHapf3pk/discussion)
on the CDH users list describing the exact same problem, and the advice
there was to increase the value of the mapred.child.ulimit setting.
However, I initially had this value unset, which, if my research is
correct, should mean it is unlimited. I then set it to 3 GB (3x my
setting for mapred.map.child.java.opts), and that still did not resolve
the problem. Finally, out of frustration, I just added a zero to the
end, so the value is now 31457280 (the setting's unit is KB), i.e.
30 GB. I'm still having the problem.
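
For concreteness, here is roughly the relevant portion of my
mapred-site.xml as it stands now (the -Xmx1g in
mapred.map.child.java.opts is approximate; it's just what the 3x
relationship above works out to):

  <property>
    <name>mapred.map.child.java.opts</name>
    <value>-Xmx1g</value>
  </property>
  <property>
    <!-- virtual memory limit for child processes, in KB; 31457280 KB = 30 GB -->
    <name>mapred.child.ulimit</name>
    <value>31457280</value>
  </property>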

Is anybody else seeing this issue, or does anyone have an idea for a
workaround? Right now my workaround is to set the number of failures
allowed before a tasktracker is blacklisted very high, but this has the
unintended side effect of taking a very long time to evict legitimately
messed-up tasktrackers. If this error is indicative of some other
configuration problem, I'd like to try to resolve it.
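
For reference, this is how I'm applying that workaround; I believe the
relevant per-job knob is mapred.max.tracker.failures (the number of
task failures on a tasktracker before that job stops scheduling tasks
on it), and the value here is just an illustrative "very high" number:

  <property>
    <!-- task failures tolerated per job before a tasktracker is blacklisted for that job -->
    <name>mapred.max.tracker.failures</name>
    <value>100</value>
  </property>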

Ideas? Or should I re-open the JIRA?

Thank you for your time,
Matt
