Hi Matt,

You are most likely hitting this issue:
https://issues.apache.org/jira/browse/MAPREDUCE-2374 

There is a single-line fix for this issue. See the latest patch attached to the 
above JIRA entry.

-Shrinivas

-----Original Message-----
From: Matt Kennedy [mailto:stinkym...@gmail.com] 
Sent: Tuesday, August 21, 2012 2:15 PM
To: user@hadoop.apache.org
Subject: Map Reduce "Child Error" task failure

I'm encountering a sporadic error while running MapReduce jobs. It shows up in 
the console output as follows:

12/08/21 14:56:05 INFO mapred.JobClient: Task Id :
attempt_201208211430_0001_m_003538_0, Status : FAILED
java.lang.Throwable: Child Error
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 126.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

12/08/21 14:56:05 WARN mapred.JobClient: Error reading task 
outputhttp://<hostname_removed>:50060/tasklog?plaintext=true&attemptid=attempt_201208211430_0001_m_003538_0&filter=stdout
12/08/21 14:56:05 WARN mapred.JobClient: Error reading task 
outputhttp://<hostname_removed>:50060/tasklog?plaintext=true&attemptid=attempt_201208211430_0001_m_003538_0&filter=stderr
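
For context on the "nonzero status of 126" in the trace above: in POSIX shells, 
exit status 126 conventionally means a command was found but could not be 
executed. The sketch below only reproduces that shell convention with a 
hypothetical script named demo_task.sh. It does not show what Hadoop does 
internally, and it is an assumption, not something established in this thread, 
that the failing task attempts hit the same condition.

```python
import os
import shlex
import stat
import subprocess
import tempfile


def exit_status_of_non_executable_script():
    """Run a script that lacks the execute bit through sh and return the status."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical script name, used only for this illustration.
        script = os.path.join(tmp, "demo_task.sh")
        with open(script, "w") as f:
            f.write("#!/bin/sh\nexit 0\n")
        # Read/write for the owner only: no execute permission for anyone.
        os.chmod(script, stat.S_IRUSR | stat.S_IWUSR)
        # The shell finds the file but cannot execute it.
        result = subprocess.run(
            ["sh", "-c", shlex.quote(script)],
            stderr=subprocess.DEVNULL,
        )
        return result.returncode


if __name__ == "__main__":
    print(exit_status_of_non_executable_script())
```

A status of 126 therefore points at launching the child process rather than at 
the map task's own code.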

The conditions look exactly like those described in:
https://issues.apache.org/jira/browse/MAPREDUCE-4003

Unfortunately, that issue is marked as closed for Apache Hadoop version 1.0.3, 
but that is the version on which I'm seeing the problem.

There does seem to be a correlation between the frequency of these errors and 
the number of concurrent map tasks being executed. However, the hardware 
resources on the cluster do not appear to be near their limits. I'm assuming 
that some configuration setting is maladjusted and causing this error, but I 
haven't found it.

I did find this discussion
(https://groups.google.com/a/cloudera.org/d/topic/cdh-user/NlhvHapf3pk/discussion)
on the CDH users list describing the exact same problem, and the advice was to 
increase the value of the mapred.child.ulimit setting. However, I initially had 
this value unset, which should mean that it is unlimited if my research is 
correct. I then set the value to 3 GB (3x my setting for 
mapred.map.child.java.opts) and that did not resolve the problem. Finally, out 
of frustration, I just added a zero at the end, so the value is now 31457280 
(the unit for the setting is KB), which is 30 GB. I'm still having the problem.
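
As a sanity check on the units above, here is a minimal sketch of the KB/GB 
arithmetic for the mapred.child.ulimit values mentioned. The helper names are 
hypothetical, and it assumes 1 GB = 1024 * 1024 KB.

```python
KB_PER_GB = 1024 * 1024  # assumption: binary units, 1 GB = 1,048,576 KB


def gb_to_kb(gb):
    """Convert gigabytes to kilobytes, the unit the message says the setting uses."""
    return gb * KB_PER_GB


def kb_to_gb(kb):
    """Convert kilobytes back to gigabytes."""
    return kb / KB_PER_GB


if __name__ == "__main__":
    first_attempt_kb = gb_to_kb(3)            # the 3 GB setting, in KB
    second_attempt_kb = first_attempt_kb * 10  # "added a zero at the end"
    print(first_attempt_kb, second_attempt_kb, kb_to_gb(second_attempt_kb))
```

Under that assumption, 3 GB is 3145728 KB, and appending a zero gives 31457280 
KB, which matches the 30 GB figure quoted above.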

Is anybody else seeing this issue, or does anyone have an idea for a workaround?
Right now my workaround is to set the number of allowed failures very high 
before a tasktracker is blacklisted, but this has the unintended side effect of 
taking a very long time to evict tasktrackers that are genuinely broken. If this 
error is indicative of some other configuration problem, I'd like to try to 
resolve it.

Ideas? Or should I re-open the JIRA?

Thank you for your time,
Matt

