Spot on! The machines have 16GB RAM, with 1050MB for each of nine mappers, 340MB for each of two reducers, and only 60MB for the datanode and tasktracker daemons. I had already accounted for Java processes taking more physical RAM than their -Xmx, but it seems an edge case caused the mappers to take a bit more RAM than usual; when the reducers then started (all at the same time), there was not enough left.
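For the record, the budget above works out roughly as follows. This is only a back-of-the-envelope sketch: the 20% over-heap factor is an illustrative assumption, since a JVM's resident set exceeds -Xmx by a workload-dependent amount (thread stacks, native buffers, JVM overhead).

```python
# Rough memory budget for one node, using the numbers from this thread:
# 16 GB RAM, nine mappers at -Xmx1050m, two reducers at -Xmx340m, and
# ~60 MB for the datanode/tasktracker daemons.
total_ram_mb = 16 * 1024

heap_mb = 9 * 1050 + 2 * 340 + 60

# Assumed +20% resident-set overhead beyond -Xmx (illustrative only).
rss_estimate_mb = heap_mb * 12 // 10

print(f"configured heaps:   {heap_mb} MB")
print(f"estimated resident: {rss_estimate_mb} MB of {total_ram_mb} MB")
print(f"nominal headroom:   {total_ram_mb - rss_estimate_mb} MB")
```

On paper this fits, which matches the observation that it takes an edge case (mappers running fatter than usual plus all reducers starting at once) before the kernel runs out of memory.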
Thanks!

> Markus,
>
> Thanks for clarifying! What heap size does this task launch with, and
> what are the machine RAM and the TaskTracker slots? Can you rule out
> the Linux OOM killer taking part here?
>
> On 27-Dec-2011, at 1:50 AM, Markus Jelsma wrote:
> > Hi,
> >
> > They were indeed reported as failed, and speculative execution is
> > disabled because this particular job is not side-effect free. We've
> > seen this before, but it's rare.
> >
> > Thanks
> >
> >> Exit code 137 would mean a SIGKILL. Are you positive it's reported
> >> as a failed task and not as a killed task? If not, it may just be a
> >> result of speculative reducer execution at work and is nothing to
> >> worry about.
> >>
> >> On 27-Dec-2011, at 1:14 AM, Markus Jelsma wrote:
> >>> Hi,
> >>>
> >>> We sometimes see reducers fail just when all mappers are finishing.
> >>> All mappers finish roughly at the same time. The reducers only dump
> >>> the following exception:
> >>>
> >>> java.lang.Throwable: Child Error
> >>>         at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
> >>> Caused by: java.io.IOException: Task process exit with nonzero
> >>> status of 137.
> >>>         at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
> >>>
> >>> The reducer's own log output also shows nothing that gives a clue;
> >>> this is the last part of the log:
> >>>
> >>> 2011-12-26 19:35:19,116 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> >>> 2011-12-26 19:35:19,117 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> >>> 2011-12-26 19:35:19,117 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> >>> 2011-12-26 19:35:19,120 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201112261420_0006_r_000009_0 Thread started: Thread for merging on-disk files
> >>> 2011-12-26 19:35:19,120 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201112261420_0006_r_000009_0 Thread waiting: Thread for merging on-disk files
> >>> 2011-12-26 19:35:19,121 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201112261420_0006_r_000009_0 Thread started: Thread for merging in memory files
> >>> 2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201112261420_0006_r_000009_0 Need another 50 map output(s) where 0 is already in progress
> >>> 2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201112261420_0006_r_000009_0 Thread started: Thread for polling Map Completion Events
> >>> 2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201112261420_0006_r_000009_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
> >>> 2011-12-26 19:35:24,124 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201112261420_0006_r_000009_0 Scheduled 2 outputs (0 slow hosts and0 dup hosts)
> >>> 2011-12-26 19:35:25,805 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
> >>> 2011-12-26 19:36:21,578 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201112261420_0006_r_000009_0 Need another 47 map output(s) where 0 is already in progress
> >>> 2011-12-26 19:36:21,593 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
> >>> 2011-12-26 19:36:42,412 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
> >>>
> >>> Is there any advice?
> >>> Thanks
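For anyone finding this thread later: exit status 137 comes up twice above because 137 = 128 + 9, i.e. the child JVM did not exit on its own but was killed with SIGKILL, and SIGKILL is exactly what the Linux OOM killer sends. A minimal sketch of the arithmetic (the throwaway `sleep` child is only for illustration):

```python
import signal
import subprocess

# Start a throwaway child process and kill it the way the OOM killer
# would: with an uncatchable SIGKILL.
child = subprocess.Popen(["sleep", "60"])
child.send_signal(signal.SIGKILL)
child.wait()

# Python reports death-by-signal as a negative returncode; a shell
# (and Hadoop's TaskRunner) sees it as 128 + signal number instead.
print(child.returncode)        # -9
print(128 + signal.SIGKILL)    # 137, the status in the Child Error above

# To confirm an OOM kill on the affected node, inspect the kernel log
# (message format varies by kernel version), e.g.:
#   dmesg | grep -i -E 'out of memory|killed process'
```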