Spot on!

The machines have 16GB RAM, with 1050MB for nine mappers, 340MB for two
reducers and only 60MB for the DataNode and TaskTracker daemons. I had
already accounted for Java processes taking more physical RAM than their
-Xmx, but it seems an edge case caused the mappers to take a bit more RAM
than usual. When the reducers then started (all at the same time), there
was not enough memory left.
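
Just to spell out the arithmetic for the archives (reading the figures
above as per-slot heaps; the property names below are the usual 1.x ones
and may differ in your version):

  mapred.map.child.java.opts              = -Xmx1050m
  mapred.reduce.child.java.opts           = -Xmx340m
  mapred.tasktracker.map.tasks.maximum    = 9    # 9 x 1050 MB = 9450 MB
  mapred.tasktracker.reduce.tasks.maximum = 2    # 2 x  340 MB =  680 MB
  # DataNode + TaskTracker heaps                             ~    60 MB
  # committed heap per node                                  ~ 10 GB of 16 GB
  #
  # Fine on paper, but each child JVM's resident size sits well above
  # its -Xmx (permgen, thread stacks, direct/native buffers), so nine
  # busy mappers plus two reducers starting at once can still push the
  # node over the edge.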

Thanks!
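
P.S. Exit status 137 is 128 + 9, i.e. the child was SIGKILLed. When the
kernel OOM killer is the one sending it, it leaves a trace in the kernel
log of the node that ran the attempt, so a quick check is something like
(exact wording varies per kernel):

  dmesg | grep -iE 'out of memory|killed process'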

> Markus,
> 
> Thanks for clarifying! What heap size does this task launch with, and what
> are the machine RAM and TaskTracker slot counts? Can you rule out the Linux OOM
> killer taking part here?
> 
> On 27-Dec-2011, at 1:50 AM, Markus Jelsma wrote:
> > Hi,
> > 
> > They were indeed reported as failed and speculative execution is disabled
> > because this particular job is not side-effect free. We've seen this
> > before but it's rare.
> > 
> > Thanks
> > 
> >> Exit code 137 would mean a SIGKILL. Are you positive it's reported as a
> >> failed task and not as a killed task? If not, it may just be a result of
> >> speculative reducer execution at work and is nothing to worry about.
> >> 
> >> On 27-Dec-2011, at 1:14 AM, Markus Jelsma wrote:
> >>> Hi,
> >>> 
> >>> We sometimes see reducers fail just when all mappers are finishing. All
> >>> mappers finish roughly at the same time. The reducers only dump the
> >>> following exception:
> >>> 
> >>> java.lang.Throwable: Child Error
> >>> 
> >>>   at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
> >>> 
> >>> Caused by: java.io.IOException: Task process exit with nonzero status
> >>> of 137.
> >>> 
> >>>   at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
> >>> 
> >>> The reducer's own log output also shows nothing that gives a clue; this
> >>> is the last part of the log:
> >>> 
> >>> 2011-12-26 19:35:19,116 INFO org.apache.hadoop.io.compress.CodecPool:
> >>> Got brand-new decompressor
> >>> 2011-12-26 19:35:19,117 INFO org.apache.hadoop.io.compress.CodecPool:
> >>> Got brand-new decompressor
> >>> 2011-12-26 19:35:19,117 INFO org.apache.hadoop.io.compress.CodecPool:
> >>> Got brand-new decompressor
> >>> 2011-12-26 19:35:19,120 INFO org.apache.hadoop.mapred.ReduceTask:
> >>> attempt_201112261420_0006_r_000009_0 Thread started: Thread for merging
> >>> on-disk files
> >>> 2011-12-26 19:35:19,120 INFO org.apache.hadoop.mapred.ReduceTask:
> >>> attempt_201112261420_0006_r_000009_0 Thread waiting: Thread for merging
> >>> on-disk files
> >>> 2011-12-26 19:35:19,121 INFO org.apache.hadoop.mapred.ReduceTask:
> >>> attempt_201112261420_0006_r_000009_0 Thread started: Thread for merging
> >>> in memory files
> >>> 2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask:
> >>> attempt_201112261420_0006_r_000009_0 Need another 50 map output(s)
> >>> where 0 is already in progress
> >>> 2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask:
> >>> attempt_201112261420_0006_r_000009_0 Thread started: Thread for polling
> >>> Map Completion Events
> >>> 2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask:
> >>> attempt_201112261420_0006_r_000009_0 Scheduled 0 outputs (0 slow hosts
> >>> and0 dup hosts)
> >>> 2011-12-26 19:35:24,124 INFO org.apache.hadoop.mapred.ReduceTask:
> >>> attempt_201112261420_0006_r_000009_0 Scheduled 2 outputs (0 slow hosts
> >>> and0 dup hosts)
> >>> 2011-12-26 19:35:25,805 INFO org.apache.hadoop.mapred.ReduceTask:
> >>> attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts
> >>> and0 dup hosts)
> >>> 2011-12-26 19:36:21,578 INFO org.apache.hadoop.mapred.ReduceTask:
> >>> attempt_201112261420_0006_r_000009_0 Need another 47 map output(s)
> >>> where 0 is already in progress
> >>> 2011-12-26 19:36:21,593 INFO org.apache.hadoop.mapred.ReduceTask:
> >>> attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts
> >>> and0 dup hosts)
> >>> 2011-12-26 19:36:42,412 INFO org.apache.hadoop.mapred.ReduceTask:
> >>> attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts
> >>> and0 dup hosts)
> >>> 
> >>> Is there any advice?
> >>> Thanks
