Re: Reducers fail without messages on 20.205.0

Harsh J Mon, 26 Dec 2011 13:28:21 -0800

Markus,

Good to know you fixed it now :)


Also consider raising reduce slowstart to a more suitable value, like 80-85% -- 
this is pretty beneficial in both heap consumption and more performant Reduce 
slots. The property name is "mapred.reduce.slowstart.completed.maps" and 
carries a float value (percentage).

( P.s. You may also want to watch this general take on the topic: 
http://www.cloudera.com/videos/hadoop-world-2011-presentation-video-hadoop-troubleshooting-101
 )

On 27-Dec-2011, at 2:29 AM, Markus Jelsma wrote:

> Spot on!
> 
> The machines have 16GM RAM with 1050MB for nine mappers and 340MB for two 
> reducers and only 60MB for datanode and tasktracker daemons. I had already 
> accounted for Java processes taking more than -Xmx of physical RAM, it seems 
> an edge case caused the mappers to take a bit more RAM than usual. Then when 
> the reducers start (all at the same time) it was not enough.
> 
> Thanks!
> 
>> Markus,
>> 
>> Thanks for clarifying! What heap size does this task launch with and what's
>> the machine RAM and the TaskTracker slots? Can you rule out the Linux OOM
>> killer taking part here?
>> 
>> On 27-Dec-2011, at 1:50 AM, Markus Jelsma wrote:
>>> Hi,
>>> 
>>> They were indeed reported as failed and speculative execution is disabled
>>> because this particular job is not side-effect free. We've seen this
>>> before but it's rare.
>>> 
>>> Thanks
>>> 
>>>> Exit code 137 would mean a SIGKILL. Are you positive its reported as a
>>>> failed task and not as a killed task? If not, it may just be a result of
>>>> speculative reducer execution at work and is nothing to worry about.
>>>> 
>>>> On 27-Dec-2011, at 1:14 AM, Markus Jelsma wrote:
>>>>> Hi,
>>>>> 
>>>>> We sometimes see reducers fail just when all mappers are finishing. All
>>>>> mappers finish roughly at the same time. The reducers only dump the
>>>>> following exception:
>>>>> 
>>>>> java.lang.Throwable: Child Error
>>>>> 
>>>>>   at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
>>>>> 
>>>>> Caused by: java.io.IOException: Task process exit with nonzero status
>>>>> of 137.
>>>>> 
>>>>>   at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
>>>>> 
>>>>> The reducers own log output also shows nothing that gives a clue, this
>>>>> is the last part of the log:
>>>>> 
>>>>> 2011-12-26 19:35:19,116 INFO org.apache.hadoop.io.compress.CodecPool:
>>>>> Got brand-new decompressor
>>>>> 2011-12-26 19:35:19,117 INFO org.apache.hadoop.io.compress.CodecPool:
>>>>> Got brand-new decompressor
>>>>> 2011-12-26 19:35:19,117 INFO org.apache.hadoop.io.compress.CodecPool:
>>>>> Got brand-new decompressor
>>>>> 2011-12-26 19:35:19,120 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>> attempt_201112261420_0006_r_000009_0 Thread started: Thread for merging
>>>>> on- disk files
>>>>> 2011-12-26 19:35:19,120 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>> attempt_201112261420_0006_r_000009_0 Thread waiting: Thread for merging
>>>>> on- disk files
>>>>> 2011-12-26 19:35:19,121 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>> attempt_201112261420_0006_r_000009_0 Thread started: Thread for merging
>>>>> in memory files
>>>>> 2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>> attempt_201112261420_0006_r_000009_0 Need another 50 map output(s)
>>>>> where 0 is already in progress
>>>>> 2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>> attempt_201112261420_0006_r_000009_0 Thread started: Thread for polling
>>>>> Map Completion Events
>>>>> 2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>> attempt_201112261420_0006_r_000009_0 Scheduled 0 outputs (0 slow hosts
>>>>> and0 dup hosts)
>>>>> 2011-12-26 19:35:24,124 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>> attempt_201112261420_0006_r_000009_0 Scheduled 2 outputs (0 slow hosts
>>>>> and0 dup hosts)
>>>>> 2011-12-26 19:35:25,805 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>> attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts
>>>>> and0 dup hosts)
>>>>> 2011-12-26 19:36:21,578 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>> attempt_201112261420_0006_r_000009_0 Need another 47 map output(s)
>>>>> where 0 is already in progress
>>>>> 2011-12-26 19:36:21,593 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>> attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts
>>>>> and0 dup hosts)
>>>>> 2011-12-26 19:36:42,412 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>> attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts
>>>>> and0 dup hosts)
>>>>> 
>>>>> Is there any advice?
>>>>> Thanks

Re: Reducers fail without messages on 20.205.0

Reply via email to