Markus, Good to know you fixed it now :)
Also consider raising reduce slowstart to a more suitable value, like 80-85% -- this is pretty beneficial in both heap consumption and more performant Reduce slots. The property name is "mapred.reduce.slowstart.completed.maps" and carries a float value (percentage). ( P.s. You may also want to watch this general take on the topic: http://www.cloudera.com/videos/hadoop-world-2011-presentation-video-hadoop-troubleshooting-101 ) On 27-Dec-2011, at 2:29 AM, Markus Jelsma wrote: > Spot on! > > The machines have 16GM RAM with 1050MB for nine mappers and 340MB for two > reducers and only 60MB for datanode and tasktracker daemons. I had already > accounted for Java processes taking more than -Xmx of physical RAM, it seems > an edge case caused the mappers to take a bit more RAM than usual. Then when > the reducers start (all at the same time) it was not enough. > > Thanks! > >> Markus, >> >> Thanks for clarifying! What heap size does this task launch with and what's >> the machine RAM and the TaskTracker slots? Can you rule out the Linux OOM >> killer taking part here? >> >> On 27-Dec-2011, at 1:50 AM, Markus Jelsma wrote: >>> Hi, >>> >>> They were indeed reported as failed and speculative execution is disabled >>> because this particular job is not side-effect free. We've seen this >>> before but it's rare. >>> >>> Thanks >>> >>>> Exit code 137 would mean a SIGKILL. Are you positive its reported as a >>>> failed task and not as a killed task? If not, it may just be a result of >>>> speculative reducer execution at work and is nothing to worry about. >>>> >>>> On 27-Dec-2011, at 1:14 AM, Markus Jelsma wrote: >>>>> Hi, >>>>> >>>>> We sometimes see reducers fail just when all mappers are finishing. All >>>>> mappers finish roughly at the same time. The reducers only dump the >>>>> following exception: >>>>> >>>>> java.lang.Throwable: Child Error >>>>> >>>>> at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271) >>>>> >>>>> Caused by: java.io.IOException: Task process exit with nonzero status >>>>> of 137. >>>>> >>>>> at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258) >>>>> >>>>> The reducers own log output also shows nothing that gives a clue, this >>>>> is the last part of the log: >>>>> >>>>> 2011-12-26 19:35:19,116 INFO org.apache.hadoop.io.compress.CodecPool: >>>>> Got brand-new decompressor >>>>> 2011-12-26 19:35:19,117 INFO org.apache.hadoop.io.compress.CodecPool: >>>>> Got brand-new decompressor >>>>> 2011-12-26 19:35:19,117 INFO org.apache.hadoop.io.compress.CodecPool: >>>>> Got brand-new decompressor >>>>> 2011-12-26 19:35:19,120 INFO org.apache.hadoop.mapred.ReduceTask: >>>>> attempt_201112261420_0006_r_000009_0 Thread started: Thread for merging >>>>> on- disk files >>>>> 2011-12-26 19:35:19,120 INFO org.apache.hadoop.mapred.ReduceTask: >>>>> attempt_201112261420_0006_r_000009_0 Thread waiting: Thread for merging >>>>> on- disk files >>>>> 2011-12-26 19:35:19,121 INFO org.apache.hadoop.mapred.ReduceTask: >>>>> attempt_201112261420_0006_r_000009_0 Thread started: Thread for merging >>>>> in memory files >>>>> 2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask: >>>>> attempt_201112261420_0006_r_000009_0 Need another 50 map output(s) >>>>> where 0 is already in progress >>>>> 2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask: >>>>> attempt_201112261420_0006_r_000009_0 Thread started: Thread for polling >>>>> Map Completion Events >>>>> 2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask: >>>>> attempt_201112261420_0006_r_000009_0 Scheduled 0 outputs (0 slow hosts >>>>> and0 dup hosts) >>>>> 2011-12-26 19:35:24,124 INFO org.apache.hadoop.mapred.ReduceTask: >>>>> attempt_201112261420_0006_r_000009_0 Scheduled 2 outputs (0 slow hosts >>>>> and0 dup hosts) >>>>> 2011-12-26 19:35:25,805 INFO org.apache.hadoop.mapred.ReduceTask: >>>>> attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts >>>>> and0 dup hosts) >>>>> 2011-12-26 19:36:21,578 INFO org.apache.hadoop.mapred.ReduceTask: >>>>> attempt_201112261420_0006_r_000009_0 Need another 47 map output(s) >>>>> where 0 is already in progress >>>>> 2011-12-26 19:36:21,593 INFO org.apache.hadoop.mapred.ReduceTask: >>>>> attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts >>>>> and0 dup hosts) >>>>> 2011-12-26 19:36:42,412 INFO org.apache.hadoop.mapred.ReduceTask: >>>>> attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts >>>>> and0 dup hosts) >>>>> >>>>> Is there any advice? >>>>> Thanks