This time, I set the task timeout to 10 minutes via -jobconf mapred.task.timeout=600000.
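For reference, a minimal sketch of how that flag is passed to a streaming job (the jar path, input/output paths, and mapper/reducer commands are placeholders, not taken from this thread):

```shell
# 10 minutes expressed in milliseconds, as mapred.task.timeout expects
TIMEOUT_MS=$((10 * 60 * 1000))
echo "computed timeout: ${TIMEOUT_MS} ms"

# Hypothetical streaming invocation (paths/commands are placeholders):
#   hadoop jar contrib/streaming/hadoop-streaming.jar \
#     -jobconf mapred.task.timeout=${TIMEOUT_MS} \
#     -input /path/to/input -output /path/to/output \
#     -mapper ./mapper.sh -reducer ./reducer.sh
```

Note that setting the timeout to 0 disables it entirely, which is what bit me below.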
However, I still see this "hang" at the shuffle stage, and lots of messages like the ones below appear in the log:

2008-09-19 12:34:02,289 INFO org.apache.hadoop.mapred.ReduceTask: task_200809190308_0007_r_000001_1 Need 6 map output(s)
2008-09-19 12:34:02,290 INFO org.apache.hadoop.mapred.ReduceTask: task_200809190308_0007_r_000001_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
2008-09-19 12:34:02,290 INFO org.apache.hadoop.mapred.ReduceTask: task_200809190308_0007_r_000001_1 Got 6 known map output location(s); scheduling...
2008-09-19 12:34:02,290 INFO org.apache.hadoop.mapred.ReduceTask: task_200809190308_0007_r_000001_1 Scheduled 0 of 6 known outputs (6 slow hosts and 0 dup hosts)

When fetching map output from one weird node (it actually has a dead disk), the HTTP daemon returns 500 Internal Server Error. It seems to me that the reducer is stuck in an infinite retry loop... I'm wondering whether this behavior is fixed in 0.18.x, or whether there are some configuration parameters I should tune.

Thanks,
Rong-En Fan

On Fri, Sep 19, 2008 at 9:42 AM, Rong-en Fan <[EMAIL PROTECTED]> wrote:
> Replying to myself: I'm using streaming and the task timeout was set to 0,
> so that's why.
>
> On Fri, Sep 19, 2008 at 3:34 AM, Rong-en Fan <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> I'm using 0.17.2.1 and see a reduce hang in the shuffle phase due
>> to an unresponsive node. From the reduce log (sorry that I didn't
>> keep it around), it got stuck copying map output from a dead
>> node (I cannot ssh to that one). At that point, all maps had already
>> finished. I'm wondering why this slowness does not trigger a reduce
>> task failure, mark the corresponding map as failed (even though it finished),
>> and then redo the map task on another node so that the reduce can proceed.
>>
>> Thanks,
>> Rong-En Fan
>>
>