Re: slow copy makes reduce hang

2008-09-18 Thread Rong-en Fan
This time, I set the task timeout to 10 minutes via

  -jobconf mapred.task.timeout=60
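
For reference, mapred.task.timeout is expressed in milliseconds (and 0
disables it), so 10 minutes corresponds to 600000, while a value of 60
would mean 60 ms. A minimal sketch of the conversion, assuming Python is
at hand; the helper name task_timeout_jobconf is made up for
illustration and is not part of Hadoop:

```python
def task_timeout_jobconf(minutes: float) -> str:
    """Build the mapred.task.timeout setting for a timeout given in minutes."""
    if minutes < 0:
        raise ValueError("timeout must be non-negative")
    # The property is specified in milliseconds; 0 disables the timeout.
    millis = int(round(minutes * 60 * 1000))
    return f"mapred.task.timeout={millis}"


# Print the argument pair to pass to the streaming job.
print("-jobconf", task_timeout_jobconf(10))
```

The printed pair can be pasted directly into the streaming command line.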

However, I still see this "hang" at the shuffle stage, and many
messages like the following appear in the log:

2008-09-19 12:34:02,289 INFO org.apache.hadoop.mapred.ReduceTask:
task_200809190308_0007_r_01_1 Need 6 map output(s)
2008-09-19 12:34:02,290 INFO org.apache.hadoop.mapred.ReduceTask:
task_200809190308_0007_r_01_1: Got 0 new map-outputs & 0 obsolete
map-outputs from tasktracker and 0 map-outputs from previous failures
2008-09-19 12:34:02,290 INFO org.apache.hadoop.mapred.ReduceTask:
task_200809190308_0007_r_01_1 Got 6 known map output location(s);
scheduling...
2008-09-19 12:34:02,290 INFO org.apache.hadoop.mapred.ReduceTask:
task_200809190308_0007_r_01_1 Scheduled 0 of 6 known outputs (6
slow hosts and 0 dup hosts)
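
To spot this pattern in a longer log, something like the following
could be used. It is only a sketch: the regular expression is written
against the message text quoted above, and the function name
find_stalled_rounds and the "stalled" criterion (nothing scheduled, every
known output on a slow host) are my own assumptions, not anything
Hadoop defines.

```python
import re

# Matches "<task> Scheduled X of Y known outputs (A slow hosts and B dup hosts)".
SCHEDULED_RE = re.compile(
    r"(?P<task>task_\S+) Scheduled (?P<scheduled>\d+) of (?P<known>\d+) "
    r"known outputs \((?P<slow>\d+) slow hosts and (?P<dup>\d+) dup hosts\)"
)


def find_stalled_rounds(log_text):
    """Return (task, known_outputs) for each round where nothing was scheduled."""
    # Log lines may be wrapped (as in this mail), so collapse all whitespace first.
    flat = " ".join(log_text.split())
    stalled = []
    for m in SCHEDULED_RE.finditer(flat):
        scheduled = int(m["scheduled"])
        known = int(m["known"])
        slow = int(m["slow"])
        # Assumed criterion: zero scheduled and all known outputs on slow hosts.
        if known > 0 and scheduled == 0 and slow == known:
            stalled.append((m["task"], known))
    return stalled


sample = """2008-09-19 12:34:02,290 INFO org.apache.hadoop.mapred.ReduceTask:
task_200809190308_0007_r_01_1 Scheduled 0 of 6 known outputs (6
slow hosts and 0 dup hosts)"""
print(find_stalled_rounds(sample))
```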

When fetching map output from one misbehaving node (it actually has a
dead disk), the HTTP daemon returns a 500 Internal Server Error.

It seems to me that the reducer is stuck in an infinite retry loop. Is
this behavior fixed in 0.18.x, or are there configuration parameters
that I should tune?
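
To make the expected behavior concrete, here is a minimal sketch of a
bounded retry. This is not Hadoop's shuffle code; fetch_with_limit,
always_500, the host name "badnode", and the limit of 3 are all
hypothetical and only illustrate giving up after repeated failures
instead of retrying forever.

```python
def fetch_with_limit(fetch, host, max_failures=3):
    """Call fetch(host) until it succeeds or max_failures attempts have failed.

    Returns (ok, attempts).
    """
    for attempt in range(1, max_failures + 1):
        try:
            fetch(host)
            return True, attempt
        except IOError:
            # Treat any I/O error (e.g. an HTTP 500 response) as a failed attempt.
            continue
    return False, max_failures


def always_500(host):
    """Stand-in for a node whose HTTP daemon always answers with a 500."""
    raise IOError(f"HTTP 500 from {host}")


ok, attempts = fetch_with_limit(always_500, "badnode")
if not ok:
    # At this point the map output would need to be regenerated on another node.
    print(f"giving up on badnode after {attempts} attempts")
```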

Thanks,
Rong-En Fan

On Fri, Sep 19, 2008 at 9:42 AM, Rong-en Fan <[EMAIL PROTECTED]> wrote:
> Replying to myself: I'm using streaming and the task timeout was set
> to 0, which disables the timeout, so that's why.
>
> On Fri, Sep 19, 2008 at 3:34 AM, Rong-en Fan <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> I'm using 0.17.2.1 and see a reduce hang in the shuffle phase due
>> to an unresponsive node. According to the reduce log (sorry, I didn't
>> keep it around), it got stuck copying map output from a dead
>> node (I cannot ssh to that one). At that point, all maps had already
>> finished. I'm wondering why this slowness does not cause the reduce
>> task to fail and the corresponding map to be marked as failed (even
>> though it had finished), so that the map task is redone on another
>> node and the reduce can proceed.
>>
>> Thanks,
>> Rong-En Fan
>>
>


Re: slow copy makes reduce hang

2008-09-18 Thread Rong-en Fan
Replying to myself: I'm using streaming and the task timeout was set
to 0, which disables the timeout, so that's why.

On Fri, Sep 19, 2008 at 3:34 AM, Rong-en Fan <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I'm using 0.17.2.1 and see a reduce hang in the shuffle phase due
> to an unresponsive node. According to the reduce log (sorry, I didn't
> keep it around), it got stuck copying map output from a dead
> node (I cannot ssh to that one). At that point, all maps had already
> finished. I'm wondering why this slowness does not cause the reduce
> task to fail and the corresponding map to be marked as failed (even
> though it had finished), so that the map task is redone on another
> node and the reduce can proceed.
>
> Thanks,
> Rong-En Fan
>


slow copy makes reduce hang

2008-09-18 Thread Rong-en Fan
Hi,

I'm using 0.17.2.1 and see a reduce hang in the shuffle phase due
to an unresponsive node. According to the reduce log (sorry, I didn't
keep it around), it got stuck copying map output from a dead
node (I cannot ssh to that one). At that point, all maps had already
finished. I'm wondering why this slowness does not cause the reduce
task to fail and the corresponding map to be marked as failed (even
though it had finished), so that the map task is redone on another
node and the reduce can proceed.

Thanks,
Rong-En Fan