Hi,

I'm using 0.17.2.1 and see a reduce hang in shuffle phase due
to a unresponsive node. From the reduce log (sorry that I didn't
keep it around), it stuck in copying map output from a dead
node (I can not ssh to that one). At that point, all maps are already
finished. I'm wondering why this slowness does not trigger a reduce
task fail and the corresponding map failed (even if it is finished) then
redo the map task on  another node so that the reduce can work.

Thanks,
Rong-En Fan

Reply via email to