Hi, I'm using 0.17.2.1 and am seeing a reduce hang in the shuffle phase because of an unresponsive node. According to the reduce log (sorry, I didn't keep it around), the reduce was stuck copying map output from a dead node (I cannot ssh to that one). At that point, all maps had already finished. I'm wondering why this slowness does not trigger a failure of the reduce task and mark the corresponding map as failed (even though it has finished), so that the map is rerun on another node and the reduce can proceed.
Thanks, Rong-En Fan