Shuffle tasks getting killed

aniket ray Wed, 22 Sep 2010 21:53:29 -0700

Hi,

I continuously run a series of batch jobs using Hadoop MapReduce. I also
have a managing daemon that moves data around on HDFS, making way for
more jobs to be run.
I use the Capacity Scheduler to schedule many jobs in parallel.


I see an issue on the Hadoop web monitoring UI (port 50030) that I believe
may be causing a performance bottleneck, and I wanted to get more information.

Approximately 10% of the reduce tasks show up as "Killed" in the UI. The
logs say that the killed tasks are in the shuffle phase when they are killed,
but the logs don't show any exception.
My understanding is that these killed tasks are started again and that this
slows down the whole Hadoop job.
What might the possible causes be, and how can I debug this?
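
To put a number on the "approximately 10%" above, the task attempts can be
tallied by final state. This is only a rough sketch: the input is a
hypothetical, hand-made list of (task type, state) pairs, not something read
from Hadoop's logs or API, and killed_fraction is just an illustrative name.

```python
from collections import Counter


def killed_fraction(attempts, task_type="reduce"):
    """Return the share of attempts of the given type whose state is KILLED.

    `attempts` is an iterable of (task_type, state) pairs. This layout is an
    assumption made for illustration; it is not a Hadoop data format.
    """
    states = Counter(state for kind, state in attempts if kind == task_type)
    total = sum(states.values())
    # Avoid dividing by zero when there are no attempts of this type.
    return states["KILLED"] / total if total else 0.0


# Hypothetical sample: 9 successful and 1 killed reduce attempt, plus 5 maps.
sample = (
    [("reduce", "SUCCEEDED")] * 9
    + [("reduce", "KILLED")]
    + [("map", "SUCCEEDED")] * 5
)

print(killed_fraction(sample))
```

Running the same tally per job or per node would show whether the killed
attempts cluster anywhere in particular.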

I have tried both Hadoop 0.20.2 and the latest version of Hadoop from
Yahoo's GitHub.
I've monitored the nodes and there is a lot of free disk space and memory on
all nodes (more than 1 TB free disk and 5 GB free memory at all times on all
nodes).

Since there are no exceptions or any other visible issues, I am finding it
hard to figure out what the problem might be. Could anybody help?

Thanks,
-aniket
