Hi, I continuously run a series of batch jobs using Hadoop MapReduce. I also have a managing daemon that moves data around on HDFS to make way for more jobs to run. I use the Capacity Scheduler to schedule many jobs in parallel.
I see an issue on the Hadoop web monitoring UI at port 50030 that I believe may be causing a performance bottleneck, and I wanted to get more information. Approximately 10% of the reduce tasks show up as "Killed" in the UI. The logs say that these tasks are in the shuffle phase when they are killed, but the logs don't show any exception. My understanding is that the killed tasks are started again, and that this slows down the whole Hadoop job.

What might be causing this, and how can I debug it? I have tried both Hadoop 0.20.2 and the latest version of Hadoop from Yahoo's GitHub. I've monitored the nodes, and there is plenty of free disk space and memory on all of them (more than 1 TB of free disk and 5 GB of free memory at all times on every node). Since there are no exceptions or any other visible issues, I am finding it hard to figure out what the problem might be. Could anybody help?

Thanks,
-aniket