Hi Cliff,

Thanks, it did turn out to be speculative execution. When I turned it off, no more tasks were killed, but performance degraded.
So my initial assumptions were incorrect after all. I guess I'll have to look at other ways to improve performance.

Thanks for the help.
-aniket

On Thu, Sep 23, 2010 at 5:14 PM, cliff palmer <palmercl...@gmail.com> wrote:
> Aniket, I wonder if these tasks were run as Speculative Execution. Have you
> been able to determine whether the job runs successfully?
> HTH
> Cliff
>
> On Thu, Sep 23, 2010 at 12:52 AM, aniket ray <aniket....@gmail.com> wrote:
>
> > Hi,
> >
> > I continuously run a series of batch jobs using Hadoop MapReduce. I also
> > have a managing daemon that moves data around on HDFS, making way for
> > more jobs to be run.
> > I use the capacity scheduler to run many jobs in parallel.
> >
> > I see an issue on the Hadoop web monitoring UI at port 50030 which I
> > believe may be causing a performance bottleneck, and I wanted to get
> > more information.
> >
> > Approximately 10% of the reduce tasks show up as "Killed" in the UI. The
> > logs say that the killed tasks are in the shuffle phase when they are
> > killed, but the logs don't show any exception.
> > My understanding is that these killed tasks would be started again, and
> > this slows down the whole Hadoop job.
> > I was wondering what the possible issues may be and how to debug this.
> >
> > I have tried both Hadoop 0.20.2 and the latest version of Hadoop from
> > Yahoo's GitHub.
> > I've monitored the nodes, and there is a lot of free disk space and
> > memory on all nodes (more than 1 TB free disk and 5 GB free memory at
> > all times on all nodes).
> >
> > Since there are no exceptions or other visible issues, I am finding it
> > hard to figure out what the problem might be. Could anybody help?
> >
> > Thanks,
> > -aniket
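
[For anyone finding this thread later: the thread doesn't show the exact settings used, but in Hadoop 0.20 speculative execution is typically controlled by the two properties below, set cluster-wide in mapred-site.xml or per job via JobConf. A minimal sketch of disabling it, as aniket describes:]

```xml
<!-- mapred-site.xml: turn off speculative re-execution of slow tasks.
     Note: killed speculative attempts are normal and usually harmless;
     disabling this can slow jobs down when a node is genuinely lagging. -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
```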