Re: Shuffle tasks getting killed
I'm glad it helped, Aniket. I would recommend that you start working on performance improvements with your network infrastructure and the balance of data across your logical racks.

Cliff

On Fri, Sep 24, 2010 at 12:12 AM, aniket ray aniket@gmail.com wrote:
> Hi Cliff, thanks, it did turn out to be speculative execution. [...]
Re: Shuffle tasks getting killed
Aniket, I wonder if these tasks were run as speculative execution. Have you been able to determine whether the job runs successfully?

HTH,
Cliff

On Thu, Sep 23, 2010 at 12:52 AM, aniket ray aniket@gmail.com wrote:
> Hi, I continuously run a series of batch jobs using Hadoop MapReduce. [...]
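Cliff's hunch rests on how speculative execution works: when a task attempt looks like a straggler, the JobTracker launches a backup copy on another node; whichever copy finishes first wins, and the loser is reported as "Killed" (not "Failed") in the UI. A toy sketch of that idea (not Hadoop's actual scheduler code; the threshold and speedup numbers are made up for illustration):

```python
def run_wave(task_times, slow_threshold=2.0, speculative=True, backup_speedup=0.5):
    """Simulate one wave of reduce tasks.

    task_times: runtime of each task on its original node.
    With speculative execution on, any task slower than slow_threshold
    gets a backup copy (assumed to run backup_speedup times as long);
    the slower of the two copies is killed.
    Returns (wave completion time, number of killed attempts).
    """
    killed = 0
    finish = []
    for t in task_times:
        if speculative and t > slow_threshold:
            backup = t * backup_speedup   # backup attempt on a (hopefully) faster node
            finish.append(min(t, backup)) # the faster copy commits its output
            killed += 1                   # the losing copy shows up as "Killed"
        else:
            finish.append(t)
    return max(finish), killed

# One straggler: the wave finishes sooner, at the cost of one killed attempt.
print(run_wave([1.0, 1.5, 6.0]))                    # (3.0, 1)
print(run_wave([1.0, 1.5, 6.0], speculative=False)) # (6.0, 0)
```

The point of the sketch: killed attempts are the *price* of speculative execution, not evidence of a problem, as long as the winning copies shorten the wave overall.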
Re: Shuffle tasks getting killed
Hi Cliff,

Thanks, it did turn out to be speculative execution. When I turned it off, no more tasks were killed, but the performance degraded, so my initial assumptions were incorrect after all. I guess I'll have to look at other ways to improve performance. Thanks for the help.

-aniket

On Thu, Sep 23, 2010 at 5:14 PM, cliff palmer palmercl...@gmail.com wrote:
> Aniket, I wonder if these tasks were run as speculative execution. [...]
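For reference, in Hadoop 0.20.x speculative execution is controlled by two configuration properties (both default to true); this is how one would turn it off cluster-wide in mapred-site.xml, or per job via the job configuration:

```xml
<!-- mapred-site.xml (or per-job configuration) -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
```

The same switches are available programmatically on JobConf via setMapSpeculativeExecution(false) and setReduceSpeculativeExecution(false). As Aniket found, disabling it removes the killed attempts but can make jobs slower, since stragglers no longer get backup copies.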
Shuffle tasks getting killed
Hi,

I continuously run a series of batch jobs using Hadoop MapReduce. I also have a managing daemon that moves data around on HDFS, making way for more jobs to be run. I use the capacity scheduler to schedule many jobs in parallel.

I see an issue on the Hadoop web monitoring UI (port 50030) which I believe may be causing a performance bottleneck, and I wanted to get more information. Approximately 10% of the reduce tasks show up as Killed in the UI. The logs say that the killed tasks are in the shuffle phase when they are killed, but they don't show any exception. My understanding is that these killed tasks are started again, and this slows down the whole Hadoop job. I was wondering what the possible issues may be and how to debug this.

I have tried both Hadoop 0.20.2 and the latest version of Hadoop from Yahoo's GitHub. I've monitored the nodes, and there is plenty of free disk space and memory on all of them (more than 1 TB of free disk and 5 GB of free memory at all times on all nodes). Since there are no exceptions or other visible issues, I am finding it hard to figure out what the problem might be. Could anybody help?

Thanks,
-aniket