Re: Shuffle tasks getting killed

2010-09-24 Thread cliff palmer
I'm glad it helped, Aniket. I would recommend that you start working on
performance improvements with your network infrastructure and the balance of
data across your logical racks.
Cliff
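
For the rack-balance piece, one starting point is making sure rack awareness is configured at all. Below is a minimal sketch of the kind of rack-mapping script that can be wired in via the `topology.script.file.name` property in Hadoop 0.20; the IP ranges and rack names are hypothetical placeholders, not from this thread:

```shell
# Sketch of a rack-mapping script (wired in via topology.script.file.name
# in Hadoop 0.20). The IP ranges and rack names below are made up.
resolve_rack() {
  for host in "$@"; do
    case "$host" in
      10.1.*) echo "/rack1" ;;
      10.2.*) echo "/rack2" ;;
      *)      echo "/default-rack" ;;
    esac
  done
}

# Hadoop invokes the script with datanode addresses as arguments and
# expects one rack path per argument, in order.
resolve_rack 10.1.0.7 10.2.0.9 192.168.0.3
```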

On Fri, Sep 24, 2010 at 12:12 AM, aniket ray aniket@gmail.com wrote:

 Hi Cliff,

 Thanks, it did indeed turn out to be speculative execution. When I turned it
 off, no more tasks were killed, but the performance degraded.
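
 For anyone finding this thread later: in Hadoop 0.20, speculative execution
 can be switched off per task type with the standard properties below, either
 in mapred-site.xml or on a per-job JobConf; a sketch:

 ```xml
 <!-- mapred-site.xml: disable speculative (backup) task attempts -->
 <property>
   <name>mapred.map.tasks.speculative.execution</name>
   <value>false</value>
 </property>
 <property>
   <name>mapred.reduce.tasks.speculative.execution</name>
   <value>false</value>
 </property>
 ```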

 So my initial assumptions were incorrect after all. I guess I'll have to
 look at other ways to improve performance.

 Thanks for the help.
 -aniket

 On Thu, Sep 23, 2010 at 5:14 PM, cliff palmer palmercl...@gmail.com wrote:

  Aniket, I wonder if these tasks were run as Speculative Execution. Have you
  been able to determine whether the job runs successfully?
  HTH
  Cliff
 
  On Thu, Sep 23, 2010 at 12:52 AM, aniket ray aniket@gmail.com wrote:
 
   Hi,
  
   I continuously run a series of batch jobs using Hadoop MapReduce. I also
   have a managing daemon that moves data around on HDFS, making way for
   more jobs to be run.
   I use the capacity scheduler to schedule many jobs in parallel.
  
   I see an issue on the Hadoop web monitoring UI (port 50030) that I believe
   may be causing a performance bottleneck, and I wanted to get more
   information.
  
   Approximately 10% of the reduce tasks show up as "Killed" in the UI. The
   logs say that the killed tasks are in the shuffle phase when they are
   killed, but the logs don't show any exception.
   My understanding is that these killed tasks would be started again and
   that this slows down the whole Hadoop job.
   I was wondering what the possible issues may be and how to debug this.
  
   I have tried both Hadoop 0.20.2 and the latest version of Hadoop from
   Yahoo's GitHub.
   I've monitored the nodes, and there is a lot of free disk space and memory
   on all nodes (more than 1 TB of free disk and 5 GB of free memory at all
   times on all nodes).
  
   Since there are no exceptions or any other visible issues, I am finding it
   hard to figure out what the problem might be. Could anybody help?
  
   Thanks,
   -aniket
  
 



Re: Shuffle tasks getting killed

2010-09-23 Thread cliff palmer
Aniket, I wonder if these tasks were run as Speculative Execution. Have you
been able to determine whether the job runs successfully?
HTH
Cliff

On Thu, Sep 23, 2010 at 12:52 AM, aniket ray aniket@gmail.com wrote:

 Hi,

 I continuously run a series of batch jobs using Hadoop MapReduce. I also
 have a managing daemon that moves data around on HDFS, making way for
 more jobs to be run.
 I use the capacity scheduler to schedule many jobs in parallel.

 I see an issue on the Hadoop web monitoring UI (port 50030) that I believe
 may be causing a performance bottleneck, and I wanted to get more
 information.

 Approximately 10% of the reduce tasks show up as "Killed" in the UI. The
 logs say that the killed tasks are in the shuffle phase when they are
 killed, but the logs don't show any exception.
 My understanding is that these killed tasks would be started again and that
 this slows down the whole Hadoop job.
 I was wondering what the possible issues may be and how to debug this.

 I have tried both Hadoop 0.20.2 and the latest version of Hadoop from
 Yahoo's GitHub.
 I've monitored the nodes, and there is a lot of free disk space and memory
 on all nodes (more than 1 TB of free disk and 5 GB of free memory at all
 times on all nodes).

 Since there are no exceptions or any other visible issues, I am finding it
 hard to figure out what the problem might be. Could anybody help?

 Thanks,
 -aniket



Re: Shuffle tasks getting killed

2010-09-23 Thread aniket ray
Hi Cliff,

Thanks, it did indeed turn out to be speculative execution. When I turned it
off, no more tasks were killed, but the performance degraded.

So my initial assumptions were incorrect after all. I guess I'll have to
look at other ways to improve performance.

Thanks for the help.
-aniket

On Thu, Sep 23, 2010 at 5:14 PM, cliff palmer palmercl...@gmail.com wrote:

 Aniket, I wonder if these tasks were run as Speculative Execution. Have you
 been able to determine whether the job runs successfully?
 HTH
 Cliff

 On Thu, Sep 23, 2010 at 12:52 AM, aniket ray aniket@gmail.com wrote:

  Hi,
 
  I continuously run a series of batch jobs using Hadoop MapReduce. I also
  have a managing daemon that moves data around on HDFS, making way for
  more jobs to be run.
  I use the capacity scheduler to schedule many jobs in parallel.
 
  I see an issue on the Hadoop web monitoring UI (port 50030) that I believe
  may be causing a performance bottleneck, and I wanted to get more
  information.
 
  Approximately 10% of the reduce tasks show up as "Killed" in the UI. The
  logs say that the killed tasks are in the shuffle phase when they are
  killed, but the logs don't show any exception.
  My understanding is that these killed tasks would be started again and that
  this slows down the whole Hadoop job.
  I was wondering what the possible issues may be and how to debug this.
 
  I have tried both Hadoop 0.20.2 and the latest version of Hadoop from
  Yahoo's GitHub.
  I've monitored the nodes, and there is a lot of free disk space and memory
  on all nodes (more than 1 TB of free disk and 5 GB of free memory at all
  times on all nodes).
 
  Since there are no exceptions or any other visible issues, I am finding it
  hard to figure out what the problem might be. Could anybody help?
 
  Thanks,
  -aniket
 



Shuffle tasks getting killed

2010-09-22 Thread aniket ray
Hi,

I continuously run a series of batch jobs using Hadoop MapReduce. I also
have a managing daemon that moves data around on HDFS, making way for
more jobs to be run.
I use the capacity scheduler to schedule many jobs in parallel.
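
For context, wiring up the capacity scheduler in Hadoop 0.20 involves pointing
the JobTracker at it and giving each queue a capacity; a minimal sketch (the
queue name and percentage below are placeholders, not this cluster's values):

```xml
<!-- mapred-site.xml: use the capacity scheduler -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
</property>

<!-- capacity-scheduler.xml: give the "default" queue 100% capacity -->
<property>
  <name>mapred.capacity-scheduler.queue.default.capacity</name>
  <value>100</value>
</property>
```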

I see an issue on the Hadoop web monitoring UI (port 50030) that I believe
may be causing a performance bottleneck, and I wanted to get more information.

Approximately 10% of the reduce tasks show up as "Killed" in the UI. The
logs say that the killed tasks are in the shuffle phase when they are killed,
but the logs don't show any exception.
My understanding is that these killed tasks would be started again and that
this slows down the whole Hadoop job.
I was wondering what the possible issues may be and how to debug this.
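
A side note on the "started again" reading: when speculative execution is on, a
Killed attempt is usually a redundant backup copy that lost a race against its
twin, not a failure that forces a restart. A toy, non-Hadoop sketch of that
race (all numbers and the straggler threshold are invented for illustration):

```python
# Toy model of speculative execution: a task that runs past a straggler
# threshold gets a backup attempt; the faster attempt wins and commits,
# while the slower duplicate is reported as "Killed".
def run_with_speculation(primary_times, backup_times, threshold=20):
    finish_times = []
    killed = 0
    for primary, backup in zip(primary_times, backup_times):
        if primary > threshold:                  # straggler: launch a backup
            finish_times.append(min(primary, backup))
            killed += 1                          # the loser shows up as Killed
        else:
            finish_times.append(primary)         # no backup needed
    return finish_times, killed

# One straggler (its primary would take 100s); the backup finishes in 14s,
# so the job completes when the slowest *winning* attempt does.
finish, killed = run_with_speculation([10, 12, 100], [15, 30, 14])
print(finish, killed)  # -> [10, 12, 14] 1
```

Only stragglers get backups, which is consistent with a modest fraction of
attempts (like the ~10% above) appearing as Killed.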

I have tried both Hadoop 0.20.2 and the latest version of Hadoop from
Yahoo's GitHub.
I've monitored the nodes, and there is a lot of free disk space and memory on
all nodes (more than 1 TB of free disk and 5 GB of free memory at all times
on all nodes).

Since there are no exceptions or any other visible issues, I am finding it
hard to figure out what the problem might be. Could anybody help?

Thanks,
-aniket