Re: Hadoop JobTracker Hanging

2010-06-24 Thread Bobby Dennett
Thanks for the latest round of suggestions. We will definitely check
out compressed object pointers and are looking into what we can do
regarding the JT history. As I mentioned previously, we are working on
getting stronger servers for the NN/JT node and the secondary NN node
(similar to workaround (c) below). Engineering is also working on
improving one of our processes that accesses a large number of
potentially small files, to try to reduce our maximum number of map
tasks (similar to workaround (b) below).


On a side note, our JT process has been running since Saturday morning 
after increasing the heap size to 6,000 MB... so far, so good. 
Hopefully, I didn't just jinx it ;o)


-Bobby

On 6/22/10 10:12 AM, Rahul Jain wrote:

Two issues that were fixed in 0.21.0 can cause the JobTracker to run
out of memory:

https://issues.apache.org/jira/browse/MAPREDUCE-1316

and

https://issues.apache.org/jira/browse/MAPREDUCE-841

We've been hit by MAPREDUCE-841 (large jobConf objects combined with a
large number of tasks, especially when running Pig jobs) a number of
times in Hadoop 0.20.1 and 0.20.2+.

The current workarounds are:

a) Be careful about what you store in the jobConf object.
b) Understand and control the largest number of mappers/reducers that can
be queued at any time for processing.
c) Provide a lot of RAM to the JobTracker.

We use (c) to save on debugging man-hours most of the time :).

-Rahul


Re: Hadoop JobTracker Hanging

2010-06-22 Thread Steve Loughran

Bobby Dennett wrote:

Thanks all for your suggestions (please note that Tan is my co-worker;
we are both working to try and resolve this issue)... we experienced
another hang this weekend and increased the HADOOP_HEAPSIZE setting to
6000 (MB) as we do periodically see java.lang.OutOfMemoryError: Java
heap space errors in the jobtracker log. We are now looking into the
resource allocation of the master node/server to ensure we aren't
experiencing any issues due to the heap size increase. In parallel, we
are also working on building beefier servers -- stronger CPUs, 3x more
memory -- for the node running the primary namenode and jobtracker
processes as well as for the secondary namenode.

Any additional suggestions you might have for troubleshooting/resolving
this hanging jobtracker issue would be greatly appreciated.


Have you tried
 * using compressed object pointers on the Java 6 server JVM? They
reduce space.

 * bolder: the JRockit JVM. It is not officially supported in Hadoop,
but I liked using it right up until Oracle stopped giving away the
updates with security patches. It has a much better heap, and it has had
compressed pointers for a long time (== more stable code).


I'm surprised it's the JT that is OOM-ing; anecdotally it's the NN and
secondary NN that use more, especially if the files are many and the
block size small. The JT should not be tracking that much data over time.


Re: Hadoop JobTracker Hanging

2010-06-22 Thread Allen Wittenauer

On Jun 22, 2010, at 3:17 AM, Steve Loughran wrote:
 
 I'm surprised its the JT that is OOM-ing, anecdotally its the NN and 2ary NN 
 that use more, especially if the files are many and the blocksize small. the 
 JT should not be tracking that much data over time

Pre-0.20.2, there are definitely bugs with how the JT history is handled, 
causing some memory leakage.

The other fairly common condition is having way too many tasks per job.
This is usually an indication that your data layout is way out of whack (too
little data in too many files) or that you should be using
CombineFileInputFormat.
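
To make the "too little data in too many files" point concrete, here is a rough Python sketch comparing the number of map tasks when every file gets its own split against packing small files into combined splits. The greedy packing is a simplification for illustration only, not the actual CombineFileInputFormat algorithm, and the file sizes and the 64 MB limit are made-up numbers.

```python
from typing import List

MB = 1024 * 1024


def one_split_per_block(file_sizes: List[int], block_size: int) -> int:
    """Map tasks when each file is split on its own, block by block.

    Assumption: an empty file still costs one split.
    """
    # -(-a // b) is integer ceiling division.
    return sum(max(1, -(-size // block_size)) for size in file_sizes)


def combined_splits(file_sizes: List[int], max_split_size: int) -> int:
    """Map tasks when whole files are packed greedily into shared splits.

    Simplification: a file larger than max_split_size gets one split of
    its own instead of being broken up by block.
    """
    splits = 0
    current = 0  # bytes already packed into the split being built
    for size in file_sizes:
        if current and current + size > max_split_size:
            splits += 1
            current = 0
        current += size
    if current:
        splits += 1
    return splits


small_files = [1 * MB] * 10_000  # hypothetical layout: 10,000 files of 1 MB
print(one_split_per_block(small_files, 64 * MB))  # 10000
print(combined_splits(small_files, 64 * MB))      # 157
```

With this hypothetical layout the JobTracker would have to track 10,000 map tasks in the first case and 157 in the second, for the same amount of input data.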

Re: Hadoop JobTracker Hanging

2010-06-22 Thread Hemanth Yamijala
There was also https://issues.apache.org/jira/browse/MAPREDUCE-1316
whose cause hit clusters at Yahoo! very badly last year. The situation
was particularly noticeable in the face of lots of jobs with failed
tasks and a specific fix that enabled OutOfBand heartbeats. The latter
(i.e. the OOB heartbeats patch) is not in 0.20 AFAIK, but still the
failed tasks could be causing it.

Thanks
Hemanth




Re: Hadoop JobTracker Hanging

2010-06-22 Thread Rahul Jain
Two issues that were fixed in 0.21.0 can cause the JobTracker to run
out of memory:

https://issues.apache.org/jira/browse/MAPREDUCE-1316

and

https://issues.apache.org/jira/browse/MAPREDUCE-841

We've been hit by MAPREDUCE-841 (large jobConf objects combined with a
large number of tasks, especially when running Pig jobs) a number of
times in Hadoop 0.20.1 and 0.20.2+.

The current workarounds are:

a) Be careful about what you store in the jobConf object.
b) Understand and control the largest number of mappers/reducers that can
be queued at any time for processing.
c) Provide a lot of RAM to the JobTracker.

We use (c) to save on debugging man-hours most of the time :).

-Rahul
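
For workarounds (a) and (b), a back-of-envelope estimate can show how quickly configuration size times task count adds up. The Python sketch below assumes, purely for illustration, that every queued task retains one copy of its job's configuration plus a fixed overhead. That model, the QueuedJob fields, and all numbers are assumptions rather than a description of JobTracker internals; measure a real heap dump before trusting any such figure.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class QueuedJob:
    name: str
    map_tasks: int
    reduce_tasks: int
    conf_bytes: int  # hypothetical: measured size of the job's configuration


def estimated_heap_bytes(jobs: Iterable[QueuedJob],
                         per_task_overhead_bytes: int) -> int:
    """Worst-case estimate under the stated assumption:
    each task holds conf_bytes + per_task_overhead_bytes."""
    return sum(
        (job.map_tasks + job.reduce_tasks)
        * (job.conf_bytes + per_task_overhead_bytes)
        for job in jobs
    )


def fits_in_heap(jobs: Iterable[QueuedJob],
                 per_task_overhead_bytes: int,
                 heap_budget_bytes: int) -> bool:
    """True if the estimate stays within the heap share set aside for queued jobs."""
    return estimated_heap_bytes(jobs, per_task_overhead_bytes) <= heap_budget_bytes


queue = [
    QueuedJob("job_a", map_tasks=10, reduce_tasks=2, conf_bytes=1000),
    QueuedJob("job_b", map_tasks=3, reduce_tasks=1, conf_bytes=500),
]
print(estimated_heap_bytes(queue, per_task_overhead_bytes=24))  # 14384
```

The useful part is the shape of the formula: shrinking the configuration (a) and capping the queued task count (b) both attack the same product.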


RE: Hadoop JobTracker Hanging

2010-06-21 Thread Bobby Dennett
Thanks all for your suggestions (please note that Tan is my co-worker;
we are both working to try and resolve this issue)... we experienced
another hang this weekend and increased the HADOOP_HEAPSIZE setting to
6000 (MB) as we do periodically see java.lang.OutOfMemoryError: Java
heap space errors in the jobtracker log. We are now looking into the
resource allocation of the master node/server to ensure we aren't
experiencing any issues due to the heap size increase. In parallel, we
are also working on building beefier servers -- stronger CPUs, 3x more
memory -- for the node running the primary namenode and jobtracker
processes as well as for the secondary namenode.

Any additional suggestions you might have for troubleshooting/resolving
this hanging jobtracker issue would be greatly appreciated.

Please note that I had previously started a similar topic on Get
Satisfaction
(http://www.getsatisfaction.com/cloudera/topics/looking_for_troubleshooting_tips_guidance_for_hanging_jobtracker)
where Todd is helping and the output of jstack and jmap can be found.

Thanks,
-Bobby
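
On the question of whether the master node can afford the larger heap: a quick sanity check is to add up the maximum heaps of every daemon on the box and compare that with physical memory. The Python sketch below uses hypothetical numbers (16 GB of RAM, a 1 GB reserve for the operating system) and assumes the namenode and jobtracker each get the full 6000 MB maximum, which would be the case if both pick up the same HADOOP_HEAPSIZE setting.

```python
from typing import Dict


def heap_headroom_mb(physical_ram_mb: int,
                     daemon_heaps_mb: Dict[str, int],
                     os_reserve_mb: int = 1024) -> int:
    """RAM left after the OS reserve and every daemon's maximum heap.

    A negative result means the daemons could be pushed into swap once
    their heaps fill up. JVMs also use memory outside the heap (thread
    stacks, permanent generation), so some positive headroom is needed.
    """
    return physical_ram_mb - os_reserve_mb - sum(daemon_heaps_mb.values())


# Hypothetical master node running both daemons with a 6000 MB heap each.
daemons = {"namenode": 6000, "jobtracker": 6000}
print(heap_headroom_mb(16384, daemons))  # 3360
print(heap_headroom_mb(8192, daemons))   # -4832
```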


RE: Hadoop JobTracker Hanging

2010-06-18 Thread Li, Tan
Thanks for your suggestions, James.
I will try that.
Tan


RE: Hadoop JobTracker Hanging

2010-06-18 Thread Li, Tan
Todd,
I will try to increase the HADOOP_HEAPSIZE to see if that helps.
Tan


Hadoop JobTracker Hanging

2010-06-17 Thread Li, Tan
Folks,

I need some help with the job tracker.
I am running two Hadoop clusters (with 30+ nodes) on Ubuntu. One is on
version 0.19.1 (Apache) and the other is on version 0.20.1+169.68
(Cloudera).

I have the same problem on both clusters: the job tracker hangs almost
once a day.
Symptom: the job tracker web page cannot be loaded, the command hadoop job
-list hangs, and the jobtracker.log file stops being updated.
I cannot find any useful information in the job tracker log file.
The symptom goes away after I restart the job tracker, and the cluster runs
fine for another 20+ hours. Then the symptom comes back.

I do not have any serious problems with HDFS.

Any ideas about the causes? Is there any configuration parameter that I can
change to reduce the chances of the problem?
Any tips for diagnosing and troubleshooting?

Thanks!

Tan
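
Since one of the symptoms is that hadoop job -list never returns, that command can double as a cheap hang detector. Below is a minimal Python sketch of such a probe. The function name, the status strings, and the 60-second timeout in the usage note are arbitrary choices for illustration, not anything Hadoop provides.

```python
import subprocess
from typing import List


def probe(cmd: List[str], timeout_s: float) -> str:
    """Run cmd and classify the outcome.

    Returns "ok" if it exits with status 0 in time, "failed" if it exits
    with a non-zero status, and "hung" if it is still running after
    timeout_s seconds (the child process is killed in that case).
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "hung"
    return "ok" if result.returncode == 0 else "failed"
```

From cron, something like probe(["hadoop", "job", "-list"], timeout_s=60) could trigger an alert, or a jstack capture, whenever it returns "hung", so that the evidence is collected before anyone restarts the job tracker.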





Re: Hadoop JobTracker Hanging

2010-06-17 Thread Ted Yu
Is upgrading to hadoop-0.20.2+228 possible?

Use jstack to get a stack trace of the job tracker process when this
happens again.
Use jmap to get shared object memory maps or heap memory details.
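
A small helper can make it easier to grab both dumps while the job tracker is still hung. This Python sketch runs jstack and jmap -histo against a given process id and saves the output. The file naming, the output directory, and the 120-second timeout are assumptions, and both tools must be on the PATH and run as the user that owns the JVM.

```python
import subprocess
import time
from pathlib import Path
from typing import Dict, List


def diagnostic_commands(pid: int) -> Dict[str, List[str]]:
    """Commands to run against the hung JVM, keyed by a label for the output file."""
    return {
        "jstack": ["jstack", str(pid)],              # thread stack traces
        "jmap-histo": ["jmap", "-histo", str(pid)],  # per-class object histogram
    }


def capture(pid: int, out_dir: str) -> List[Path]:
    """Run each command and write its output to a timestamped file in out_dir."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    written = []
    for label, cmd in diagnostic_commands(pid).items():
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
        path = target / f"{label}-{pid}-{stamp}.txt"
        path.write_text(result.stdout + result.stderr)
        written.append(path)
    return written
```

Taking two or three jstack snapshots a few seconds apart usually helps to tell a deadlock from a process that is merely slow, for example one stuck in garbage collection.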


Re: Hadoop JobTracker Hanging

2010-06-17 Thread Todd Lipcon
+1, jstack is crucial to solve these kinds of issues. Also, which scheduler
are you using?

Thanks
-Todd

-- 
Todd Lipcon
Software Engineer, Cloudera


RE: Hadoop JobTracker Hanging

2010-06-17 Thread Li, Tan
Thanks, Todd.
I will try that and let you know the result.
Tan


RE: Hadoop JobTracker Hanging

2010-06-17 Thread Li, Tan
Thanks for your tips, Ted.
All of our QA is done on 0.20.1, and I have a feeling it is not
version-related.
I will run jstack and jmap once the problem happens again, and I may need
your help to analyze the result.

Tan


Re: Hadoop JobTracker Hanging

2010-06-17 Thread Todd Lipcon
Li, just to narrow your search, in my experience this is usually caused by
an OOME on the JT. Check the logs for OutOfMemoryError and see what you
find. You may need to configure it to retain fewer jobs in memory, or up
your heap.

-Todd
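
Checking the log for out-of-memory errors is easy to script. The Python sketch below scans any iterable of log lines for java.lang.OutOfMemoryError and reports the line number and the detail text (for example "Java heap space"). The log path in the usage note is a placeholder, not a known location.

```python
import re
from typing import Iterable, List, Tuple

# Matches the exception name and, if present, the detail after the colon.
OOM_PATTERN = re.compile(r"java\.lang\.OutOfMemoryError(?::\s*(?P<detail>.*))?")


def find_oom_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, detail) for every line that mentions an OutOfMemoryError."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = OOM_PATTERN.search(line)
        if match:
            hits.append((number, (match.group("detail") or "").strip()))
    return hits


sample = [
    "2010-06-19 03:12:01 INFO heartbeat received\n",
    "java.lang.OutOfMemoryError: Java heap space\n",
    "    at java.util.Arrays.copyOf(Arrays.java)\n",
]
print(find_oom_lines(sample))  # [(2, 'Java heap space')]
```

Against a real file this would be used as: with open("/path/to/jobtracker.log") as log: print(find_oom_lines(log)). The first hit is usually the interesting one, because a JVM that has already run out of heap can fail in unrelated places afterwards.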

-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Hadoop JobTracker Hanging

2010-06-17 Thread James Seigel
Up the memory from the default to about 4x the default (heap setting).  This 
should make it better I’d think!

We’d been having the same issue...I believe this fixed it.

James
