root cause analysis when jobtracker hang

java8964 java8964 Wed, 06 Feb 2013 18:12:56 -0800

Our cluster on cdh3u4 has the same problem. I think it is caused by a bug 
in the JobTracker, and I believe Cloudera knows about this issue.
Since upgrading to cdh3u5 we haven't hit it again, but I am not sure whether 
the fix is actually confirmed for CDH3U5.
Yong


> Date: Mon, 4 Feb 2013 15:21:18 -0800
> Subject: What to do/check/debug/root cause analysis when jobtracker hang
> From: silvianhad...@gmail.com
> To: user@hadoop.apache.org
> 
> Lately, the jobtracker in one of our production clusters keeps falling
> into a hung state. The 5/10/15-minute load average is around 1;
> in top, the jobtracker process sits at 100% CPU all the time.
> 
> So I went ahead and tried top -H -p jobtracker_pid, and I always see
> one thread that sits at 100% CPU all the time.
> 
> Unless we restart the jobtracker, the hang never goes away.
> 
> I found OOM errors in the jobtracker log file during the hang.
> 
> How can I find out what is really going on in the one and only
> thread that is at 100% CPU?
> 
> How can I prove whether we run out of memory because of the number
> of jobs _OR_ because there is a memory leak on the application side?
> 
> 
> I tried a jstack dump and http://jobtracker:50030/stacks
> 
> I just don't know what I should really look for in the output of those commands.
> 
> The cluster is cdh3u4, on CentOS 6.2, with transparent_hugepage disabled.
> 
> 
> 
> Hopefully this makes sense,
> -P
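
On the question above about the single thread at 100% CPU: the thread ID that top -H shows is decimal, while jstack prints the same native ID in hex as nid=0x... on each thread's header line. A minimal sketch of the lookup follows. The helper names and the sample dump are made up for illustration; a real dump would come from running jstack against the jobtracker PID. If the matching thread turns out to be a JVM GC thread rather than application code, that would be consistent with the OOM messages in the log, though it does not by itself show whether the cause is job volume or a leak.

```python
import re

# Made-up jstack excerpt, for illustration only (not from a real JobTracker).
SAMPLE_DUMP = '''"example-hot-thread" daemon prio=10 tid=0x00007f0000000001 nid=0x1a2b runnable [0x00007f0000001000]
   java.lang.Thread.State: RUNNABLE
\tat com.example.Hot.spin(Hot.java:42)

"example-idle-thread" daemon prio=10 tid=0x00007f0000000002 nid=0x1a2c waiting on condition [0x00007f0000002000]
   java.lang.Thread.State: WAITING (parking)
\tat com.example.Idle.await(Idle.java:7)
'''


def tid_to_nid(tid: int) -> str:
    """Convert a decimal thread ID (as shown by top -H) to jstack's hex nid form."""
    return hex(tid)


def find_thread(jstack_text: str, tid: int) -> str:
    """Return the stack block of the thread whose nid matches tid, or '' if none."""
    nid = tid_to_nid(tid)
    # Thread blocks in a jstack dump are separated by blank lines.
    for block in re.split(r"\n\s*\n", jstack_text):
        # Word boundaries keep nid=0x1a2b from matching nid=0x1a2b3.
        if re.search(r"\bnid=" + re.escape(nid) + r"\b", block):
            return block.strip()
    return ""


if __name__ == "__main__":
    # 6699 stands in for the hot thread ID read from top -H.
    print(find_thread(SAMPLE_DUMP, 6699))
```

Taking several dumps a few seconds apart and checking whether the same thread stays in the same frames is a common way to tell a busy loop from a thread that merely happened to be running at dump time.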
