Our cluster on cdh3u4 had the same problem. I think it is caused by a bug in the JobTracker, and I believe Cloudera knows about the issue. Since upgrading to cdh3u5 we haven't seen it again, but I am not sure whether the fix is confirmed for cdh3u5.

Yong
> Date: Mon, 4 Feb 2013 15:21:18 -0800
> Subject: What to do/check/debug/root cause analysis when jobtracker hang
> From: silvianhad...@gmail.com
> To: user@hadoop.apache.org
>
> Lately, jobtracker in one of our production cluster fall into hang state.
> The load 5,10,15min is like 1 ish;
> with top command, jobtracker has 100% cpu all the time.
>
> So, i went ahead to try top -H -p jobtracker_pid, and always see a
> thread that have 100% cpu all the time.
>
> Unless we restart jobtracker, the hang state would never go away.
>
> I found OOM in jobtracker log file during the hang state.
>
> how could i know what is really going on on the one and only one
> thread that has 100% cpu.
>
> how could i prove that we run out of memory because amount of job
> _OR_
> there is memory leak in application side. ?
>
> I tried jstack to dump, and http://jobtracker:50030/stacks
>
> i just don't know what I should really look at output of those commands.
>
> The cluster is cdh3u4, on Centos6.2, with disable transparent_hugepage.
>
> hopefully this make sense,
> -P
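On the question about the single thread at 100% CPU: the thread id that top -H shows is decimal, while jstack prints the same id in hex as nid=0x... on each thread's header line. Here is a minimal Python sketch for matching the two. It assumes a HotSpot-style jstack dump in which thread stanzas are separated by blank lines. The function name and the file name in the note below are made up for illustration.

```python
import re


def find_hot_thread(jstack_text, linux_tid):
    """Return the jstack stanza whose nid matches a decimal thread id.

    jstack_text: full text of a jstack dump (assumed HotSpot format).
    linux_tid:   thread id as an int, as shown in the PID column of top -H.
    Returns the matching stanza as a string, or None if nothing matches.
    """
    # Thread stanzas are assumed to be separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", jstack_text):
        # The header line of each stanza carries the native id as nid=0x<hex>.
        match = re.search(r"\bnid=0x([0-9a-fA-F]+)", block)
        if match and int(match.group(1), 16) == linux_tid:
            return block.strip()
    return None
```

For example, save the dump with jstack jobtracker_pid > jt.dump (jt.dump is a hypothetical file name), then call find_hot_thread(open("jt.dump").read(), tid) with the id from top. If the stanza that comes back is a GC or VM thread rather than a JobTracker handler, that would be consistent with the OOM in the log, i.e. the JVM spending its time in back-to-back collections. That is only a guess until you look at the heap itself.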