Re: unhealthy NN after startup

David Rosenstrauch Tue, 03 Jul 2012 08:46:09 -0700

On 07/02/2012 09:04 PM, Jianhui Zhang wrote:

Hi,


I was restarting the DFS cluster. First, the DNs did not join. But if
I kept stopping and starting each DN, eventually, all DNs joined the
NN. But the NN doesn't look healthy.

The machine has 16 cores. The NN process's CPU stayed at 20% and the
"system CPU" constantly took up 50%. Here is the top output:

top - 18:01:34 up 144 days, 16:05,  5 users,  load average: 12.65, 12.06, 12.34
Tasks: 363 total,   6 running, 357 sleeping,   0 stopped,   0 zombie
Cpu(s): 18.2%us, 48.7%sy,  0.0%ni, 28.9%id,  0.0%wa,  0.0%hi,  4.1%si,  0.0%st
Mem:  33000560k total,  6449412k used, 26551148k free,   596812k buffers
Swap: 64452600k total,        0k used, 64452600k free,  3318352k cached

And it has been in this state for a long long time - several hours.

Anybody has seen this before?

Thanks,
James

Over the weekend, many people's Hadoop systems (including mine) got hit with problems due to the leap second bug in the Linux kernel. (Which brought down many major web sites.) Perhaps your namenode got hit with that as well?

As a result of the bug, many people's java or MySQL processes began using excessive CPU. The problem happened on machines that were running NTP to do time synchronization. The solution was to either reboot the server, or (if you're not able to do a reboot for whatever reason) execute a particular date command. Either of those would clear out the erroneous state in the kernel.
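For reference, the workaround most widely passed around at the time was to set the system clock to its own current value with GNU date. I'm assuming that is the "date command" meant here; the linked write-up is the place to confirm it. A rough sketch follows, where the reset_clock function name and the APPLY variable are my own and purely illustrative. It only prints the command unless APPLY=1 is set, since actually setting the clock needs root:

```shell
#!/bin/sh
# Sketch of the commonly reported leap-second workaround: set the
# system clock to the time it already shows (GNU date, -s flag).
# reset_clock and APPLY are illustrative names, not from the post.
reset_clock() {
    # Use the C locale so the default date output parses back cleanly.
    now="$(LC_ALL=C date)"
    if [ "${APPLY:-0}" = "1" ]; then
        # Needs root. Assumed to be the command the post refers to.
        LC_ALL=C date -s "$now"
    else
        # Dry run: show what would be executed, change nothing.
        printf 'would run: date -s "%s"\n' "$now"
    fi
}

reset_clock
```

Run it as-is to see the command, then as root with APPLY=1 to apply it. If the NN's system CPU drops right afterwards, that would point to the leap second bug.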

I have no idea if this is in fact your issue, but figured I'd mention it since it sounded plausible.


More details here:

http://www.somebits.com/weblog/tech/bad/leap-second-2012.html

HTH,

DR
