On 07/02/2012 09:04 PM, Jianhui Zhang wrote:
Hi,

I was restarting the DFS cluster. First, the DNs did not join. But if
I kept stopping and starting each DN, eventually, all DNs joined the
NN. But the NN doesn't look healthy.

The machine has 16 cores. The NN process's CPU stayed at 20% and the
"system CPU" constantly took up 50%. Here is the top output:

top - 18:01:34 up 144 days, 16:05,  5 users,  load average: 12.65, 12.06, 12.34
Tasks: 363 total,   6 running, 357 sleeping,   0 stopped,   0 zombie
Cpu(s): 18.2%us, 48.7%sy,  0.0%ni, 28.9%id,  0.0%wa,  0.0%hi,  4.1%si,  0.0%st
Mem:  33000560k total,  6449412k used, 26551148k free,   596812k buffers
Swap: 64452600k total,        0k used, 64452600k free,  3318352k cached

And it has been in this state for a long long time - several hours.

Has anybody seen this before?

Thanks,
James

Over the weekend, many people's Hadoop systems (including mine) were hit by problems caused by the leap second bug in the Linux kernel, which also brought down many major web sites. Perhaps your namenode was hit by it as well?

As a result of the bug, many people's Java or MySQL processes began using excessive CPU. The problem happened on machines that were running NTP for time synchronization. The solution was either to reboot the server or, if a reboot isn't possible for whatever reason, to run a date command that resets the system clock. Either of those would clear out the erroneous state in the kernel.
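
For reference, the workaround that was widely passed around at the time was to set the clock to its own current value. The sketch below assumes GNU date and that this is the command meant above; the linked post is the place to confirm the exact form. The reset_clock helper and the APPLY variable are hypothetical names used only for illustration, and by default the script only prints what it would run.

```shell
#!/bin/sh
# Sketch of the commonly circulated leap-second workaround.
# Assumptions: GNU date, and that this is the "date command" referred to above.
# reset_clock and APPLY are hypothetical names, not part of any tool.

reset_clock() {
    if [ "${APPLY:-0}" = "1" ]; then
        # Set the system clock to its own current value (requires root).
        date -s "$(date)"
    else
        # Default: dry run, only show the command that would be executed.
        echo 'would run: date -s "$(date)"'
    fi
}

reset_clock
```

Running it as root with APPLY=1 applies the change. Afterwards, top should show whether the system CPU usage has dropped.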

I have no idea if this is in fact your issue, but figured I'd mention it since it sounded plausible.

More details here:

http://www.somebits.com/weblog/tech/bad/leap-second-2012.html

HTH,

DR

