I have an old Hadoop 0.20.2 cluster. It has not had any issues for a while
(which is why I never bothered to upgrade).

Suddenly the NameNode OOMed last week, and now the OOMs happen periodically.
We have a fairly large NameNode heap (Xmx 17GB) and a fairly large FS, about
27,000,000 files.
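
As a rough sanity check on those two numbers, here is a minimal sketch of the
heap budget per file. The heap size and file count are the ones above; the 150
bytes per namespace object is only a commonly quoted rule of thumb that I am
assuming here, not something measured on this cluster, and the function names
are made up for illustration.

```python
# Back-of-the-envelope NameNode heap budget.
# HEAP_BYTES and NUM_FILES come from the post; BYTES_PER_OBJECT is an
# assumed rule-of-thumb figure, not a measurement from this cluster.
HEAP_BYTES = 17 * 1024**3   # Xmx 17GB, assuming binary gigabytes
NUM_FILES = 27_000_000      # approximate number of files in the FS
BYTES_PER_OBJECT = 150      # assumed heap cost of one file/block/directory object


def heap_budget_per_file(heap_bytes: int, num_files: int) -> float:
    """Heap bytes available per file if the whole heap held only namespace data."""
    return heap_bytes / num_files


def objects_per_file_at_full_heap(heap_bytes: int, num_files: int,
                                  bytes_per_object: int) -> float:
    """How many namespace objects each file would need to fill the heap."""
    return heap_budget_per_file(heap_bytes, num_files) / bytes_per_object


if __name__ == "__main__":
    budget = heap_budget_per_file(HEAP_BYTES, NUM_FILES)
    objects = objects_per_file_at_full_heap(HEAP_BYTES, NUM_FILES, BYTES_PER_OBJECT)
    print(f"heap budget per file: {budget:.0f} bytes")
    print(f"objects per file needed to fill the heap: {objects:.1f}")
```

If that rule of thumb is anywhere near right, the namespace alone would only
fill the heap if every file averaged several objects (the file itself plus its
blocks), which is why a periodic climb to a full heap looks suspicious to me.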

The strangest thing is that every hour and a half the NN memory usage
increases, until the heap is full.

http://imagebin.org/240287

We tried failing over the NN to another machine. We also changed the Java
version from 1.6_23 to 1.7.0.

I have set the NameNode logs to DEBUG and ALL, and I have done the same with
the DataNodes.
The Secondary NN is running, shipping edits, and making new images.

I am thinking something has corrupted the NN metadata and that after enough
time it becomes a time bomb, but this is just a total shot in the dark. Does
anyone have any interesting troubleshooting ideas?
