I have an old Hadoop 0.20.2 cluster. It has not had any issues for a long while (which is why I never bothered to upgrade).
Suddenly it OOMed last week, and now the OOMs happen periodically. We have a fairly large NameNode heap (Xmx 17GB), and it is a fairly large FS, about 27,000,000 files. The strangest thing is that every hour and a half the NN memory usage climbs until the heap is full: http://imagebin.org/240287

What we have tried so far: we failed the NN over to another machine, and we changed the Java version from 1.6_23 to 1.7.0. I have set the NameNode logging to DEBUG and ALL, and I have done the same on the DataNodes. The Secondary NN is running, shipping edits, and making new images.

I am thinking something has corrupted the NN metadata and after enough time it becomes a time bomb, but this is just a total shot in the dark. Does anyone have any interesting troubleshooting ideas?