Our HDFS version is 2.7.7. We have 500+ nodes, 230 million files and directories, 270 million blocks, and 128 GB of memory for the namenode. Recently the namenode became unstable and has been failing over 5-10 times every day.
According to the jstack output, I cannot find any stuck thread. It seems that the namenode simply cannot handle the requests in time, because the set of RUNNABLE threads changes every time I take a jstack dump. A typical thread looks like this:

"IPC Server handler 74 on 8020" daemon prio=10 tid=0x00007f5cf4f31000 nid=0x44c5 runnable [0x00007f3ab2fed000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor$BlockIterator.next(DatanodeDescriptor.java:542)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.getBlocksWithLocations(BlockManager.java:1069)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.getBlocks(BlockManager.java:1044)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlocks(NameNodeRpcServer.java:481)
        at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.getBlocks(NamenodeProtocolServerSideTranslatorPB.java:86)
        at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:12017)

We have 200 RPC handlers and do not use service RPC. Would it help to enable service RPC? Or are there any other suggestions? Let me know if you need any other information.

Many thanks.

--
Best Regards!
Wenqi