hdfs namenode fails over frequently due to timeout with zkfc

Wenqi Ma Wed, 18 Sep 2019 20:47:08 -0700

HDFS version is 2.7.7

We have 500+ nodes, 230 million files and directories, 270 million blocks,
and 128GB of memory for the namenode. Recently the namenode became unstable
and has been failing over 5-10 times every day.


In the jstack output I cannot find any stuck threads. The RUNNABLE threads
are different every time I take a dump, so it seems that the namenode simply
cannot handle the requests in time. A typical handler thread looks like this:
"IPC Server handler 74 on 8020" daemon prio=10 tid=0x00007f5cf4f31000 nid=0x44c5 runnable [0x00007f3ab2fed000]
   java.lang.Thread.State: RUNNABLE
    at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor$BlockIterator.next(DatanodeDescriptor.java:542)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.getBlocksWithLocations(BlockManager.java:1069)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.getBlocks(BlockManager.java:1044)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlocks(NameNodeRpcServer.java:481)
    at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.getBlocks(NamenodeProtocolServerSideTranslatorPB.java:86)
    at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:12017)
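
To compare successive dumps more systematically, one option is to tally the
topmost frame of every RUNNABLE handler thread in each dump. Below is a minimal
Python sketch, assuming the standard jstack text layout (a quoted thread name,
a "java.lang.Thread.State:" line, then tab-indented "at ..." frames). The
function name top_frames and the regular expressions are illustrative only and
are not part of Hadoop or the JDK.

```python
import re
from collections import Counter

# A thread entry starts with its name in double quotes at column 0.
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')
# A stack frame looks like "\tat pkg.Class.method(File.java:123)";
# capture only the qualified method name, without the "(...)" part.
FRAME = re.compile(r'^\s+at\s+(?P<frame>[^\s(]+)')


def top_frames(jstack_text, name_prefix="IPC Server handler"):
    """Count the topmost frame of each RUNNABLE thread whose name
    starts with name_prefix (assumed default: the NameNode RPC handlers)."""
    counts = Counter()
    name = None
    runnable = False
    seen_frame = False
    for line in jstack_text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            # New thread entry: reset the per-thread state.
            name = header.group("name")
            runnable = False
            seen_frame = False
            continue
        if name is None or not name.startswith(name_prefix):
            continue
        if "java.lang.Thread.State: RUNNABLE" in line:
            runnable = True
            continue
        frame = FRAME.match(line)
        if frame and runnable and not seen_frame:
            # Only the first frame after the state line is the top of stack.
            counts[frame.group("frame")] += 1
            seen_frame = True
    return counts
```

With a dump saved to a file (for example via "jstack <pid> > dump.txt"),
top_frames(open("dump.txt").read()).most_common(10) lists the methods that the
handlers are busiest in, which makes it easier to see whether the same call
dominates across dumps.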

We have 200 RPC handlers and do not use the service RPC. Would it help to
enable the service RPC? Or are there any other suggestions?
Do let me know if you need other information.
Many thanks.
-- 
Best Regards!
Wenqi
