[ https://issues.apache.org/jira/browse/HDFS-14186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16737101#comment-16737101 ]
He Xiaoqiao commented on HDFS-14186:
------------------------------------

Attaching logs from the namenode and from one datanode that was marked dead while the namenode restarted.

namenode log:
{code:java}
2019-01-03 02:13:16,197 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* removeDeadDatanode: lost heartbeat from datanode:50010
2019-01-03 02:13:16,198 INFO org.apache.hadoop.hdfs.server.blockmanagement.NodeStat: remove child: /ROOT/RACK/datanode:50010
2019-01-03 02:13:16,200 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /ROOT/RACK/datanode:50010
2019-01-03 02:13:43,207 INFO org.apache.hadoop.ipc.Server: IPC Server handler 25 on 8040, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.blockReport from datanode:53518 Call#134261749 Retry#0
java.io.IOException: ProcessReport from dead or unregistered node: DatanodeRegistration(datanode:50010, datanodeUuid=8a54acea-fec9-4267-bd9b-e21cdc821787, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-57;cid=CID-13cec691-e813-4241-a752-5bbfc4342f2f;nsid=138673305;c=1417744182994)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReport(BlockManager.java:2457)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.blockReport(NameNodeRpcServer.java:1525)
        at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.blockReport(DatanodeProtocolServerSideTranslatorPB.java:176)
        at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:33713)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1689)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)
{code}

datanode log:
{code:java}
2019-01-03 02:12:07,202 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/ip:8040. Already tried 3 time(s); maxRetries=45
2019-01-03 02:12:27,223 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/ip:8040. Already tried 4 time(s); maxRetries=45
2019-01-03 02:12:47,242 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/ip:8040. Already tried 5 time(s); maxRetries=45
2019-01-03 02:15:18,951 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/ip:8040. Already tried 0 time(s); maxRetries=45
2019-01-03 02:15:38,953 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/ip:8040. Already tried 1 time(s); maxRetries=45
{code}
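For context, the "ProcessReport from dead or unregistered node" rejection above comes from the liveness check at the top of BlockManager#processReport. Paraphrased from the 2.x code line (a sketch for illustration, not the exact source):

{code:java}
// BlockManager#processReport (2.x, paraphrased): a block report from a node
// that was just removed by removeDeadDatanode() is rejected outright, so the
// datanode must re-register and send the whole (expensive) report again.
final DatanodeDescriptor node = datanodeManager.getDatanode(nodeID);
if (node == null || !node.isAlive) {
  throw new IOException(
      "ProcessReport from dead or unregistered node: " + nodeID);
}
{code}

Each spurious death therefore doubles that datanode's blockreport load, which is exactly the feedback loop described in the issue below.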
> blockreport storm slow down namenode restart seriously in large cluster
> -----------------------------------------------------------------------
>
>                 Key: HDFS-14186
>                 URL: https://issues.apache.org/jira/browse/HDFS-14186
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: He Xiaoqiao
>            Assignee: He Xiaoqiao
>            Priority: Major
>
> In the current implementation, each datanode sends a block report immediately after registering with the namenode on restart, and the resulting blockreport storm puts the namenode under heavy load while it processes them. One consequence is that some received RPCs must be skipped because their queue time exceeds the timeout.
> If a datanode's heartbeat RPCs keep being skipped for long enough (the default heartbeatExpireInterval is 630s), the datanode is marked DEAD and has to re-register and send its block report again, which aggravates the blockreport storm and traps the cluster in a vicious circle. This slows namenode startup severely (by an hour or more), especially in large (several thousand datanodes) and busy clusters. Although much work has been done to optimize namenode startup, the issue still exists.
> I propose to postpone the dead-datanode check until the namenode has finished startup (see the sketch after this message).
> Any comments and suggestions are welcome.
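A note on where the 630s default comes from: DatanodeManager computes heartbeatExpireInterval = 2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval, i.e. 2 * 300s + 10 * 3s = 630s with the default settings. A minimal sketch of the proposed postponement follows, assuming a hypothetical namesystem flag isNameNodeStartupComplete() (the name is invented for illustration; the actual patch may hook startup state differently):

{code:java}
// HeartbeatManager#heartbeatCheck (sketch only, not the HDFS-14186 patch).
// isNameNodeStartupComplete() is a hypothetical hook: postpone the dead-node
// scan until the namenode has fully finished startup, so that heartbeats
// stuck behind the initial blockreport storm cannot get healthy datanodes
// declared dead and forced into a second full block report.
void heartbeatCheck() {
  if (!namesystem.isNameNodeStartupComplete()) {
    return; // still starting up: skip the dead-node scan for now
  }
  // ... existing scan: mark a datanode dead when
  // monotonicNow() - lastHeartbeatTime > heartbeatExpireInterval,
  // then call removeDeadDatanode() as seen in the namenode log above.
}
{code}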