[ https://issues.apache.org/jira/browse/HDFS-14186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735167#comment-16735167 ]
He Xiaoqiao commented on HDFS-14186:
------------------------------------
I'd like to offer more details:
1. When the NameNode restarts, every DataNode has to re-register and send a full block report.
2. The NameNode leaves safe mode once the number of reported blocks (NOT replicas) reaches the configured block threshold (default 99.9%) of blockTotal, and the number of live DataNodes (which equals the number of registered DataNodes) reaches the DataNode threshold (default 0).
3. When the NameNode leaves safe mode during startup, the heartbeat checker goes back into working order, BUT the block report storm is still ongoing at that moment.
4. Because the NameNode load is very high, some heartbeat RPCs from DataNodes are discarded, as mentioned above. Those DataNodes may then be marked stale or even dead, and have to re-register and send their block reports again.
5. This seriously slows down NameNode restart (more than 8 hours in the worst case I have met, on a cluster of 20K slaves).

The core issue: the condition for leaving safe mode is the number of reported blocks rather than the number of reported replicas of ALL blocks. For instance, with 12K DataNodes in a cluster and the default replication factor of 3, the NameNode may leave safe mode once roughly 4K (= 1/3 * 12K) DataNodes have reported under the default configuration, because those ~4K DataNodes may theoretically hold a replica of every block. However, 8K DataNodes are still reporting, and the NameNode load remains very high at that point. One workaround is to configure a non-zero DataNode threshold, but that value is not stable as the cluster changes and is of limited use. I think we should use replicationTotal rather than blockTotal as the condition to leave safe mode, and postpone the heartbeat checker until then.
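For reference, the two thresholds mentioned in step 2 correspond to the following hdfs-site.xml properties; the values shown here are the defaults discussed above:

```xml
<!-- Fraction of blockTotal that must have at least one reported replica
     before the NameNode may leave safe mode (the "99.9%" default above). -->
<property>
  <name>dfs.namenode.safemode.threshold-pct</name>
  <value>0.999f</value>
</property>
<!-- Minimum number of live DataNodes required to leave safe mode;
     the default of 0 is the "datanode threshold" discussed above. -->
<property>
  <name>dfs.namenode.safemode.min.datanodes</name>
  <value>0</value>
</property>
```

Raising dfs.namenode.safemode.min.datanodes is the workaround criticized above: it must be re-tuned whenever the cluster grows or shrinks.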
> blockreport storm slows down namenode restart seriously in large cluster
> -----------------------------------------------------------------------
>
> Key: HDFS-14186
> URL: https://issues.apache.org/jira/browse/HDFS-14186
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Reporter: He Xiaoqiao
> Assignee: He Xiaoqiao
> Priority: Major
>
> In the current implementation, a DataNode sends its block report immediately
> after successfully registering with the NameNode on restart, and the
> resulting block report storm puts the NameNode under high load while
> processing them. One consequence is that some received RPCs have to be
> skipped because their queue time exceeds the timeout. If a DataNode's
> heartbeat RPCs are continually skipped for long enough (default is
> heartbeatExpireInterval=630s), it is marked DEAD; the DataNode then has to
> re-register and send its block report again, which aggravates the block
> report storm and traps the cluster in a vicious circle, seriously slowing
> down NameNode startup (by more than an hour, and sometimes much more),
> especially in a large (several thousand DataNodes) and busy cluster.
> Although much work has been done to optimize NameNode startup, the issue
> still exists.
> I propose to postpone the dead-DataNode check until the NameNode has
> finished startup.
> Any comments and suggestions are welcome.
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
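The two safe-mode exit conditions being contrasted, the current block-count check and the proposed replica-count check, can be sketched as follows. This is a simplified, hypothetical model for illustration; the names and structure are not the actual Hadoop code, whose safe-mode logic is considerably more involved.

```java
// Hypothetical, simplified model of the NameNode safe-mode exit decision.
// Not the real Hadoop implementation; parameter names are illustrative.
public class SafeModeSketch {

    /** Current behavior: leave safe mode when enough *blocks* have at
     *  least one reported replica and enough DataNodes are registered. */
    static boolean canLeaveSafeModeByBlocks(long blocksWithReplica,
                                            long blockTotal,
                                            double thresholdPct,     // default 0.999
                                            int liveDataNodes,
                                            int datanodeThreshold) { // default 0
        long blockThreshold = (long) (blockTotal * thresholdPct);
        return blocksWithReplica >= blockThreshold
                && liveDataNodes >= datanodeThreshold;
    }

    /** Proposed behavior: count reported *replicas* against the expected
     *  total replica count, so one replica per block is not enough. */
    static boolean canLeaveSafeModeByReplicas(long reportedReplicas,
                                              long replicationTotal,
                                              double thresholdPct) {
        return reportedReplicas >= (long) (replicationTotal * thresholdPct);
    }

    public static void main(String[] args) {
        // 1M blocks at replication factor 3: exactly one replica of every
        // block has been reported, e.g. ~1/3 of the DataNodes checked in.
        long blockTotal = 1_000_000;
        long replicationTotal = 3 * blockTotal;
        // Block-based check passes even though 2/3 of replicas are missing:
        System.out.println(
            canLeaveSafeModeByBlocks(1_000_000, blockTotal, 0.999, 4_000, 0));
        // Replica-based check would keep the NameNode in safe mode:
        System.out.println(
            canLeaveSafeModeByReplicas(1_000_000, replicationTotal, 0.999));
    }
}
```

With the block-based check the first call returns true and the second false, which is exactly the gap the comment describes: the NameNode exits safe mode and resumes the heartbeat checker while two thirds of the block reports are still inbound.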