[ https://issues.apache.org/jira/browse/HDFS-14186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735167#comment-16735167 ]

He Xiaoqiao commented on HDFS-14186:
------------------------------------

I would like to offer more details:
1. When the NameNode restarts, all DataNodes have to re-register and send block 
reports, and this keeps going for a while.
2. The NameNode leaves safe mode once the number of reported blocks (NOT 
replicas) reaches the configured block threshold (default = 99.9%) * blockTotal, 
and the number of live DataNodes (i.e. registered DataNodes) reaches the 
DataNode threshold (default = 0); see the sketch after this list.
3. When the NameNode leaves safe mode during startup, the heartbeat checker goes 
back into operation, BUT the block report storm is still ongoing at that moment.
4. Because NameNode load is very high, some heartbeat RPCs from DataNodes are 
discarded, as mentioned above. Some DataNodes may then be marked stale or even 
dead, and have to re-register and send their block reports again.
5. This seriously slows down NameNode restart (more than 8 hours in the worst 
case I have met, with 20K DataNodes).
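
For reference, a minimal sketch of the exit check described in step 2, assuming 
the default thresholds (dfs.namenode.safemode.threshold-pct = 0.999, 
dfs.namenode.safemode.min.datanodes = 0); the class and field names are 
simplified stand-ins, not the actual NameNode safe-mode code:

    // Illustrative sketch only; simplified from the behavior described above,
    // not the real SafeModeInfo/BlockManagerSafeMode implementation.
    class StartupSafeModeCheck {
      long blockTotal;               // complete blocks known to the NameNode
      long blockSafe;                // blocks with >= 1 reported replica
      int numLiveDataNodes;          // currently registered, live DataNodes
      double thresholdPct = 0.999;   // dfs.namenode.safemode.threshold-pct default
      int datanodeThreshold = 0;     // dfs.namenode.safemode.min.datanodes default

      boolean canLeaveSafeMode() {
        // Counts blocks, not replicas: one reported replica marks a block safe.
        return blockSafe >= (long) (thresholdPct * blockTotal)
            && numLiveDataNodes >= datanodeThreshold;
      }
    }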

The core issue: the number of reported blocks is used as the condition to leave 
safe mode, rather than the number of reported replicas of ALL blocks.
For instance, in a cluster with 12K DataNodes and the default replication factor 
of 3, once roughly 4K (= 1/3 * 12K) DataNodes have reported, the NameNode may 
leave safe mode under the default configuration, because those ~4K DataNodes can 
theoretically hold at least one replica of every block. However, the other 8K 
DataNodes are still reporting, and NameNode load is still very high at that point.

One workaround is to configure the DataNode threshold 
(dfs.namenode.safemode.min.datanodes, 0 by default) to a non-zero value, but that 
value is hard to keep stable and is of limited use.
I think we can instead use replicationTotal rather than blockTotal as the 
condition to leave safe mode, and postpone putting the heartbeat checker back 
into operation.

> blockreport storm slow down namenode restart seriously in large cluster
> -----------------------------------------------------------------------
>
>                 Key: HDFS-14186
>                 URL: https://issues.apache.org/jira/browse/HDFS-14186
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: He Xiaoqiao
>            Assignee: He Xiaoqiao
>            Priority: Major
>
> In the current implementation, a DataNode sends its block report immediately 
> after it successfully registers with the NameNode on restart, and the resulting 
> block report storm puts the NameNode under very high load. One consequence is 
> that some received RPCs have to be dropped because their queue time exceeds the 
> timeout. If a DataNode's heartbeat RPCs keep being dropped for long enough 
> (default heartbeatExpireInterval = 630s), the DataNode is marked DEAD and has 
> to re-register and send its block report again, which aggravates the block 
> report storm, traps the cluster in a vicious circle, and seriously slows down 
> NameNode startup (by more than one hour, and often much more), especially in a 
> large (several thousand DataNodes) and busy cluster. Although there has been a 
> lot of work to optimize NameNode startup, the issue still exists.
> I propose to postpone the dead-DataNode check until the NameNode has finished 
> startup.
> Any comments and suggestions are welcome.
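
For context on the 630s figure mentioned above: it follows from the default 
heartbeat settings. A small sketch of the arithmetic, assuming the defaults 
dfs.namenode.heartbeat.recheck-interval = 300000 ms and dfs.heartbeat.interval 
= 3 s:

    // Sketch of how the 630s heartbeat expiry is derived from default settings;
    // the constants mirror dfs.namenode.heartbeat.recheck-interval (300000 ms)
    // and dfs.heartbeat.interval (3 s).
    class HeartbeatExpiry {
      public static void main(String[] args) {
        long recheckIntervalMs = 300_000L;   // 5 minutes
        long heartbeatIntervalSec = 3L;
        long expireIntervalMs =
            2 * recheckIntervalMs + 10 * 1000 * heartbeatIntervalSec;
        System.out.println(expireIntervalMs / 1000 + "s");  // prints 630s
      }
    }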


