[jira] [Commented] (HDFS-14186) blockreport storm slow down namenode restart seriously in large cluster

He Xiaoqiao (JIRA) Tue, 15 Jan 2019 19:49:13 -0800


    [ 
https://issues.apache.org/jira/browse/HDFS-14186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16743589#comment-16743589
 ]


He Xiaoqiao commented on HDFS-14186:
------------------------------------

[~elgoiri], thanks for your comments.
{quote}In any case, I think that if the DN is not registered, it cannot be 
marked as DEAD.{quote}
what I wanted to say was that lifeline could fix the case that datanode has 
registered when startup, but there is also another issue that datanode could 
not register successfully since namenode is overrun when namenode at startup 
progress especially in a large cluster.
{quote}Ideally a stack trace of the thread that is holding the other 
requests.{quote}
I will post stack trace soon. another side, I think namenode massive log about 
processing blockreport should be very telling. Thanks again.

> blockreport storm slow down namenode restart seriously in large cluster
> -----------------------------------------------------------------------
>
>                 Key: HDFS-14186
>                 URL: https://issues.apache.org/jira/browse/HDFS-14186
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: He Xiaoqiao
>            Assignee: He Xiaoqiao
>            Priority: Major
>         Attachments: HDFS-14186.001.patch
>
>
> In the current implementation, the datanode sends blockreport immediately 
> after register to namenode successfully when restart, and the blockreport 
> storm will make namenode high load to process them. One result is some 
> received RPC have to skip because queue time is timeout. If some datanodes' 
> heartbeat RPC are continually skipped for long times (default is 
> heartbeatExpireInterval=630s) it will be set DEAD, then datanode has to 
> re-register and send blockreport again, aggravate blockreport storm and trap 
> in a vicious circle, and slow down (more than one hour and even more) 
> namenode startup seriously in a large (several thousands of datanodes) and 
> busy cluster especially. Although there are many work to optimize namenode 
> startup, the issue still exists. 
> I propose to postpone dead datanode check when namenode have finished startup.
> Any comments and suggestions are welcome.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Commented] (HDFS-14186) blockreport storm slow down namenode restart seriously in large cluster

Reply via email to