[ https://issues.apache.org/jira/browse/HDFS-14186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16752055#comment-16752055 ]

He Xiaoqiao commented on HDFS-14186:
------------------------------------

The issue of NameNode restart taking a very long time has been solved by the patches 
HADOOP-12173 + HDFS-9198.
The root causes are as follows:
A. NetworkTopology#toString is a hot spot (only in hadoop-2.7.1).
B. Serial block report (BR) processing hurts performance during restart.

Cause A makes the registerDatanode RPC take a long time to process; the worst case 
looks like:
{quote}2019-01-21 18:08:06,303 DEBUG org.apache.hadoop.ipc.Server: Served: 
registerDatanode queueTime= 66079 procesingTime= 3266{quote}
And the CallQueue is always full, so some DataNodes have to retry until they register 
successfully. A stack trace looks like:
{quote}"IPC Server handler 40 on 8040" #149 daemon prio=5 os_prio=0 
tid=0x00007f7ff571c800 nid=0x2a9dd runnable [0x00007f19b10ce000]
   java.lang.Thread.State: RUNNABLE
        at 
org.apache.hadoop.net.NetworkTopology$InnerNode.getLeaf(NetworkTopology.java:340)
        at 
org.apache.hadoop.net.NetworkTopology$InnerNode.getLeaf(NetworkTopology.java:340)
        at 
org.apache.hadoop.net.NetworkTopology.toString(NetworkTopology.java:831)
        at org.apache.hadoop.net.NetworkTopology.add(NetworkTopology.java:403)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.registerDatanode(DatanodeManager.java:1029)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.registerDatanode(FSNamesystem.java:4741)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.registerDatanode(NameNodeRpcServer.java:1487)
        at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.registerDatanode(DatanodeProtocolServerSideTranslatorPB.java:97)
        at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:33709)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:847)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:790)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1686)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2458){quote}
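The hot spot in the trace above is the unguarded debug-logging pattern addressed by 
HADOOP-12173: NetworkTopology#add builds the whole topology string on every DataNode 
registration, so the cost grows with cluster size. A minimal sketch of the pattern and 
of the guard-style fix (illustrative class and field names, not the actual Hadoop code):
{code:java}
// A minimal sketch (not the actual Hadoop code) of the pattern behind cause A:
// building the full topology string on every add() call, even when debug
// logging is disabled, makes each DataNode registration an O(cluster size)
// operation.
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

public class TopologySketch {
  private static final Logger LOG = Logger.getLogger(TopologySketch.class.getName());

  private final List<String> leaves = new ArrayList<>();

  public synchronized void add(String leaf) {
    leaves.add(leaf);

    // Hot spot: the expensive toString() runs unconditionally because the
    // string is concatenated before the log level is ever checked.
    // LOG.fine("NetworkTopology became:\n" + this);

    // Guard-style fix: only build the string when debug logging is enabled.
    if (LOG.isLoggable(Level.FINE)) {
      LOG.fine("NetworkTopology became:\n" + this);
    }
  }

  @Override
  public synchronized String toString() {
    // Walks every leaf; with ~40K DataNodes this is a large string build
    // performed while holding the topology lock.
    return String.join("\n", leaves);
  }
}
{code}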
Cause B is easy to understand and is quickly fixed by HDFS-9198, after which BR RPCs 
no longer occupy the CallQueue for long periods.
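The effect described for cause B can be pictured as handing the block report off to a 
background thread so the RPC handler returns quickly. A rough sketch of that general 
pattern (illustrative names only, not the actual HDFS-9198 change):
{code:java}
// Rough sketch of the general pattern: the RPC handler only enqueues the
// block report and returns, freeing its CallQueue slot; a dedicated thread
// drains the queue and does the expensive reconciliation work.
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class BlockReportQueueSketch {
  private final BlockingQueue<long[]> pendingReports = new LinkedBlockingQueue<>();

  /** Called from the RPC handler thread: O(1), returns immediately. */
  public void blockReport(long[] reportedBlocks) {
    pendingReports.add(reportedBlocks);
  }

  /** Runs on a dedicated background thread. */
  public void processLoop() throws InterruptedException {
    while (!Thread.currentThread().isInterrupted()) {
      long[] report = pendingReports.take();
      process(report); // the expensive part happens off the handler threads
    }
  }

  private void process(long[] report) {
    // Placeholder for reconciling the reported blocks with the block map.
  }
}
{code}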

On a test environment built with Dynamometer (40K nodes, 1.5B inodes+blocks), the 
NameNode restart can finish in under 1.5 hours.

However, regarding the other issue mentioned above: for about 10 minutes after the 
NameNode leaves safemode, the load on the service RPC CallQueue does not decrease, 
because the block reports have not yet been fully processed.

> blockreport storm slow down namenode restart seriously in large cluster
> -----------------------------------------------------------------------
>
>                 Key: HDFS-14186
>                 URL: https://issues.apache.org/jira/browse/HDFS-14186
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 2.7.1
>            Reporter: He Xiaoqiao
>            Assignee: He Xiaoqiao
>            Priority: Major
>         Attachments: HDFS-14186.001.patch
>
>
> In the current implementation, a datanode sends its block report immediately 
> after it registers with the namenode on restart, and the resulting block report 
> storm puts the namenode under high load while processing them. One result is 
> that some received RPCs have to be dropped because their queue time exceeds the 
> timeout. If a datanode's heartbeat RPCs are continually dropped for long enough 
> (default heartbeatExpireInterval=630s), the datanode is marked DEAD and has to 
> re-register and send its block report again, which aggravates the block report 
> storm, traps the cluster in a vicious circle, and seriously slows down namenode 
> startup (to more than one hour, or even longer), especially in a large (several 
> thousand datanodes) and busy cluster. Although there has been much work to 
> optimize namenode startup, the issue still exists.
> I propose to postpone the dead-datanode check until the namenode has finished startup.
> Any comments and suggestions are welcome.
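To illustrate the proposal quoted above, a rough sketch (not the attached patch; the 
startupFinished flag and class are hypothetical) of postponing the dead-node check 
until startup has finished:
{code:java}
// Hypothetical sketch of the proposal: skip marking DataNodes dead while the
// NameNode is still starting up, so nodes whose heartbeats were starved by
// the block report storm are not forced to re-register and re-report.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HeartbeatCheckSketch {
  private static final long HEARTBEAT_EXPIRE_INTERVAL_MS = 630_000L; // default 630s

  private volatile boolean startupFinished = false;            // hypothetical flag
  private final Map<String, Long> lastContactMillis = new ConcurrentHashMap<>();

  public void heartbeatCheck() {
    if (!startupFinished) {
      return; // postpone the dead-datanode check until startup completes
    }
    long now = System.currentTimeMillis();
    for (Map.Entry<String, Long> e : lastContactMillis.entrySet()) {
      if (now - e.getValue() > HEARTBEAT_EXPIRE_INTERVAL_MS) {
        markDead(e.getKey());
      }
    }
  }

  public void markStartupFinished() {
    startupFinished = true;
  }

  private void markDead(String datanodeId) {
    // Placeholder: remove the node from the live set, trigger re-replication, etc.
  }
}
{code}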


