[ https://issues.apache.org/jira/browse/HDFS-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542370#comment-17542370 ]
JiangHua Zhu commented on HDFS-16594:
-------------------------------------

Thanks [~sodonnell] and [~weichiu] for your comments and follow-up.

> Many RpcCalls are blocked for a while while Decommission works
> --------------------------------------------------------------
>
>                 Key: HDFS-16594
>                 URL: https://issues.apache.org/jira/browse/HDFS-16594
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 2.9.2
>            Reporter: JiangHua Zhu
>            Assignee: JiangHua Zhu
>            Priority: Major
>         Attachments: image-2022-05-26-02-05-38-878.png
>
> When some DataNodes need to go offline, Decommission starts to work and periodically checks the number of blocks remaining to be processed. By default, after checking more than 500,000 blocks (${dfs.namenode.decommission.blocks.per.interval}), the DatanodeAdminDefaultMonitor thread sleeps for a while before continuing. If the number of blocks to be checked is very large, for example when the number of replicas managed by a DataNode reaches 900,000 or even 1,000,000, the DatanodeAdminDefaultMonitor holds the FSNamesystemLock for the entire check, which blocks a lot of RpcCalls. Here are some logs:
> !image-2022-05-26-02-05-38-878.png!
> It can be seen that the last inspection pass covered more than 1,000,000 blocks.
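The per-interval budget named above is a standard hdfs-site.xml property. As one possible mitigation, lowering it shortens each lock-holding pass at the cost of slower decommission progress; a sketch of such a fragment (the value 100000 is illustrative only, not a recommendation from this issue):

```xml
<!-- hdfs-site.xml: scan fewer blocks per DatanodeAdminDefaultMonitor tick so
     the FSNamesystemLock is released more often during decommission.
     500000 is the default; the right value depends on cluster size and
     heartbeat load. -->
<property>
  <name>dfs.namenode.decommission.blocks.per.interval</name>
  <value>100000</value>
</property>
```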
> When the check is over, the FSNamesystemLock is released and the queued RpcCalls start working:
> '
> 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 36 on 8021:Server@494] - Slow RPC : sendHeartbeat took 3488 milliseconds to process from client Call#5571549 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from xxxx.xxxx.xxxx.xxxx:35727
> 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 135 on 8021:Server@494] - Slow RPC : sendHeartbeat took 3472 milliseconds to process from client Call#36795561 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from xxxx.xxxx.xxxx.xxxx:37793
> 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 108 on 8021:Server@494] - Slow RPC : sendHeartbeat took 3445 milliseconds to process from client Call#5497586 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from xxxx.xxxx.xxxx.xxxx:23475
> '
> '
> 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 33 on 8021:Server@494] - Slow RPC : sendHeartbeat took 3435 milliseconds to process from client Call#6043903 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from xxxx.xxxx.xxxx.xxxx:34746
> 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 139 on 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process from client Call#274471 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from xxxx.xxxx.xxxx.xxxx:46419
> 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 77 on 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process from client Call#73375524 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from xxxx.xxxx.xxxx.xxxx:34241
> '
> Since the RpcCalls wait for a long time, RpcQueueTime + RpcProcessingTime is longer than usual.
> A very large number of RpcCalls were affected during this time.
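The blocking pattern described above can be sketched as follows. This is a simplified illustration, not the actual DatanodeAdminDefaultMonitor source: it assumes the per-interval budget is only compared between nodes, so a single node with ~1,000,000 replicas is scanned to completion under the lock, well past the 500,000-block budget, while RPC handlers (e.g. sendHeartbeat) queue behind it. The class and method names here are hypothetical.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Simplified sketch of the lock-holding behavior described in this issue
// (hypothetical names; not the Hadoop source).
public class DecommissionMonitorSketch {
    // stand-in for dfs.namenode.decommission.blocks.per.interval (default 500000)
    static final long BLOCKS_PER_INTERVAL = 500_000L;

    // stand-in for the FSNamesystemLock that RPC handlers also contend for
    static final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();

    /** Returns how many blocks one monitor tick scans while holding the lock. */
    static long oneTick(long[] blocksPerDecommissioningNode) {
        long checked = 0;
        fsLock.writeLock().lock(); // sendHeartbeat etc. now queue behind this lock
        try {
            for (long nodeBlocks : blocksPerDecommissioningNode) {
                // a whole node's replica list is scanned before the budget
                // is re-checked, so one large node blows past the budget
                checked += nodeBlocks;
                if (checked >= BLOCKS_PER_INTERVAL) {
                    break; // budget is only tested at node granularity
                }
            }
        } finally {
            fsLock.writeLock().unlock(); // only now do the queued RpcCalls run
        }
        return checked; // may greatly exceed BLOCKS_PER_INTERVAL
    }

    public static void main(String[] args) {
        // one node with 1,000,000 replicas: the whole tick holds the lock
        // while scanning all of them
        System.out.println(oneTick(new long[] {1_000_000L})); // prints 1000000
    }
}
```

The longer this single pass runs under the lock, the longer RpcQueueTime grows for every waiting call, matching the Slow RPC warnings above.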