[ 
https://issues.apache.org/jira/browse/HDFS-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17743628#comment-17743628
 ] 

Yanlei Yu edited comment on HDFS-15901 at 7/17/23 6:00 AM:
-----------------------------------------------------------

In our cluster of 800+ nodes, after restarting the NameNode we found that some 
DataNodes did not report all of their blocks, so the NameNode stayed in safe 
mode for a long time after the restart because of the incomplete block reports.
In the logs of a DataNode with an incomplete block report, I found that the 
first FBR attempt failed, possibly because the NameNode was under stress, and a 
second FBR attempt was then made, as follows:
{code:java}
....
2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Unsuccessfully sent block report 0x6237a52c1e817e,  containing 12 storage 
report(s), of which we sent 1. The reports had 1099057 total blocks and used 1 
RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN 
processing. Got back no commands.
2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Successfully sent block report 0x62382416f3f055,  containing 12 storage 
report(s), of which we sent 12. The reports had 1099048 total blocks and used 
12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN 
processing. Got back no commands.
... {code}
There is nothing wrong with that on the DataNode side: if the report fails, it 
is simply retried. But the logic on the NameNode side is:
{code:java}
      if (namesystem.isInStartupSafeMode()
          && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
          && storageInfo.getBlockReportCount() > 0) {
        blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
            + "discarded non-initial block report from {}"
            + " because namenode still in startup phase",
            strBlockReportId, fullBrLeaseId, nodeID);
        blockReportLeaseManager.removeLease(node);
        return !node.hasStaleStorages();
      } {code}
When a storage is identified as having already reported, i.e. 
storageInfo.getBlockReportCount() > 0, the NameNode removes the DataNode's 
block report lease, so the second (retried) report fails because the DataNode 
no longer holds a lease.
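
To make the failure mode concrete, here is a minimal, self-contained sketch 
that models the sequence described above. The names in it (FbrLeaseSketch, 
processReport, blockReportCount, hasLease) are illustrative stand-ins, not the 
real BlockManager/BlockReportLeaseManager code; it only reproduces the effect 
of the guard quoted above:
{code:java}
import java.util.HashMap;
import java.util.Map;

// Illustrative model only: one DataNode with several storages reporting to a
// NameNode that is still in startup safe mode.
public class FbrLeaseSketch {
  // Per-storage count of full block reports already processed by the NameNode.
  private static final Map<String, Integer> blockReportCount = new HashMap<>();
  // Whether the DataNode still holds its block report lease.
  private static boolean hasLease = true;
  private static final boolean inStartupSafeMode = true;

  // Mimics the guard in BlockManager#processReport quoted above.
  private static void processReport(String storage) {
    if (!hasLease) {
      System.out.println("rejected " + storage
          + ": DN is not in the pending set (no lease)");
      return;
    }
    if (inStartupSafeMode && blockReportCount.getOrDefault(storage, 0) > 0) {
      System.out.println("discarded non-initial report from " + storage
          + ", lease removed");
      hasLease = false; // corresponds to blockReportLeaseManager.removeLease(node)
      return;
    }
    blockReportCount.merge(storage, 1, Integer::sum);
    System.out.println("processed initial report from " + storage);
  }

  public static void main(String[] args) {
    // First FBR attempt: the RPC fails on the DataNode side, but the NameNode
    // has already processed storage s1, so its report count is now 1.
    processReport("s1");

    // Second FBR attempt (the retry): s1 is now seen as a non-initial report,
    // the lease is removed, and every remaining storage is rejected.
    for (String s : new String[] {"s1", "s2", "s3"}) {
      processReport(s);
    }
  }
} {code}
Running the sketch processes the first storage, discards it as "non-initial" on 
the retry, and then rejects the remaining storages for lack of a lease, which 
is consistent with the "discarded non-initial block report" and "not in the 
pending set" messages shown in the issue description below.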


> Solve the problem of DN repeated block reports occupying too many RPCs during 
> Safemode
> --------------------------------------------------------------------------------------
>
>                 Key: HDFS-15901
>                 URL: https://issues.apache.org/jira/browse/HDFS-15901
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: JiangHua Zhu
>            Assignee: JiangHua Zhu
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> When the cluster exceeds thousands of nodes and we restart the NameNode 
> service, all DataNodes send a full block report to the NameNode. During 
> SafeMode, some DataNodes may send their block reports to the NameNode 
> multiple times, which takes up too many RPCs. In fact, this is unnecessary.
> In this case, some block report leases will fail or time out, and in extreme 
> cases, the NameNode will stay in Safe Mode indefinitely.
> 2021-03-14 08:16:25,873 [78438700] - INFO  [Block report 
> processor:BlockManager@2158] - BLOCK* processReport 0xexxxxxxxx: discarded 
> non-initial block report from DatanodeRegistration(xxxxxxxx:port, 
> datanodeUuid=xxxxxxxx, infoPort=xxxxxxxx, infoSecurePort=xxxxxxxx, 
> ipcPort=xxxxxxxx, storageInfo=lv=xxxxxxxx;nsid=xxxxxxxx;c=0) because namenode 
> still in startup phase
> 2021-03-14 08:16:31,521 [78444348] - INFO  [Block report 
> processor:BlockManager@2158] - BLOCK* processReport 0xexxxxxxxx: discarded 
> non-initial block report from DatanodeRegistration(xxxxxxxx, 
> datanodeUuid=xxxxxxxx, infoPort=xxxxxxxx, infoSecurePort=xxxxxxxx, 
> ipcPort=xxxxxxxx, storageInfo=lv=xxxxxxxx;nsid=xxxxxxxx;c=0) because namenode 
> still in startup phase
> 2021-03-13 18:35:38,200 [29191027] - WARN  [Block report 
> processor:BlockReportLeaseManager@311] - BR lease 0xxxxxxxxx is not valid for 
> DN xxxxxxxx, because the DN is not in the pending set.
> 2021-03-13 18:36:08,143 [29220970] - WARN  [Block report 
> processor:BlockReportLeaseManager@311] - BR lease 0xxxxxxxxx is not valid for 
> DN xxxxxxxx, because the DN is not in the pending set.
> 2021-03-13 18:36:08,143 [29220970] - WARN  [Block report 
> processor:BlockReportLeaseManager@317] - BR lease 0xxxxxxxxx is not valid for 
> DN xxxxxxxx, because the lease has expired.
> 2021-03-13 18:36:08,145 [29220972] - WARN  [Block report 
> processor:BlockReportLeaseManager@317] - BR lease 0xxxxxxxxx is not valid for 
> DN xxxxxxxx, because the lease has expired.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
