[jira] [Comment Edited] (HDFS-15901) Solve the problem of DN repeated block reports occupying too many RPCs during Safemode

2021-03-17 Thread JiangHua Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17303427#comment-17303427
 ] 

JiangHua Zhu edited comment on HDFS-15901 at 3/17/21, 1:44 PM:
---

[~weichiu] [~hexiaoqiao], could you help review the code?
Thank you very much.



> Solve the problem of DN repeated block reports occupying too many RPCs during 
> Safemode
> --
>
> Key: HDFS-15901
> URL: https://issues.apache.org/jira/browse/HDFS-15901
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When a cluster exceeds several thousand nodes and the NameNode service is 
> restarted, every DataNode sends a full block report to the NameNode. While 
> the NameNode is in SafeMode, some DataNodes may send their block reports 
> multiple times, which takes up too many RPCs. This repetition is unnecessary.
> In this case, some block report leases will fail or time out, and in extreme 
> cases the NameNode will stay in Safe Mode indefinitely.
> 2021-03-14 08:16:25,873 [78438700] - INFO  [Block report 
> processor:BlockManager@2158] - BLOCK* processReport 0xe: discarded 
> non-initial block report from DatanodeRegistration(:port, 
> datanodeUuid=, infoPort=, infoSecurePort=, 
> ipcPort=, storageInfo=lv=;nsid=;c=0) because namenode 
> still in startup phase
> 2021-03-14 08:16:31,521 [78444348] - INFO  [Block report 
> processor:BlockManager@2158] - BLOCK* processReport 0xe: discarded 
> non-initial block report from DatanodeRegistration(, 
> datanodeUuid=, infoPort=, infoSecurePort=, 
> ipcPort=, storageInfo=lv=;nsid=;c=0) because namenode 
> still in startup phase
> 2021-03-13 18:35:38,200 [29191027] - WARN  [Block report 
> processor:BlockReportLeaseManager@311] - BR lease 0x is not valid for 
> DN , because the DN is not in the pending set.
> 2021-03-13 18:36:08,143 [29220970] - WARN  [Block report 
> processor:BlockReportLeaseManager@311] - BR lease 0x is not valid for 
> DN , because the DN is not in the pending set.
> 2021-03-13 18:36:08,143 [29220970] - WARN  [Block report 
> processor:BlockReportLeaseManager@317] - BR lease 0x is not valid for 
> DN , because the lease has expired.
> 2021-03-13 18:36:08,145 [29220972] - WARN  [Block report 
> processor:BlockReportLeaseManager@317] - BR lease 0x is not valid for 
> DN , because the lease has expired.






[jira] [Comment Edited] (HDFS-15901) Solve the problem of DN repeated block reports occupying too many RPCs during Safemode

2021-03-18 Thread Kihwal Lee (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304503#comment-17304503
 ] 

Kihwal Lee edited comment on HDFS-15901 at 3/18/21, 10:32 PM:
--

The block report lease feature is supposed to improve this, but it ended up 
causing more problems in our experience. One of the main reasons for duplicate 
reporting is the inability to retransmit a single report on an RPC timeout. On 
startup, the NN's call queue can easily be overwhelmed, since FBR processing is 
relatively slow. It is common to see the processing of a single storage take 
100s of milliseconds, so a half dozen storage reports can take up a whole 
second. You can easily imagine more than 60 seconds' worth of reports waiting 
in the call queue, which will cause a timeout for some of the reports. 
Unfortunately, the datanode's full block reporting does not retransmit only the 
affected report. It regenerates the whole thing and starts all over again. Even 
if only the last storage FBR ran into trouble, the datanode will retransmit 
everything.

The reason it sometimes gets stuck in safe mode is likely the curse of the 
block report lease. When an FBR is retransmitted, the feature makes the NN drop 
the reports. We have seen this happen in big clusters. If the block report 
lease weren't there, it wouldn't have gotten stuck in safe mode.

We have recently gutted the FBR lease feature internally and implemented a new 
block report flow control system. It was designed by [~daryn]. It hasn't been 
fully tested yet, so we haven't shared it with the community.
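
For anyone tuning around this behavior, a few existing knobs bear on how big 
each report RPC is and how the leases are handed out. The sketch below lists 
the relevant hdfs-site.xml properties with their stock defaults, for 
orientation only, not as recommended values:
{code:xml}
<!-- Sketch of the relevant hdfs-site.xml properties (stock defaults shown). -->

<!-- DataNode side: if a DN holds fewer blocks than this, all storages go into
     a single FBR RPC; otherwise one RPC is sent per storage. -->
<property>
  <name>dfs.blockreport.split.threshold</name>
  <value>1000000</value>
</property>

<!-- NameNode side: how many full block report leases can be outstanding at
     once, and how long each lease stays valid before it expires. -->
<property>
  <name>dfs.namenode.max.full.block.report.leases</name>
  <value>6</value>
</property>
<property>
  <name>dfs.namenode.full.block.report.lease.length.ms</name>
  <value>300000</value>
</property>
{code}
None of these changes the retransmission behavior described above; they only 
shape the size of each report RPC and the lease lifetime.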




[jira] [Comment Edited] (HDFS-15901) Solve the problem of DN repeated block reports occupying too many RPCs during Safemode

2021-03-19 Thread Kihwal Lee (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304933#comment-17304933
 ] 

Kihwal Lee edited comment on HDFS-15901 at 3/19/21, 2:18 PM:
-

[~jianghuazhu], when it happens again, take a look at the Datanodes tab of the 
NN web UI. If you sort by number of blocks, you can figure out which nodes 
haven't sent an FBR yet. Those nodes will show a very low block count, or 0 for 
their block pool used percentage. You can try manually triggering an FBR for 
those nodes. This might work, but the block report lease manager can get in the 
way. In that case, the datanode can be restarted to force re-registration and 
obtain a new FBR lease.
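
For reference, the manual trigger can be issued from the command line roughly 
as follows; <dn_host>:<ipc_port> is a placeholder for the DataNode's IPC 
address (the one configured via dfs.datanode.ipc.address):
{code:bash}
# Ask one DataNode to schedule a full (non-incremental) block report to its
# NameNode(s). Replace <dn_host>:<ipc_port> with the DataNode's IPC address.
hdfs dfsadmin -triggerBlockReport <dn_host>:<ipc_port>
{code}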









[jira] [Comment Edited] (HDFS-15901) Solve the problem of DN repeated block reports occupying too many RPCs during Safemode

2023-01-07 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655752#comment-17655752
 ] 

Xing Lin edited comment on HDFS-15901 at 1/8/23 1:42 AM:
-

Do we have any follow-up on this issue?

We are seeing a similar issue at LinkedIn as well. The standby NN can get stuck 
in safe mode when it is restarted on some of our large clusters. When the NN is 
stuck in safe mode, the number of missing blocks is different each time and is 
small, from roughly 800 to 10K, so it does not look like we are missing an 
entire FBR. We are not sure what is causing the issue, but could the following 
hypothesis be the case?

In safe mode, the standby NN receives the first FBR from DN1/DN2/DN3. At a 
later time, blockA is deleted: it is removed from DN1/DN2/DN3, and they send a 
new incremental block report (IBR). However, the NN does not process these 
IBRs (for example, because it is paused by GC). Since the NN will not process 
any non-initial FBR from DN1/DN2/DN3 either, it never learns that blockA has 
already been removed from the cluster; blockA becomes a missing block that the 
NN waits on forever.

 









[jira] [Comment Edited] (HDFS-15901) Solve the problem of DN repeated block reports occupying too many RPCs during Safemode

2023-07-16 Thread Yanlei Yu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17743628#comment-17743628
 ] 

Yanlei Yu edited comment on HDFS-15901 at 7/17/23 6:00 AM:
---

In our cluster of 800+ nodes, after restarting the NameNode we found that some 
DataNodes did not report all of their blocks, causing the NameNode to stay in 
safe mode for a long time after the restart because of the incomplete block 
reports.
In the logs of a DataNode with an incomplete block report, I found that the 
first FBR attempt failed, possibly because of NameNode load, and a second FBR 
attempt was then made, as follows:
{code}
2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Unsuccessfully sent block report 0x6237a52c1e817e,  containing 12 storage 
report(s), of which we sent 1. The reports had 1099057 total blocks and used 1 
RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN 
processing. Got back no commands.
2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Successfully sent block report 0x62382416f3f055,  containing 12 storage 
report(s), of which we sent 12. The reports had 1099048 total blocks and used 
12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN 
processing. Got back no commands.
... {code}
There is nothing wrong with that in itself: retrying the send when it fails is 
reasonable. But on the NameNode side, the logic is:
{code:java}
// NameNode side, BlockManager#processReport: a non-initial report that arrives
// while the NN is still in startup safe mode is discarded, and the DataNode's
// block report lease is removed at the same time.
 if (namesystem.isInStartupSafeMode()
          && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
          && storageInfo.getBlockReportCount() > 0) {
        blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
            + "discarded non-initial block report from {}"
            + " because namenode still in startup phase",
            strBlockReportId, fullBrLeaseId, nodeID);
        // Removing the lease here means a DataNode retrying a partially failed
        // FBR no longer holds a valid lease for its remaining storages.
        blockReportLeaseManager.removeLease(node);
        return !node.hasStaleStorages();
      } {code}
When a storage is identified as having already reported, that is, 
storageInfo.getBlockReportCount() > 0, the NameNode removes the lease from that 
DataNode, so the retried (second) report fails because the DataNode no longer 
holds a valid lease.


