[ 
https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744444#comment-17744444
 ] 

Yanlei Yu commented on HDFS-17093:
----------------------------------

{quote} Would you try to dig the log information at NameNode side when the 
second FBR from DataNode?
{quote}
[~hexiaoqiao] The namenode log looks like this:

 
{code:java}
DEBUG blockmanagement.BlockReportLeaseManager: Created a new BR lease 
0x954310c48050d79f for DN b8a8f403-4e3e-4a8a-ac51-bb512c83186b.  numPending = 1
TRACE blockmanagement.BlockReportLeaseManager: BR lease 0x954310c48050d79f is 
valid for DN b8a8f403-4e3e-4a8a-ac51-bb512c83186b.
INFO BlockStateChange: BLOCK* processReport 0x13dffe1dc2f6199 with lease ID 
0x954310c48050d79f: discarded non-initial block report from 
DatanodeRegistration(*.*.*.*:50010, 
datanodeUuid=b8a8f403-4e3e-4a8a-ac51-bb512c83186b, infoPort=50075, 
infoSecurePort=0, ipcPort=50020, 
storageInfo=lv=-56;cid=CID-c9a83fd3-b70f-498a-be5e-f2d5a34b5aee;nsid=902715697;c=0)
 because namenode still in startup phase
TRACE blockmanagement.BlockReportLeaseManager: Removed BR lease 
0x954310c48050d79f for DN b8a8f403-4e3e-4a8a-ac51-bb512c83186b.  numPending = 0
WARN blockmanagement.BlockReportLeaseManager: BR lease 0x954310c48050d79f is 
not valid for DN b8a8f403-4e3e-4a8a-ac51-bb512c83186b, because the DN is not in 
the pending set.
WARN blockmanagement.BlockReportLeaseManager: BR lease 0x954310c48050d79f is 
not valid for DN b8a8f403-4e3e-4a8a-ac51-bb512c83186b, because the DN is not in 
the pending set.
WARN blockmanagement.BlockReportLeaseManager: BR lease 0x954310c48050d79f is 
not valid for DN b8a8f403-4e3e-4a8a-ac51-bb512c83186b, because the DN is not in 
the pending set.
WARN blockmanagement.BlockReportLeaseManager: BR lease 0x954310c48050d79f is 
not valid for DN b8a8f403-4e3e-4a8a-ac51-bb512c83186b, because the DN is not in 
the pending set.
WARN blockmanagement.BlockReportLeaseManager: BR lease 0x954310c48050d79f is 
not valid for DN b8a8f403-4e3e-4a8a-ac51-bb512c83186b, because the DN is not in 
the pending set.
WARN blockmanagement.BlockReportLeaseManager: BR lease 0x954310c48050d79f is 
not valid for DN b8a8f403-4e3e-4a8a-ac51-bb512c83186b, because the DN is not in 
the pending set.
WARN blockmanagement.BlockReportLeaseManager: BR lease 0x954310c48050d79f is 
not valid for DN b8a8f403-4e3e-4a8a-ac51-bb512c83186b, because the DN is not in 
the pending set.
WARN blockmanagement.BlockReportLeaseManager: BR lease 0x954310c48050d79f is 
not valid for DN b8a8f403-4e3e-4a8a-ac51-bb512c83186b, because the DN is not in 
the pending set.
WARN blockmanagement.BlockReportLeaseManager: BR lease 0x954310c48050d79f is 
not valid for DN b8a8f403-4e3e-4a8a-ac51-bb512c83186b, because the DN is not in 
the pending set.
WARN blockmanagement.BlockReportLeaseManager: BR lease 0x954310c48050d79f is 
not valid for DN b8a8f403-4e3e-4a8a-ac51-bb512c83186b, because the DN is not in 
the pending set.
WARN blockmanagement.BlockReportLeaseManager: BR lease 0x954310c48050d79f is 
not valid for DN b8a8f403-4e3e-4a8a-ac51-bb512c83186b, because the DN is not in 
the pending set. {code}
{quote}would you mind to submit PR via Github if need?
{quote}
I tried to push a commit on GitHub, but it seems that I do not have permission. 
This is my first time submitting a patch, and I am not sure how to proceed.

 

> In the case of all datanodes sending FBR when the namenode restarts (large 
> clusters), there is an issue with incomplete block reporting
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-17093
>                 URL: https://issues.apache.org/jira/browse/HDFS-17093
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.3.4
>            Reporter: Yanlei Yu
>            Priority: Minor
>         Attachments: HDFS-17093.patch
>
>
> In our cluster of 800+ nodes, after restarting the namenode, we found that 
> some datanodes did not report enough blocks, causing the namenode to stay in 
> safe mode for a long time after the restart because of incomplete block 
> reporting.
> In the logs of a datanode with incomplete block reporting, I found that the 
> first FBR attempt failed, possibly due to namenode load, and then a second 
> FBR attempt was made, as follows:
> {code:java}
> ....
> 2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Unsuccessfully sent block report 0x6237a52c1e817e,  containing 12 storage 
> report(s), of which we sent 1. The reports had 1099057 total blocks and used 
> 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN 
> processing. Got back no commands.
> 2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Successfully sent block report 0x62382416f3f055,  containing 12 storage 
> report(s), of which we sent 12. The reports had 1099048 total blocks and used 
> 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN 
> processing. Got back no commands. {code}
> There is nothing wrong with that: the datanode retries the send if it fails. 
> But on the namenode side, the logic is:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   blockReportLeaseManager.removeLease(node);
>   return !node.hasStaleStorages();
> } {code}
> When a storage is identified as having already reported, namely 
> storageInfo.getBlockReportCount() > 0, the namenode removes the datanode's 
> lease, so the rest of the second report fails because there is no lease.
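
To make the sequence above concrete, here is a minimal toy model of it in Java. It is not Hadoop code: the class `LeaseRemovalSketch`, its `Outcome` enum, and the methods `grantLease`, `processStorageReport`, and `replay` are hypothetical names invented for illustration, and the pending-lease set and per-storage counter are simplified stand-ins for `BlockReportLeaseManager` and `storageInfo.getBlockReportCount()`. It also assumes that the single storage report sent during the failed first attempt was counted by the namenode before the RPC failed on the datanode side.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of the lease handling described above; not Hadoop code. */
public class LeaseRemovalSketch {
    enum Outcome { PROCESSED, DISCARDED_NON_INITIAL, REJECTED_NO_LEASE }

    // Hypothetical stand-ins for the lease manager's pending set and for
    // the per-storage value returned by storageInfo.getBlockReportCount().
    private final Set<String> pendingLeases = new HashSet<>();
    private final Map<String, Integer> reportCount = new HashMap<>();

    void grantLease(String dn) {
        pendingLeases.add(dn);
    }

    Outcome processStorageReport(String dn, String storage, boolean inStartupSafeMode) {
        // A report without a pending lease is rejected outright.
        if (!pendingLeases.contains(dn)) {
            return Outcome.REJECTED_NO_LEASE;
        }
        // Mirrors the quoted branch: a non-initial report during startup safe
        // mode is discarded AND the lease of the whole datanode is removed.
        if (inStartupSafeMode && reportCount.getOrDefault(storage, 0) > 0) {
            pendingLeases.remove(dn);
            return Outcome.DISCARDED_NON_INITIAL;
        }
        reportCount.merge(storage, 1, Integer::sum);
        return Outcome.PROCESSED;
    }

    /** Replays the scenario: one storage counted in attempt 1, then one RPC per storage in attempt 2. */
    static List<Outcome> replay(int storages) {
        LeaseRemovalSketch nn = new LeaseRemovalSketch();
        String dn = "dn-1";
        nn.grantLease(dn);
        // Attempt 1: assumed to be counted by the namenode before the RPC failed.
        nn.processStorageReport(dn, "storage-0", true);
        // Attempt 2: the datanode obtains a lease again and sends every storage.
        nn.grantLease(dn);
        List<Outcome> outcomes = new ArrayList<>();
        for (int i = 0; i < storages; i++) {
            outcomes.add(nn.processStorageReport(dn, "storage-" + i, true));
        }
        return outcomes;
    }

    public static void main(String[] args) {
        List<Outcome> outcomes = replay(12);
        for (int i = 0; i < outcomes.size(); i++) {
            System.out.println("storage-" + i + ": " + outcomes.get(i));
        }
    }
}
```

With 12 storages, the model discards the report for the storage that was already counted and rejects the remaining 11 for lack of a lease. That matches the namenode log in the comment above, which shows one "discarded non-initial block report" line followed by 11 "not valid ... not in the pending set" warnings.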



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

