[ 
https://issues.apache.org/jira/browse/HDFS-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304503#comment-17304503
 ] 

Kihwal Lee edited comment on HDFS-15901 at 3/18/21, 10:32 PM:
--------------------------------------------------------------

The block report lease feature is supposed to improve this, but in our 
experience it ended up causing more problems.  One of the main causes of 
duplicate reporting is the inability to retransmit a single storage report on 
an RPC timeout.  On startup, the NN's call queue can easily be overwhelmed 
since FBR processing is relatively slow. It is common to see the processing of 
a single storage take 100s of milliseconds, so a half dozen storage reports 
can take a whole second. You can easily imagine more than 60 seconds' worth of 
reports waiting in the call queue, which will cause a timeout for some of 
them.  Unfortunately, the datanode's full block reporting does not retransmit 
only the affected report.  It regenerates the whole thing and starts over.  
Even if only the last storage's FBR ran into trouble, everything is 
retransmitted.
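As a rough back-of-the-envelope sketch of the queueing math above (the per-report cost comes from the comment; the datanode count and timeout are assumed values, not measurements from any particular cluster):

```java
// Hypothetical illustration of FBR work piling up in the NN call queue at
// startup. Numbers marked "assumed" are not from the source comment.
public class FbrQueueSketch {
    public static void main(String[] args) {
        double msPerStorageReport = 150.0; // "100s of milliseconds" per storage
        int storagesPerDatanode   = 6;     // "a half dozen storage reports"
        int reportingDatanodes    = 100;   // assumed; big clusters have thousands

        // The block report processor handles reports serially, so queued
        // work is roughly additive across datanodes and storages.
        double queuedSeconds =
            msPerStorageReport * storagesPerDatanode * reportingDatanodes / 1000.0;
        System.out.println("queued work: " + queuedSeconds + " s"); // → 90.0 s

        double rpcTimeoutSeconds = 60.0;   // assumed client RPC timeout
        // Reports at the tail of the queue wait longer than the timeout,
        // so the DN gives up and regenerates *all* of its storage reports.
        System.out.println("exceeds timeout: "
            + (queuedSeconds > rpcTimeoutSeconds)); // → true
    }
}
```

Even at this modest assumed scale the tail of the queue is well past the timeout, which is why the all-or-nothing retransmission described above multiplies the load instead of shedding it.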

The reason it sometimes gets stuck in safe mode is likely the curse of the 
block report lease. When an FBR is retransmitted, the feature makes the NN 
drop the reports.  We have seen this happen in big clusters.  If the block 
report lease weren't there, it wouldn't have gotten stuck in safe mode.
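A minimal sketch of that failure mode (this is NOT the real BlockReportLeaseManager; it only models the "DN is not in the pending set" case visible in the logs on this issue):

```java
import java.util.HashSet;
import java.util.Set;

// Illustration only: a lease is consumed by the first (possibly partially
// processed) FBR, so the retransmitted FBR arrives without a valid lease
// and is discarded, leaving the NN short of blocks to exit safe mode.
public class LeaseCheckSketch {
    private final Set<String> pendingDatanodes = new HashSet<>();

    // The real manager also checks lease id and expiry; here we model only
    // membership in the pending set.
    boolean checkLease(String dnUuid) {
        return pendingDatanodes.remove(dnUuid);
    }

    public static void main(String[] args) {
        LeaseCheckSketch mgr = new LeaseCheckSketch();
        mgr.pendingDatanodes.add("dn-1");

        // First FBR consumes the lease...
        System.out.println(mgr.checkLease("dn-1")); // → true (accepted)

        // ...so when the DN times out and retransmits everything,
        // the retransmitted report is dropped.
        System.out.println(mgr.checkLease("dn-1")); // → false (discarded)
    }
}
```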

We have recently gutted the FBR lease feature internally and implemented a new 
block report flow control system.  It was designed by [~daryn].  It hasn't 
been fully tested yet, so we haven't shared it with the community. 



> Solve the problem of DN repeated block reports occupying too many RPCs during 
> Safemode
> --------------------------------------------------------------------------------------
>
>                 Key: HDFS-15901
>                 URL: https://issues.apache.org/jira/browse/HDFS-15901
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: JiangHua Zhu
>            Assignee: JiangHua Zhu
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> When the cluster exceeds thousands of nodes and we restart the NameNode 
> service, all DataNodes send a full block report to the NameNode. During 
> Safe Mode, some DataNodes may send their block reports to the NameNode 
> multiple times, which takes up too many RPC calls. In fact, this is 
> unnecessary.
> In this case, some block report leases will fail or time out, and in extreme 
> cases, the NameNode will stay in Safe Mode indefinitely.
> 2021-03-14 08:16:25,873 [78438700] - INFO  [Block report 
> processor:BlockManager@2158] - BLOCK* processReport 0xexxxxxxxx: discarded 
> non-initial block report from DatanodeRegistration(xxxxxxxx:port, 
> datanodeUuid=xxxxxxxx, infoPort=xxxxxxxx, infoSecurePort=xxxxxxxx, 
> ipcPort=xxxxxxxx, storageInfo=lv=xxxxxxxx;nsid=xxxxxxxx;c=0) because namenode 
> still in startup phase
> 2021-03-14 08:16:31,521 [78444348] - INFO  [Block report 
> processor:BlockManager@2158] - BLOCK* processReport 0xexxxxxxxx: discarded 
> non-initial block report from DatanodeRegistration(xxxxxxxx, 
> datanodeUuid=xxxxxxxx, infoPort=xxxxxxxx, infoSecurePort=xxxxxxxx, 
> ipcPort=xxxxxxxx, storageInfo=lv=xxxxxxxx;nsid=xxxxxxxx;c=0) because namenode 
> still in startup phase
> 2021-03-13 18:35:38,200 [29191027] - WARN  [Block report 
> processor:BlockReportLeaseManager@311] - BR lease 0xxxxxxxxx is not valid for 
> DN xxxxxxxx, because the DN is not in the pending set.
> 2021-03-13 18:36:08,143 [29220970] - WARN  [Block report 
> processor:BlockReportLeaseManager@311] - BR lease 0xxxxxxxxx is not valid for 
> DN xxxxxxxx, because the DN is not in the pending set.
> 2021-03-13 18:36:08,143 [29220970] - WARN  [Block report 
> processor:BlockReportLeaseManager@317] - BR lease 0xxxxxxxxx is not valid for 
> DN xxxxxxxx, because the lease has expired.
> 2021-03-13 18:36:08,145 [29220972] - WARN  [Block report 
> processor:BlockReportLeaseManager@317] - BR lease 0xxxxxxxxx is not valid for 
> DN xxxxxxxx, because the lease has expired.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
