[ https://issues.apache.org/jira/browse/HDFS-17358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17819464#comment-17819464 ]
ASF GitHub Bot commented on HDFS-17358:
---------------------------------------

zhangshuyan0 commented on code in PR #6509:
URL: https://github.com/apache/hadoop/pull/6509#discussion_r1498579433


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockRecoveryWorker.java:
##########

@@ -436,9 +442,17 @@ protected void recover() throws IOException {
              "datanode={})", block, internalBlk, id, e);
        }
      }
-      checkLocations(syncBlocks.size());
-      final long safeLength = getSafeLength(syncBlocks);
+      final long safeLength;
+      if (dnNotHaveReplicaCnt + zeroLenReplicaCnt <= locs.length - ecPolicy.getNumDataUnits()) {
+        checkLocations(syncBlocks.size());
+        safeLength = getSafeLength(syncBlocks);
+      } else {
+        safeLength = 0;
+        LOG.warn("Block recovery: More than {} datanodes do not have the replica of block {}."

Review Comment:
   What does this "More than" mean?


> EC: infinite lease recovery caused by the length of RWR equals to zero.
> ------------------------------------------------------------------------
>
>                 Key: HDFS-17358
>                 URL: https://issues.apache.org/jira/browse/HDFS-17358
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ec
>            Reporter: farmmamba
>            Assignee: farmmamba
>            Priority: Major
>              Labels: pull-request-available
>
> Recently, a strange case happened on our EC production cluster: the NameNode kept retrying lease recovery for some EC files (~80K+) indefinitely, and those files could never be closed.
>
> After digging into the logs and related code, we found the root cause in the following code in `BlockRecoveryWorker$RecoveryTaskStriped#recover`:
> {code:java}
> // we hit info.getNumBytes() == 0 here!
> if (info != null &&
>     info.getGenerationStamp() >= block.getGenerationStamp() &&
>     info.getNumBytes() > 0) {
>   final BlockRecord existing = syncBlocks.get(blockId);
>   if (existing == null ||
>       info.getNumBytes() > existing.rInfo.getNumBytes()) {
>     // if we have >1 replicas for the same internal block, we
>     // simply choose the one with larger length.
>     // TODO: better usage of redundant replicas
>     syncBlocks.put(blockId, new BlockRecord(id, proxyDN, info));
>   }
> }
> // the exception is thrown here!
> checkLocations(syncBlocks.size());
> {code}
> The related logs are as follows:
> {code:java}
> java.io.IOException: BP-1157541496-10.104.10.198-1702548776421:blk_-9223372036808032688_2938828 has no enough internal blocks, unable to start recovery. Locations=[...]
> {code}
> {code:java}
> 2024-01-23 12:48:16,171 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: initReplicaRecovery: blk_-9223372036808032686_2938828, recoveryId=27615365, replica=ReplicaUnderRecovery, blk_-9223372036808032686_2938828, RUR
>   getNumBytes()     = 0
>   getBytesOnDisk()  = 0
>   getVisibleLength()= -1
>   getVolume()       = /data25/hadoop/hdfs/datanode
>   getBlockURI()     = file:/data25/hadoop/hdfs/datanode/current/BP-1157541496-x.x.x.x-1702548776421/current/rbw/blk_-9223372036808032686
>   recoveryId=27529675
>   original=ReplicaWaitingToBeRecovered, blk_-9223372036808032686_2938828, RWR
>   getNumBytes()     = 0
>   getBytesOnDisk()  = 0
>   getVisibleLength()= -1
>   getVolume()       = /data25/hadoop/hdfs/datanode
>   getBlockURI()     = file:/data25/hadoop/hdfs/datanode/current/BP-1157541496-10.104.10.198-1702548776421/current/rbw/blk_-9223372036808032686
> {code}
> Because the length of the RWR replica is zero, the length of the ReplicaRecoveryInfo returned by the code below is also zero. We can't put it into syncBlocks, so the checkLocations method throws the exception above.
> {code:java}
> ReplicaRecoveryInfo info = callInitReplicaRecovery(proxyDN,
>     new RecoveringBlock(internalBlk, null, recoveryId));
> {code}
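A minimal, self-contained sketch of the guard added in PR #6509 may help pin down the "More than" wording the reviewer asks about. This is not the actual BlockRecoveryWorker code; the class, method, and parameter names below are made up for illustration, and it only mirrors the shape of the PR's condition `dnNotHaveReplicaCnt + zeroLenReplicaCnt <= locs.length - ecPolicy.getNumDataUnits()`. For example, with RS-6-3 and nine recovery locations, numDataUnits is 6, so recovery can tolerate at most locs.length - numDataUnits = 3 internal blocks that are missing or have length zero:

{code:java}
/**
 * Illustrative sketch only; not the actual BlockRecoveryWorker logic.
 */
public class SafeLengthGuardSketch {

  /**
   * @param totalLocations       number of recovery locations (locs.length)
   * @param numDataUnits         data units of the EC policy (e.g. 6 for RS-6-3)
   * @param missingReplicaCnt    datanodes that report no replica at all
   * @param zeroLengthReplicaCnt datanodes whose replica (e.g. a zero-length RWR) has numBytes == 0
   * @return true if enough usable internal blocks remain to compute a safe length
   */
  static boolean canComputeSafeLength(int totalLocations, int numDataUnits,
      int missingReplicaCnt, int zeroLengthReplicaCnt) {
    int unusable = missingReplicaCnt + zeroLengthReplicaCnt;
    return unusable <= totalLocations - numDataUnits;
  }

  public static void main(String[] args) {
    // RS-6-3 with 9 locations: up to 3 unusable internal blocks are tolerable.
    System.out.println(canComputeSafeLength(9, 6, 1, 2)); // true: 3 <= 3, recovery proceeds
    // 4 unusable internal blocks: the PR falls back to safeLength = 0 instead of
    // letting checkLocations throw, which is what previously made the lease
    // recovery retry forever.
    System.out.println(canComputeSafeLength(9, 6, 2, 2)); // false: 4 > 3
  }
}
{code}

Under this reading, the warning fires when the count of unusable internal blocks exceeds the parity count (locs.length - numDataUnits), so one way to resolve the review comment would be to state that threshold explicitly in the log message.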