[ https://issues.apache.org/jira/browse/HDFS-17358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HDFS-17358:
----------------------------------
    Labels: pull-request-available  (was: )

> EC: infinite lease recovery caused by the length of RWR equals to zero.
> -----------------------------------------------------------------------
>
>                 Key: HDFS-17358
>                 URL: https://issues.apache.org/jira/browse/HDFS-17358
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ec
>            Reporter: farmmamba
>            Assignee: farmmamba
>            Priority: Major
>              Labels: pull-request-available
>
> Recently, a strange case occurred on our EC production cluster.
> The symptom is as follows: the NameNode performs lease recovery indefinitely
> on some EC files (~80K+), and those files can never be closed.
>
> After digging into the logs and related code, we found that the root cause is
> the following code in method `BlockRecoveryWorker$RecoveryTaskStriped#recover`:
> {code:java}
> // we hit info.getNumBytes() == 0 here!
> if (info != null &&
>     info.getGenerationStamp() >= block.getGenerationStamp() &&
>     info.getNumBytes() > 0) {
>   final BlockRecord existing = syncBlocks.get(blockId);
>   if (existing == null ||
>       info.getNumBytes() > existing.rInfo.getNumBytes()) {
>     // if we have >1 replicas for the same internal block, we
>     // simply choose the one with larger length.
>     // TODO: better usage of redundant replicas
>     syncBlocks.put(blockId, new BlockRecord(id, proxyDN, info));
>   }
> }
> // the exception is thrown here!
> checkLocations(syncBlocks.size());
> {code}
> The related logs are shown below:
> {code:java}
> java.io.IOException:
> BP-1157541496-10.104.10.198-1702548776421:blk_-9223372036808032688_2938828
> has no enough internal blocks, unable to start recovery. Locations=[...]
> {code}
> {code:java}
> 2024-01-23 12:48:16,171 INFO
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
> initReplicaRecovery: blk_-9223372036808032686_2938828, recoveryId=27615365,
> replica=ReplicaUnderRecovery, blk_-9223372036808032686_2938828, RUR
> getNumBytes() = 0 getBytesOnDisk() = 0 getVisibleLength()= -1 getVolume() =
> /data25/hadoop/hdfs/datanode getBlockURI() =
> file:/data25/hadoop/hdfs/datanode/current/BP-1157541496-x.x.x.x-1702548776421/current/rbw/blk_-9223372036808032686
> recoveryId=27529675 original=ReplicaWaitingToBeRecovered,
> blk_-9223372036808032686_2938828, RWR getNumBytes() = 0 getBytesOnDisk() = 0
> getVisibleLength()= -1 getVolume() = /data25/hadoop/hdfs/datanode
> getBlockURI() =
> file:/data25/hadoop/hdfs/datanode/current/BP-1157541496-10.104.10.198-1702548776421/current/rbw/blk_-9223372036808032686
> {code}
> Because the length of the RWR replica is zero, the ReplicaRecoveryInfo
> returned by the call below also has a length of zero. It therefore fails the
> info.getNumBytes() > 0 condition and is never put into syncBlocks, so the
> checkLocations method throws the exception.
> {code:java}
> ReplicaRecoveryInfo info = callInitReplicaRecovery(proxyDN,
>     new RecoveringBlock(internalBlk, null, recoveryId));
> {code}
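The failure path described in the report can be reproduced in isolation with a small, self-contained sketch. It does not use the real Hadoop classes: `Replica`, `selectSyncBlocks`, and the two-argument `checkLocations` are hypothetical stand-ins that mirror only the quoted condition, and the threshold of 6 internal blocks and the generation stamp are illustrative assumptions rather than values confirmed by the report.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the quoted filter; NOT the Hadoop implementation.
class RecoveryFilterSketch {

  /** Stand-in for the fields of a recovery reply that the quoted condition reads. */
  static final class Replica {
    final long blockId;
    final long generationStamp;
    final long numBytes;

    Replica(long blockId, long generationStamp, long numBytes) {
      this.blockId = blockId;
      this.generationStamp = generationStamp;
      this.numBytes = numBytes;
    }
  }

  /** Keeps, per internal block, the longest replica that passes the quoted condition. */
  static Map<Long, Replica> selectSyncBlocks(List<Replica> replies, long blockGenerationStamp) {
    Map<Long, Replica> syncBlocks = new HashMap<>();
    for (Replica info : replies) {
      if (info != null
          && info.generationStamp >= blockGenerationStamp
          && info.numBytes > 0) { // a zero-length replica is dropped here
        Replica existing = syncBlocks.get(info.blockId);
        if (existing == null || info.numBytes > existing.numBytes) {
          syncBlocks.put(info.blockId, info);
        }
      }
    }
    return syncBlocks;
  }

  /** Assumed shape of the check: fail when fewer than minRequired blocks survived the filter. */
  static void checkLocations(int locationCount, int minRequired) throws IOException {
    if (locationCount < minRequired) {
      throw new IOException("only " + locationCount
          + " usable internal blocks, need " + minRequired);
    }
  }

  public static void main(String[] args) {
    long generationStamp = 2938828L; // illustrative value
    int minRequired = 6;             // assumption: six internal blocks are required

    // Every reply reports length zero, like the RWR replica in the DataNode log.
    List<Replica> replies = new ArrayList<>();
    for (long id = 0; id < minRequired; id++) {
      replies.add(new Replica(id, generationStamp, 0L));
    }

    Map<Long, Replica> syncBlocks = selectSyncBlocks(replies, generationStamp);
    try {
      checkLocations(syncBlocks.size(), minRequired);
      System.out.println("recovery can start");
    } catch (IOException e) {
      System.out.println("recovery fails: " + e.getMessage());
    }
  }
}
```

With these inputs no reply passes the filter, so each attempt ends in the exception branch. Since nothing changes the replica length between attempts, a retry would see the same result, which is consistent with the endless lease recovery reported above.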