[
https://issues.apache.org/jira/browse/HDFS-17358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tao Li resolved HDFS-17358.
---------------------------
Fix Version/s: 3.5.0
Resolution: Fixed
> EC: infinite lease recovery caused by the length of RWR being equal to zero
> ---------------------------------------------------------------------------
>
> Key: HDFS-17358
> URL: https://issues.apache.org/jira/browse/HDFS-17358
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ec
> Reporter: farmmamba
> Assignee: farmmamba
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.5.0
>
>
> Recently, a strange case happened on our EC production cluster. The
> phenomenon is as follows: the NameNode performs lease recovery on some
> EC files (~80K+) indefinitely, and those files can never be closed.
>
> After digging into the logs and the related code, we found that the root
> cause is the following code in the method
> `BlockRecoveryWorker$RecoveryTaskStriped#recover`:
> {code:java}
> // we met info.getNumBytes() == 0 here!
> if (info != null &&
>     info.getGenerationStamp() >= block.getGenerationStamp() &&
>     info.getNumBytes() > 0) {
>   final BlockRecord existing = syncBlocks.get(blockId);
>   if (existing == null ||
>       info.getNumBytes() > existing.rInfo.getNumBytes()) {
>     // if we have >1 replicas for the same internal block, we
>     // simply choose the one with larger length.
>     // TODO: better usage of redundant replicas
>     syncBlocks.put(blockId, new BlockRecord(id, proxyDN, info));
>   }
> }
> // the exception is thrown here!
> checkLocations(syncBlocks.size());
> {code}
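> To make the failure mode concrete, below is a minimal, self-contained
> sketch of the guard's effect (class and member names such as ReplicaInfo
> are illustrative stand-ins, not the real Hadoop types, and the RS-6-3
> policy is an assumption): because every RWR replica reports numBytes == 0,
> nothing ever enters syncBlocks.
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
>
> // Illustrative sketch only; not the Hadoop source.
> public class ZeroLengthRwrSketch {
>   // Hypothetical stand-in for ReplicaRecoveryInfo, reduced to the
>   // fields the guard consults.
>   record ReplicaInfo(long blockId, long genStamp, long numBytes) {}
>
>   public static void main(String[] args) {
>     long blockGenStamp = 2938828L;
>     int numDataUnits = 6; // assumed RS-6-3 policy, for illustration
>
>     // Every internal replica answers with an RWR of length 0, as in
>     // the DataNode logs below.
>     ReplicaInfo[] replies = new ReplicaInfo[9];
>     for (int i = 0; i < replies.length; i++) {
>       replies[i] = new ReplicaInfo(-9223372036808032688L + i,
>           blockGenStamp, 0L);
>     }
>
>     Map<Long, ReplicaInfo> syncBlocks = new HashMap<>();
>     for (ReplicaInfo info : replies) {
>       // Mirrors the guard in RecoveryTaskStriped#recover: a
>       // zero-length replica never enters syncBlocks.
>       if (info != null
>           && info.genStamp() >= blockGenStamp
>           && info.numBytes() > 0) {
>         syncBlocks.merge(info.blockId(), info,
>             (a, b) -> b.numBytes() > a.numBytes() ? b : a);
>       }
>     }
>
>     // Prints 0 usable locations; with fewer than numDataUnits the
>     // real code throws "has no enough internal blocks".
>     System.out.println("usable locations: " + syncBlocks.size()
>         + " (need at least " + numDataUnits + ")");
>   }
> }
> {code}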
> The related logs are as follows:
> {code:java}
> java.io.IOException:
> BP-1157541496-10.104.10.198-1702548776421:blk_-9223372036808032688_2938828
> has no enough internal blocks, unable to start recovery. Locations=[...]
> {code}
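> The check that raises this exception compares the number of usable
> locations against the EC policy's data-unit count. A hedged sketch of
> that guard (the signature and threshold here are assumptions, not the
> exact code in BlockRecoveryWorker):
> {code:java}
> import java.io.IOException;
>
> // Simplified sketch; assumes the threshold is the EC policy's
> // data-unit count (6 for RS-6-3).
> class CheckLocationsSketch {
>   static void checkLocations(int locationCount, int numDataUnits,
>       String block) throws IOException {
>     if (locationCount < numDataUnits) {
>       // Same wording as the message in the log above.
>       throw new IOException(block
>           + " has no enough internal blocks, unable to start recovery.");
>     }
>   }
> }
> {code}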
> {code:java}
> 2024-01-23 12:48:16,171 INFO
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
> initReplicaRecovery: blk_-9223372036808032686_2938828, recoveryId=27615365,
> replica=ReplicaUnderRecovery, blk_-9223372036808032686_2938828, RUR
> getNumBytes() = 0 getBytesOnDisk() = 0 getVisibleLength()= -1 getVolume() =
> /data25/hadoop/hdfs/datanode getBlockURI() =
> file:/data25/hadoop/hdfs/datanode/current/BP-1157541496-x.x.x.x-1702548776421/current/rbw/blk_-9223372036808032686
> recoveryId=27529675 original=ReplicaWaitingToBeRecovered,
> blk_-9223372036808032686_2938828, RWR getNumBytes() = 0 getBytesOnDisk() = 0
> getVisibleLength()= -1 getVolume() = /data25/hadoop/hdfs/datanode
> getBlockURI() =
> file:/data25/hadoop/hdfs/datanode/current/BP-1157541496-10.104.10.198-1702548776421/current/rbw/blk_-9223372036808032686
> {code}
> Because the length of the RWR replica is zero, the length of the object
> returned by the code below is also zero, so it is never put into
> syncBlocks, and the checkLocations method then throws the exception shown
> above.
> {code:java}
> ReplicaRecoveryInfo info = callInitReplicaRecovery(proxyDN,
>     new RecoveringBlock(internalBlk, null, recoveryId));
> {code}
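> This also explains why the recovery is infinite rather than a one-off
> failure: each attempt gets a fresh recoveryId (27615365 superseding
> 27529675 in the log above), hits the same zero-length RWR, and throws
> before commitBlockSynchronization can run, so the lease stays open and
> the monitor schedules the next attempt. A purely illustrative sketch of
> that retry shape (assumed names, not the Hadoop internals):
> {code:java}
> import java.io.IOException;
>
> // Illustrative only: recovery that always throws never reaches
> // commitBlockSynchronization, so the lease is never released.
> public class InfiniteRecoverySketch {
>   // Stand-in for RecoveryTaskStriped#recover: always fails because no
>   // usable locations survive the numBytes > 0 guard.
>   static void recover(long recoveryId) throws IOException {
>     throw new IOException("has no enough internal blocks, unable to"
>         + " start recovery (recoveryId=" + recoveryId + ")");
>   }
>
>   public static void main(String[] args) {
>     long recoveryId = 27529675L;
>     // Bounded here for demonstration; the real lease monitor has no
>     // such bound, which is exactly the reported symptom.
>     for (int attempt = 1; attempt <= 3; attempt++) {
>       try {
>         recover(recoveryId++);
>       } catch (IOException e) {
>         System.out.println("attempt " + attempt + " failed: "
>             + e.getMessage());
>       }
>     }
>   }
> }
> {code}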