[ 
https://issues.apache.org/jira/browse/HDFS-17604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-17604:
-----------------------------
    Environment:     (was: We hit a corner case in which deleting an EC block 
under an HDFS snapshot can crash the NameNode.

The error stack trace:
{noformat}
2024-07-10 23:17:47,665 ERROR 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception 
on operation DeleteOp [length=0, path=xxxx, timestamp=1720678635100, 
RpcClientId=5161c587-9102-41cf-b823-fe618db9ab4c, RpcCallId=177, 
opCode=OP_DELETE, txid=55577688248]
java.lang.IllegalStateException
        at 
org.apache.hadoop.thirdparty.com.google.common.base.Preconditions.checkState(Preconditions.java:494)
        at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.collectBlocksBeyondSnapshot(INodeFile.java:1225)
        at 
org.apache.hadoop.hdfs.server.namenode.snapshot.FileWithSnapshotFeature.collectBlocksAndClear(FileWithSnapshotFeature.java:240)
        at 
org.apache.hadoop.hdfs.server.namenode.snapshot.FileWithSnapshotFeature.cleanFile(FileWithSnapshotFeature.java:134)
        at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.cleanSubtree(INodeFile.java:754)
        at 
org.apache.hadoop.hdfs.server.namenode.INodeReference$DstReference.destroyAndCollectBlocks(INodeReference.java:714)
        at 
org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$ChildrenDiff.destroyCreatedList(DirectoryWithSnapshotFeature.java:75)
        at 
org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$ChildrenDiff.access$800(DirectoryWithSnapshotFeature.java:48)
        at 
org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.destroyDstSubtree(DirectoryWithSnapshotFeature.java:423)
        at 
org.apache.hadoop.hdfs.server.namenode.INodeReference$DstReference.destroyAndCollectBlocks(INodeReference.java:720)
        at 
org.apache.hadoop.hdfs.server.namenode.FSDirDeleteOp.unprotectedDelete(FSDirDeleteOp.java:258)
        at 
org.apache.hadoop.hdfs.server.namenode.FSDirDeleteOp.deleteForEditLog(FSDirDeleteOp.java:143)
        at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:630)
        at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:288)
        at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:183)
        at 
org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:915)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:364)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:505)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$400(EditLogTailer.java:451)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:468)
{noformat}
This happens because the code assumes that EC block deletion never reaches the 
truncate logic below, since EC files do not support truncate.

*INodeFile#collectBlocksBeyondSnapshot*
{noformat}
/**
 * This function is only called when block list is stored in snapshot
 * diffs. Note that this can only happen when truncation happens with
 * snapshots. Since we do not support truncation with striped blocks,
 * we only need to handle contiguous blocks here.
 */
public void collectBlocksBeyondSnapshot(BlockInfo[] snapshotBlocks,
                                        BlocksMapUpdateInfo collectedBlocks) {
  Preconditions.checkState(!isStriped());   // <=== exception thrown here
  BlockInfo[] oldBlocks = getBlocks();
  if(snapshotBlocks == null || oldBlocks == null)
    return;
  ...
  }
}
{noformat}
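To make the failure mode concrete, here is a minimal, self-contained sketch of how such a guard behaves. It is a toy model and not Hadoop code: the names StripedGuardSketch, ToyFile and guardTrips are hypothetical, and the collection loop is a simplification of what the real method does after the check.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the striped-file guard; all names here are hypothetical. */
public class StripedGuardSketch {

    /** Stand-in for INodeFile that tracks only the layout and block IDs. */
    static final class ToyFile {
        private final boolean striped;
        private final long[] blocks;

        ToyFile(boolean striped, long... blocks) {
            this.striped = striped;
            this.blocks = blocks;
        }

        boolean isStriped() {
            return striped;
        }

        /** Guard first, then collect the blocks beyond the snapshot's block list. */
        List<Long> collectBlocksBeyondSnapshot(long[] snapshotBlocks) {
            if (isStriped()) {
                // Same effect as Preconditions.checkState(!isStriped()).
                throw new IllegalStateException();
            }
            List<Long> collected = new ArrayList<>();
            if (snapshotBlocks == null || blocks == null) {
                return collected;
            }
            // Simplification: treat every block past the snapshot length as removable.
            for (int i = snapshotBlocks.length; i < blocks.length; i++) {
                collected.add(blocks[i]);
            }
            return collected;
        }
    }

    /** Returns true if the guard rejects the file with IllegalStateException. */
    static boolean guardTrips(ToyFile file, long[] snapshotBlocks) {
        try {
            file.collectBlocksBeyondSnapshot(snapshotBlocks);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        ToyFile contiguous = new ToyFile(false, 1L, 2L, 3L);
        ToyFile striped = new ToyFile(true, 10L);
        System.out.println("contiguous trips guard: " + guardTrips(contiguous, new long[] {1L}));
        System.out.println("striped trips guard: " + guardTrips(striped, new long[] {10L}));
    }
}
```

In this model a contiguous file passes the guard, while any striped file that reaches the method fails immediately, which matches the bare IllegalStateException in the stack trace above.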
However, there is a special case in which EC block deletion under a snapshot 
does reach this code. The issue can be reproduced with the following steps:

1) Create an EC folder with HDFS snapshots enabled, and start a DistCp job 
that copies data into it.
2) While the EC data is being written, create a new snapshot.
3) Kill the running DistCp job that was submitted in step 1.
4) Delete the broken EC file that was being copied in the steps above. The 
Standby NN will fail with the error above.)

> EC block deletion under snapshot makes NameNode crashed
> -------------------------------------------------------
>
>                 Key: HDFS-17604
>                 URL: https://issues.apache.org/jira/browse/HDFS-17604
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ec, erasure-coding
>    Affects Versions: 3.3.3
>            Reporter: Yiqun Lin
>            Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
