[ https://issues.apache.org/jira/browse/HDFS-9052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14739734#comment-14739734 ]

Jing Zhao edited comment on HDFS-9052 at 9/10/15 10:27 PM:
-----------------------------------------------------------

A possible scenario is as follows (a sketch of the sequence follows the list):
1. We delete a file "foo"; since it belongs to snapshot s1, it is recorded in 
s1's deleted list.
2. Under the same directory, we perform all the steps listed in the 
description of HDFS-6908, and the file created in step 3 is also named "foo". 
Suppose the snapshots created in steps 2 and 4 are named s2 and s3.
3. Because of the bug reported in HDFS-6908, the later "foo" may be left in 
s2's deleted list. Then, when you try to delete s2, you hit the above 
exception.
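
To make the ordering concrete, here is a rough sketch of the client-side 
sequence against the FileSystem snapshot API. The directory path is made up, 
and the HDFS-6908 steps are abbreviated to just the operations named above, 
so treat this as an illustration of the ordering rather than a guaranteed 
reproduction:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SnapshotDeleteRepro {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Hypothetical snapshottable directory; enable snapshots first with
    // `hdfs dfsadmin -allowSnapshot /dir`.
    Path dir = new Path("/dir");
    Path foo = new Path(dir, "foo");

    // Step 1: "foo" exists, s1 is taken, then "foo" is deleted, so
    // "foo" is recorded in s1's deleted list.
    fs.create(foo).close();
    fs.createSnapshot(dir, "s1");
    fs.delete(foo, false);

    // Step 2 (HDFS-6908 steps, abbreviated): snapshot s2, recreate a
    // file with the same name, snapshot s3, then delete the file.
    fs.createSnapshot(dir, "s2");
    fs.create(foo).close();      // the later "foo"
    fs.createSnapshot(dir, "s3");
    fs.delete(foo, false);

    // Step 3: on a build with the HDFS-6908 bug, the stale entry can
    // make this fail with "Element already exists: element=foo".
    fs.deleteSnapshot(dir, "s2");
  }
}
{code}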

The bug fixed in HDFS-6908 is that an INode which should have been cleared 
was wrongly left in the deleted list. Because we changed the fsimage format 
in release 2.4, the fsimage now records only the INode ID in the deleted list 
and uses that ID to look up the INode map. Since the real INode has already 
been cleared from the INode map, the lookup hits an NPE; you will not see the 
NPE when loading the fsimage in 2.3. Note that this conflict happens only 
when you have files with the same name ("foo" in the above example).

But the above example is just one possible scenario; it is still possible 
that the issue is caused by some other bug. To bypass the issue, you may need 
to apply a temporary patch that ignores the INode in the later snapshot's 
deleted list.
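
As a rough illustration of what "ignore the INode" could mean, the toy class 
below keeps a deleted list that logs and skips a duplicate entry instead of 
failing the "Element already exists" assertion. It is a hypothetical shape 
for such a workaround, not the actual patch to 
org.apache.hadoop.hdfs.util.Diff:

{code:java}
import java.util.ArrayList;
import java.util.List;

public class LenientDeletedList {
  private final List<String> deleted = new ArrayList<>();

  public void delete(String name) {
    if (deleted.contains(name)) {
      // Where the stock code asserts, the temporary workaround would
      // ignore the duplicate INode and keep going.
      System.err.println("Ignoring duplicate deleted entry: " + name);
      return;
    }
    deleted.add(name);
  }

  public static void main(String[] args) {
    LenientDeletedList d = new LenientDeletedList();
    d.delete("foo");
    d.delete("foo");  // logged and skipped rather than AssertionError
  }
}
{code}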



> deleteSnapshot runs into AssertionError
> ---------------------------------------
>
>                 Key: HDFS-9052
>                 URL: https://issues.apache.org/jira/browse/HDFS-9052
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Alex Ivanov
>
> CDH 5.0.5 upgraded from CDH 5.0.0 (Hadoop 2.3)
> Upon deleting a snapshot, we run into the following assertion error. The 
> scenario is as follows:
> 1. We have a program that deletes snapshots in reverse chronological order.
> 2. The program deletes a couple of hundred snapshots successfully but runs 
> into the following exception:
> java.lang.AssertionError: Element already exists: 
> element=useraction.log.crypto, DELETED=[useraction.log.crypto]
> 3. There seems to be an issue with that snapshot: a file that normally gets 
> overwritten in every snapshot is added to the SnapshotDiff delete queue 
> twice.
> 4. Once deleteSnapshot is run on the problematic snapshot and the Namenode 
> is restarted, the Namenode cannot start again until the transaction is 
> removed from the EditLog.
> 5. Sometimes the bad snapshot can be deleted, but the prior snapshot seems 
> to "inherit" the same issue.
> 6. The error below is from the Namenode starting up, when the 
> DELETE_SNAPSHOT transaction is replayed from the EditLog.
> 2015-09-01 22:59:59,140 INFO  [IPC Server handler 0 on 8022] BlockStateChange 
> (BlockManager.java:logAddStoredBlock(2342)) - BLOCK* addStoredBlock: blockMap 
> updated: 10.52.209.77:1004 is added to 
> blk_1080833995_7093259{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, 
> replicas=[ReplicaUnderConstruction[[DISK]DS-16de62e5-f6e2-4ea7-aad9-f8567bded7d7:NORMAL|FINALIZED]]}
>  size 0
> 2015-09-01 22:59:59,140 INFO  [IPC Server handler 0 on 8022] BlockStateChange 
> (BlockManager.java:logAddStoredBlock(2342)) - BLOCK* addStoredBlock: blockMap 
> updated: 10.52.209.77:1004 is added to 
> blk_1080833996_7093260{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, 
> replicas=[ReplicaUnderConstruction[[DISK]DS-1def2b07-d87f-49dd-b14f-ef230342088d:NORMAL|FINALIZED]]}
>  size 0
> 2015-09-01 22:59:59,141 ERROR [IPC Server handler 0 on 8022] 
> namenode.FSEditLogLoader (FSEditLogLoader.java:loadEditRecords(232)) - 
> Encountered exception on operation DeleteSnapshotOp 
> [snapshotRoot=/data/tenants/pdx-svt.baseline84/wddata, 
> snapshotName=s2015022614_maintainer_soft_del, 
> RpcClientId=7942c957-a7cf-44c1-880d-6eea690e1b19, RpcCallId=1]
> java.lang.AssertionError: Element already exists: 
> element=useraction.log.crypto, DELETED=[useraction.log.crypto]
>         at org.apache.hadoop.hdfs.util.Diff.insert(Diff.java:193)
>         at org.apache.hadoop.hdfs.util.Diff.delete(Diff.java:239)
>         at org.apache.hadoop.hdfs.util.Diff.combinePosterior(Diff.java:462)
>         at 
> org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff$2.initChildren(DirectoryWithSnapshotFeature.java:293)
>         at 
> org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff$2.iterator(DirectoryWithSnapshotFeature.java:303)
>         at 
> org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDeletedINode(DirectoryWithSnapshotFeature.java:531)
>         at 
> org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDirectory(DirectoryWithSnapshotFeature.java:823)
>         at 
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:714)
>         at 
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtreeRecursively(INodeDirectory.java:684)
>         at 
> org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDirectory(DirectoryWithSnapshotFeature.java:830)
>         at 
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:714)
>         at 
> org.apache.hadoop.hdfs.server.namenode.snapshot.INodeDirectorySnapshottable.removeSnapshot(INodeDirectorySnapshottable.java:341)
>         at 
> org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotManager.deleteSnapshot(SnapshotManager.java:238)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:667)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:224)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:133)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:802)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:783)


