[
https://issues.apache.org/jira/browse/HDFS-11225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wei-Chiu Chuang resolved HDFS-11225.
------------------------------------
Fix Version/s: 3.1.0
Resolution: Fixed
[~shashikant] completed the subtasks. I'm going to resolve this one to avoid
confusion.
For future reference, all the fixes were in 3.1.0 and above.
> NameNode crashed because deleteSnapshot held FSNamesystem lock too long
> -----------------------------------------------------------------------
>
> Key: HDFS-11225
> URL: https://issues.apache.org/jira/browse/HDFS-11225
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.4.0
> Environment: CDH5.8.2, HA
> Reporter: Wei-Chiu Chuang
> Assignee: Shashikant Banerjee
> Priority: Critical
> Labels: high-availability
> Fix For: 3.1.0
>
> Attachments: Snaphot_Deletion_Design_Proposal.pdf
>
>
> The deleteSnapshot operation is synchronous. In certain situations this
> operation may hold FSNamesystem lock for too long, bringing almost every
> NameNode operation to a halt.
> We have observed one incidence where it took so long that ZKFC believes the
> NameNode is down. All other IPC threads were waiting to acquire FSNamesystem
> lock. This specific deleteSnapshot took ~70 seconds. ZKFC has connection
> timeout of 45 seconds by default, and if all IPC threads wait for
> FSNamesystem lock and can't accept new incoming connection, ZKFC times out,
> advances epoch and NameNode will therefore lose its active NN role and then
> fail.
> Relevant log:
> {noformat}
> Thread 154 (IPC Server handler 86 on 8020):
> State: RUNNABLE
> Blocked count: 2753455
> Waited count: 89201773
> Stack:
>
> org.apache.hadoop.hdfs.server.namenode.INode$BlocksMapUpdateInfo.addDeleteBlock(INode.java:879)
>
> org.apache.hadoop.hdfs.server.namenode.INodeFile.destroyAndCollectBlocks(INodeFile.java:508)
>
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.destroyAndCollectBlocks(INodeDirectory.java:763)
>
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.destroyAndCollectBlocks(INodeDirectory.java:763)
>
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.destroyAndCollectBlocks(INodeDirectory.java:763)
>
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.destroyAndCollectBlocks(INodeDirectory.java:763)
>
> org.apache.hadoop.hdfs.server.namenode.INodeReference.destroyAndCollectBlocks(INodeReference.java:339)
>
> org.apache.hadoop.hdfs.server.namenode.INodeReference$WithName.destroyAndCollectBlocks(INodeReference.java:606)
>
> org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$ChildrenDiff.destroyDeletedList(DirectoryWithSnapshotFeature.java:119)
>
> org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$ChildrenDiff.access$400(DirectoryWithSnapshotFeature.java:61)
>
> org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff.destroyDiffAndCollectBlocks(DirectoryWithSnapshotFeature.java:319)
>
> org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff.destroyDiffAndCollectBlocks(DirectoryWithSnapshotFeature.java:167)
>
> org.apache.hadoop.hdfs.server.namenode.snapshot.AbstractINodeDiffList.deleteSnapshotDiff(AbstractINodeDiffList.java:83)
>
> org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDirectory(DirectoryWithSnapshotFeature.java:745)
>
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:776)
>
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtreeRecursively(INodeDirectory.java:747)
>
> org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDirectory(DirectoryWithSnapshotFeature.java:747)
>
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:776)
>
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtreeRecursively(INodeDirectory.java:747)
>
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:789)
> {noformat}
> After the ZKFC determined NameNode was down and advanced epoch, the NN
> finished deleting snapshot, and sent the edit to journal nodes, but it was
> rejected because epoch was updated. See the following stacktrace:
> {noformat}
> 10.0.16.21:8485: IPC's epoch 17 is less than the last promised epoch 18
> at
> org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:429)
> at
> org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:457)
> at
> org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:352)
> at
> org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:149)
> at
> org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
> at
> org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
> at
> org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
> at
> org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
> at
> org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)
> at
> org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
> at
> org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
> at
> org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
> at
> org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
> at
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
> at
> org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
> at
> org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
> at
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:641)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteSnapshot(FSNamesystem.java:8507)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.deleteSnapshot(NameNodeRpcServer.java:1469)
> at
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.deleteSnapshot(AuthorizationProviderProxyClientProtocol.java:717)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.deleteSnapshot(ClientNamenodeProtocolServerSideTranslatorPB.java:1061)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
> {noformat}
> Finally NameNode shut itself down because it had too many quorum errors.
> Setting priority to critical because it resulted in NameNode crash.
> We think deleteSnapshot should be made asynchronous. It should delete the
> root of snapshot directory, and then put the rest of work into an
> asynchronous thread. Credit: [~yzhangal] for proposing this idea.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]