[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220617#comment-14220617 ]
Vinayakumar B edited comment on HDFS-4882 at 11/21/14 7:16 AM:
---------------------------------------------------------------
Hi [~wuzesheng], [~raviprak] and [~yzhangal],

Regarding the description: were you able to reproduce the infinite loop in checkLeases(), either on a real cluster or using debug points in a test? When I tried to reproduce it, I was able to get the block states COMMITTED and COMPLETE for the penultimate and last blocks respectively, but not an infinite loop in checkLeases().

After the client shutdown, when checkLeases() triggers, it starts block recovery for the last block. This last block recovery will indeed succeed, but {{commitBlockSynchronization()}} then fails while closing the file due to the illegal state of the penultimate block.
{noformat}
java.lang.IllegalStateException: Failed to finalize INodeFile hardLeaseRecoveryFuzzy since blocks[0] is non-complete, where blocks=[blk_1073741825_1003{UCState=COMMITTED, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-099f5e9f-cc0b-4c63-a823-7640471d08e2:NORMAL:127.0.0.1:57247|RBW]]}, blk_1073741828_1007].
	at com.google.common.base.Preconditions.checkState(Preconditions.java:172)
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.assertAllBlocksComplete(INodeFile.java:214)
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.toCompleteFile(INodeFile.java:201)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.finalizeINodeFileUnderConstruction(FSNamesystem.java:4650)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.closeFileCommitBlocks(FSNamesystem.java:4865)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:4829)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.commitBlockSynchronization(NameNodeRpcServer.java:739)
{noformat}
Since the {{commitBlockSynchronization()}} call removes the lease in {{finalizeINodeFileUnderConstruction()}} before trying to close the file, {{checkLeases()}} will not go into an infinite loop. It may loop only until the {{commitBlockSynchronization()}} call comes in after recovery of the last block.
{code}
  private void finalizeINodeFileUnderConstruction(String src,
      INodeFile pendingFile, int latestSnapshot) throws IOException,
      UnresolvedLinkException {
    assert hasWriteLock();

    FileUnderConstructionFeature uc = pendingFile.getFileUnderConstructionFeature();
    Preconditions.checkArgument(uc != null);
    leaseManager.removeLease(uc.getClientName(), src);
{code}
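To make the termination argument concrete, here is a minimal, simplified sketch of the loop shape {{checkLeases()}} has. It is illustrative only, with made-up names ({{LeaseCheckSketch}}, {{tryRecoverLastBlock()}}), not the actual {{LeaseManager}} source: the oldest expired lease keeps being selected from the sorted set until something removes it, and in the flow described above that removal is the {{removeLease()}} call made from {{finalizeINodeFileUnderConstruction()}} during {{commitBlockSynchronization()}}, even though the close itself then fails.
{code}
// Illustrative sketch only, not the real o.a.h.hdfs.server.namenode.LeaseManager.
// It shows why the loop can only spin until the lease is removed from the sorted set.
import java.util.SortedSet;
import java.util.TreeSet;

class LeaseCheckSketch {
  private final SortedSet<Lease> sortedLeases = new TreeSet<>();

  synchronized void checkLeases() {
    while (!sortedLeases.isEmpty()) {
      Lease oldest = sortedLeases.first();
      if (!oldest.expiredHardLimit()) {
        break;                       // leases are ordered by age; nothing else is expired
      }
      // Kick off last-block recovery for the oldest expired lease.
      tryRecoverLastBlock(oldest);
      // Shown inline to keep the sketch self-contained: in the real flow the removal
      // happens later, when commitBlockSynchronization() reaches
      // finalizeINodeFileUnderConstruction(). Once the lease is gone from the set,
      // the loop moves on and terminates.
      removeLease(oldest);
    }
  }

  void removeLease(Lease lease) { sortedLeases.remove(lease); }

  private void tryRecoverLastBlock(Lease lease) { /* start block recovery */ }

  static class Lease implements Comparable<Lease> {
    final long lastUpdate;
    Lease(long lastUpdate) { this.lastUpdate = lastUpdate; }
    boolean expiredHardLimit() { return true; /* illustrative */ }
    @Override public int compareTo(Lease o) { return Long.compare(lastUpdate, o.lastUpdate); }
  }
}
{code}
Here the removal is inlined so the sketch is runnable; in the real sequence it arrives asynchronously with the {{commitBlockSynchronization()}} RPC, which is why a bounded amount of looping before that call is still possible.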
> Namenode LeaseManager checkLeases() runs into infinite loop
> -----------------------------------------------------------
>
>                 Key: HDFS-4882
>                 URL: https://issues.apache.org/jira/browse/HDFS-4882
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client, namenode
>    Affects Versions: 2.0.0-alpha, 2.5.1
>            Reporter: Zesheng Wu
>            Assignee: Ravi Prakash
>            Priority: Critical
>         Attachments: 4882.1.patch, 4882.patch, 4882.patch, HDFS-4882.1.patch, HDFS-4882.2.patch, HDFS-4882.3.patch, HDFS-4882.4.patch, HDFS-4882.5.patch, HDFS-4882.6.patch, HDFS-4882.patch
>
>
> Scenario:
> 1. cluster with 4 DNs
> 2. the size of the file to be written is a little more than one block
> 3. write the first block to 3 DNs, DN1->DN2->DN3
> 4. all the data packets of the first block are successfully acked and the client sets the pipeline stage to PIPELINE_CLOSE, but the last packet isn't sent out
> 5. DN2 and DN3 are down
> 6. client recovers the pipeline, but no new DN is added to the pipeline because the current pipeline stage is PIPELINE_CLOSE
> 7. client continues writing the last block, and tries to close the file after all the data is written
> 8. NN finds that the penultimate block doesn't have enough replicas (our dfs.namenode.replication.min=2), the client's close runs into an indefinite loop (HDFS-2936), and at the same time NN sets the last block's state to COMPLETE
> 9. shutdown the client
> 10. the file's lease exceeds the hard limit
> 11. LeaseManager realizes that and begins lease recovery by calling fsnamesystem.internalReleaseLease()
> 12. but the last block's state is COMPLETE, and this triggers the lease manager's infinite loop and prints massive logs like this:
> {noformat}
> 2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease [Lease. Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1] has expired hard limit
> 2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease. Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1], src=/user/h_wuzesheng/test.dat
> 2013-06-05,17:42:25,695 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: File = /user/h_wuzesheng/test.dat, block blk_-7028017402720175688_1202597, lastBLockState=COMPLETE
> 2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: Started block recovery for file /user/h_wuzesheng/test.dat lease [Lease. Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1]
> {noformat}
> (the 3rd log line is a debug log added by us)
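The scenario quoted above appears to boil down to {{checkLeases()}} re-selecting the same expired lease whenever {{internalReleaseLease()}} can neither close the file (the penultimate block is still COMMITTED) nor remove or renew the lease. Below is a small, hypothetical defensive variant of such a loop, using made-up names; it only illustrates the failure mode and one way a loop like this could give up instead of spinning while holding the namesystem write lock. It is not the patch attached to this issue.
{code}
// Hypothetical sketch, not the actual HDFS-4882 fix: remember which leases were already
// visited in this pass and bail out instead of re-selecting the same expired lease.
import java.util.HashSet;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

class SafeLeaseCheckSketch {
  private final SortedSet<Lease> sortedLeases = new TreeSet<>();

  synchronized void checkLeases() {
    Set<Lease> visitedThisPass = new HashSet<>();
    while (!sortedLeases.isEmpty()) {
      Lease oldest = sortedLeases.first();
      // Stop if nothing is expired, or if the oldest expired lease was already tried
      // in this pass: its release failed and it is still first in the sorted set,
      // so selecting it again would spin forever.
      if (!oldest.expiredHardLimit() || !visitedThisPass.add(oldest)) {
        break;
      }
      boolean released = internalReleaseLeaseSketch(oldest);
      if (released) {
        sortedLeases.remove(oldest);   // progress: the next iteration sees a new lease
      }
    }
  }

  // Stand-in for FSNamesystem.internalReleaseLease(): in the scenario above it starts
  // recovery but cannot release the lease because the file cannot be closed yet.
  private boolean internalReleaseLeaseSketch(Lease lease) { return false; }

  static class Lease implements Comparable<Lease> {
    final long lastUpdate;
    Lease(long lastUpdate) { this.lastUpdate = lastUpdate; }
    boolean expiredHardLimit() { return true; /* illustrative */ }
    @Override public int compareTo(Lease o) { return Long.compare(lastUpdate, o.lastUpdate); }
  }
}
{code}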
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)