[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222629#comment-14222629 ]

Colin Patrick McCabe commented on HDFS-4882:

Hi [~vinayrpet],

This change makes the code more robust because it avoids going into an infinite loop in the case where the lease is not removed from {{LeaseManager#leases}} during the loop body. The change doesn't harm anything: things are just as efficient as before, and in the unlikely case that we can't remove the lease, we log a warning message so we are aware of the problem.

Why don't we continue the discussion about the sequence of operations that could trigger this over on HDFS-7342, and commit this in the meantime to fix the immediate problem for [~raviprak]? I am +1; any objections to committing this tomorrow?

Also, [~yzhangal], can you comment on whether you have also observed this bug? Vinayakumar seems to be questioning whether this loop can occur, but I thought you had seen the LeaseManager thread loop in the field... I apologize if I'm putting words in your mouth, though.

Namenode LeaseManager checkLeases() runs into infinite loop
---
Key: HDFS-4882
URL: https://issues.apache.org/jira/browse/HDFS-4882
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs-client, namenode
Affects Versions: 2.0.0-alpha, 2.5.1
Reporter: Zesheng Wu
Assignee: Ravi Prakash
Priority: Critical
Attachments: 4882.1.patch, 4882.patch, 4882.patch, HDFS-4882.1.patch, HDFS-4882.2.patch, HDFS-4882.3.patch, HDFS-4882.4.patch, HDFS-4882.5.patch, HDFS-4882.6.patch, HDFS-4882.7.patch, HDFS-4882.patch

Scenario:
1. Cluster with 4 DNs.
2. The size of the file to be written is a little more than one block.
3. Write the first block to 3 DNs: DN1-DN2-DN3.
4. All the data packets of the first block are successfully acked and the client sets the pipeline stage to PIPELINE_CLOSE, but the last packet isn't sent out.
5. DN2 and DN3 go down.
6. The client recovers the pipeline, but no new DN is added to the pipeline because the current pipeline stage is PIPELINE_CLOSE.
7. The client continues writing the last block and tries to close the file after writing all the data.
8. The NN finds that the penultimate block doesn't have enough replicas (our dfs.namenode.replication.min=2), so the client's close runs into an indefinite loop (HDFS-2936); at the same time, the NN sets the last block's state to COMPLETE.
9. Shut down the client.
10. The file's lease exceeds the hard limit.
11. The LeaseManager notices this and begins lease recovery by calling fsnamesystem.internalReleaseLease().
12. But the last block's state is COMPLETE, which triggers the LeaseManager's infinite loop and prints massive logs like this:
{noformat}
2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease [Lease. Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1] has expired hard limit
2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease. Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1], src=/user/h_wuzesheng/test.dat
2013-06-05,17:42:25,695 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: File = /user/h_wuzesheng/test.dat, block blk_-7028017402720175688_1202597, lastBLockState=COMPLETE
2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: Started block recovery for file /user/h_wuzesheng/test.dat lease [Lease. Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1]
{noformat}
(The 3rd log line is a debug log added by us.)

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
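The hazard in steps 11-12 can be modeled with a minimal, hypothetical sketch (class and method names here are illustrative, not the actual HDFS code, which calls FSNamesystem#internalReleaseLease under the namesystem write lock): a monitor loop keyed on the first element of a sorted lease set spins forever if releasing that lease never removes it, unless it checks for progress.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Simplified, hypothetical model of the checkLeases() hazard discussed
// in this issue. If releasing the head lease fails to remove it, a loop
// keyed on sortedLeases.first() would re-examine the same lease forever;
// the progress check below breaks out instead of spinning.
public class CheckLeasesSketch {

    // Returns the number of iterations performed before the loop finished
    // or gave up. 'releaseSucceeds' simulates whether lease release
    // actually removes the lease from the set.
    public static int checkLeases(TreeSet<String> sortedLeases, boolean releaseSucceeds) {
        int iterations = 0;
        while (!sortedLeases.isEmpty()) {
            String leaseToCheck = sortedLeases.first();
            iterations++;
            if (releaseSucceeds) {
                sortedLeases.remove(leaseToCheck); // normal path: progress made
            }
            // Guard: if the lease we just tried to release is still the
            // first element, no progress was made -- warn and break.
            if (!sortedLeases.isEmpty() && sortedLeases.first().equals(leaseToCheck)) {
                System.err.println("Unable to release lease: " + leaseToCheck);
                break;
            }
        }
        return iterations;
    }

    public static void main(String[] args) {
        TreeSet<String> ok = new TreeSet<>(Arrays.asList("lease1", "lease2"));
        assert checkLeases(ok, true) == 2;     // both leases released normally

        TreeSet<String> stuck = new TreeSet<>(Arrays.asList("lease1", "lease2"));
        assert checkLeases(stuck, false) == 1; // gives up after one attempt
    }
}
```

Without the guard, the second call would never return, which is exactly the behavior reported in this jira.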
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222680#comment-14222680 ]

Vinayakumar B commented on HDFS-4882:

bq. This change makes the code more robust because it avoids going into an infinite loop in the case where the lease is not removed from LeaseManager#leases during the loop body. The change doesn't harm anything... things are just as efficient as before, and in the unlikely case that we can't remove the lease, we log a warning message so we are aware of the problem

Yes, you are right. Even though I don't see the possibility of an infinite loop in the existing trunk code, the changes made in the patch look good. I am +1 for the change.

{quote}Why don't we continue the discussion about the sequence of operations that could trigger this over on HDFS-7342? And commit this in the meantime to fix the immediate problem for Ravi Prakash. I am +1, any objections to committing this tomorrow?{quote}

Yes, let's continue this discussion in HDFS-7342.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222737#comment-14222737 ]

Yongjun Zhang commented on HDFS-4882:

Hi [~cmccabe] and [~vinayrpet],

Thanks for your comments.

{quote}can you comment on whether you have also observed this bug?{quote}

Yes, I did observe a similar infinite loop, and by studying the code, I concluded that the case I was looking at has exactly the same root cause as the one reported here. Please see the details described in my earlier comment at https://issues.apache.org/jira/browse/HDFS-4882?focusedCommentId=14213992&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14213992 and the several comments after it. In short, when the penultimate block is COMMITTED and the last block is COMPLETE, the following block of code is executed:

{code}
switch(lastBlockState) {
case COMPLETE:
  assert false : "Already checked that the last block is incomplete";
  break;
{code}

which returns to the LeaseManager without releasing the corresponding lease, which stays as the first element in {{sortedLeases}}. The LeaseManager keeps examining the first entry in {{sortedLeases}} again and again, while holding the FSNamesystem#writeLock, thus causing the infinite loop.

{quote}Yes, you are right. Even though I don't see the possibility of infinite loop by considering existing code in trunk, changes made in the patch looks pretty cool.{quote}

See above for the explanation of the infinite loop in the existing code.

{quote}Yes, lets continue this discussion in HDFS-7342{quote}

In HDFS-7342, Ravi worked out a test case to demonstrate the problem and I suggested a solution. Thanks in advance for your review and comments there; I hope we can converge on a solution soon. Avoiding the infinite loop is just part of the complete solution; the other part is getting the lease released, which is what HDFS-7342 tries to address. Thanks.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220680#comment-14220680 ]

Colin Patrick McCabe commented on HDFS-4882:

[~yzhangal]: I agree, we should fix the {{LOG.warn("Unable to release hard-limit expired lease ...")}} code a bit. I agree with the idea of replacing the {{return}} in the loop body with a {{break}}, and putting the LOG at the end. One way of looking at this is that we're trying to enforce the invariant that if there are any leases left at the end of the function, they should be unexpired leases.
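The invariant Colin describes can be sketched as follows. This is a hypothetical, simplified model, not the actual patch: lease state is reduced to a record, and the real code operates on LeaseManager.Lease objects under the namesystem lock. The point is the control flow: break instead of return, so the warning at the end always runs and the only leases left behind are unexpired ones.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the suggested structure: on a stuck expired
// lease, break out of the loop (rather than return) and warn once at
// the end, enforcing the invariant that any lease remaining in the
// queue after this pass is unexpired.
public class LeaseInvariantSketch {

    public record Lease(String holder, boolean expired, boolean releasable) {}

    // Releases expired leases at the head of the queue. Returns the
    // holder of a stuck lease if the invariant could not be enforced,
    // or null if every expired lease was released.
    public static String releaseExpired(Deque<Lease> sortedLeases) {
        String stuck = null;
        while (!sortedLeases.isEmpty() && sortedLeases.peekFirst().expired()) {
            Lease lease = sortedLeases.peekFirst();
            if (lease.releasable()) {
                sortedLeases.removeFirst(); // released: progress made
            } else {
                stuck = lease.holder();     // break, don't return, so the
                break;                      // warning below still executes
            }
        }
        if (stuck != null) {
            System.err.println("Unable to release hard-limit expired lease: " + stuck);
        }
        return stuck;
    }
}
```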
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220705#comment-14220705 ]

Vinayakumar B commented on HDFS-4882:

The test added in HDFS-7342 is not a real-world test: the block states are set from the test code itself, not produced by actual operations. {{lm.setLeasePeriod(0, 0);}} is what leads to the infinite loop in that test code.

Here is the sequence of operations that happens on lease recovery when the hard limit expires:
1. The last block will be in *UNDER_CONSTRUCTION or UNDER_RECOVERY* state initially, since the block is not full and the file is not closed yet.
2. Check the block states; if the last block is in UNDER_CONSTRUCTION/UNDER_RECOVERY state, then *re-assign* the lease to the HDFS_NameNode holder. Here the original expired lease is removed and a fresh HDFS_NameNode lease is added. Note that *this lease is not expired*.
3. Queue the block to be recovered on the primary datanode.
4. In the LeaseManager, {{sortedLeases}} will contain the HDFS_NameNode lease instead of the original expired lease. *So checkLeases() will not enter an infinite loop.*
5. Now {{commitBlockSynchronization()}}, after recovery of the last block, *will make the last block COMPLETE and remove the lease*, but it fails to close the file because the penultimate block does not meet min replication. Since the lease is already removed, checkLeases() will not enter an infinite loop.

Interestingly, I tried a restart of the Namenode after the above steps; then, while loading edits, all blocks except the last block are treated as COMPLETE. So the penultimate block was in COMPLETE state, and hence lease recovery was successful for the same file.
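The re-assignment in step 2 above can be sketched with a minimal, hypothetical model (the real code is in FSNamesystem, and the lease bookkeeping here is reduced to a path-to-holder map for illustration): replacing the expired client lease with a fresh NameNode-held lease is what lets checkLeases() make progress on this path.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of lease re-assignment during recovery: the
// expired client lease is replaced by a fresh lease held by the
// NameNode itself, so the expired entry no longer sits at the head
// of the lease set. The holder string mirrors the one the NameNode
// uses for itself.
public class LeaseReassignSketch {
    public static final String NN_HOLDER = "HDFS_NameNode";

    // path -> current lease holder
    private final Map<String, String> leaseByPath = new HashMap<>();

    public void addLease(String holder, String path) {
        leaseByPath.put(path, holder);
    }

    // Re-assign the lease on 'path' to the NameNode for recovery;
    // returns the previous (expired) holder, e.g. for logging.
    public String reassignLease(String path) {
        return leaseByPath.put(path, NN_HOLDER);
    }

    public String holderOf(String path) {
        return leaseByPath.get(path);
    }
}
```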
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221046#comment-14221046 ]

Yongjun Zhang commented on HDFS-4882:

Thanks [~cmccabe] and [~vinayrpet] for the review and further look.

Hi Vinayakumar,

Yes, the test case Ravi created in HDFS-7342 is a bit artificial, but it helps reproduce the infinite-loop case we are dealing with here. Setting the leasePeriod to 0 is just to trigger the lease recovery, and thus the infinite loop, in this test.

The steps you described appear to be a different scenario than what we are addressing in this jira. The case here is that the last block is already in COMPLETE state when the hard lease limit expires; your case is that the last block is not yet in COMPLETE state. When the last block is in *UNDER_CONSTRUCTION or UNDER_RECOVERY* state, it's handled differently in {{FSNamesystem#internalReleaseLease}}, as you described. I believe that's why you don't see the infinite loop.

What you described is an interesting case too, though. For the failure in step 5 you described, would you please create a different jira for your case? Thanks.

Hi [~raviprak], would you please help address my earlier comments that Colin agreed upon? Thanks.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221393#comment-14221393 ]

Hadoop QA commented on HDFS-4882:

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12682913/HDFS-4882.7.patch
against trunk revision c298a9a.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8801//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8801//console

This message is automatically generated.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221458#comment-14221458 ]

Yongjun Zhang commented on HDFS-4882:

Hi Ravi, many thanks for the new rev! It looks good to me.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219599#comment-14219599 ]

Yongjun Zhang commented on HDFS-4882:

Thanks a lot [~cmccabe], good comments!

Hi [~raviprak], many thanks for your earlier work. Without adding the warning message Colin suggested, there will still be a message reported every 2 seconds like "Recovering lease ..." when the lease can't be released, but adding the warning message would help us identify a similar issue more easily. This is a quite urgent issue for the case I was looking at; I'd really appreciate it if you could address Colin's comments at your earliest convenience. Thanks!
LeaseManager realizes that and begin to do lease recovery by call fsnamesystem.internalReleaseLease() 12. but the last block's state is COMPLETE, and this triggers lease manager's infinite loop and prints massive logs like this: {noformat} 2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease [Lease. Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1] has expired hard limit 2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease. Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1], src= /user/h_wuzesheng/test.dat 2013-06-05,17:42:25,695 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: File = /user/h_wuzesheng/test.dat, block blk_-7028017402720175688_1202597, lastBLockState=COMPLETE 2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: Started block recovery for file /user/h_wuzesheng/test.dat lease [Lease. Holder: DFSClient_NONM APREDUCE_-1252656407_1, pendingcreates: 1] {noformat} (the 3rd line log is a debug log added by us) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14219746#comment-14219746 ] Yongjun Zhang commented on HDFS-4882: - Hi Ravi, Thanks for responding quickly! I looked at the change; a minor suggestion:
{code}
470	if (leaseToCheck != sortedLeases.first()) {
471	  LOG.warn("Unable to release the oldest lease: " + sortedLeases.first());
472	}
{code}
the lease that cannot be released may not be the oldest in the loop. Maybe we can change the message to "Unable to release hard-limit expired lease:" to be more accurate. On the other hand, since all older leases are released already in this checkLeases() run, this unreleased lease will be the oldest lease 2 seconds later :-) Another thought: in each checkLeases() run there might be multiple leases that cannot be released, and with your latest change we only WARN about one of them here. If we replace the above code with a loop that checks not only the first, but also the second, the third, and so on until it hits leaseToCheck, then we are able to report all leases that were not released. Just a thought, not necessarily something we have to do in this jira, because having one is already a good indicator, and we do have the report every 2 seconds for all leases we attempt to release. Hi Colin: further comments? Thanks.
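Yongjun's "report all of them" idea can be sketched in isolation. The following is a hypothetical, simplified model, not the real LeaseManager API (leases are plain strings, and the class and method names are invented for illustration): any lease still sitting in sortedLeases ahead of the loop cursor leaseToCheck was examined but could not be released, so we can collect all of them instead of warning about only the first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class WarnAllStuckLeases {
    /**
     * Collect every lease that a checkLeases() pass already examined but
     * could not release: those are exactly the entries that still sit in
     * sortedLeases ahead of the loop cursor leaseToCheck.
     */
    public static List<String> stuckLeases(TreeSet<String> sortedLeases,
                                           String leaseToCheck) {
        List<String> stuck = new ArrayList<>();
        for (String lease : sortedLeases) {
            if (lease.equals(leaseToCheck)) {
                break; // leases from here on have not been examined yet
            }
            stuck.add(lease); // examined, but still present => not released
        }
        return stuck;
    }
}
```

With leases sorted oldest-first, everything between first() and the cursor would be reported in one pass.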
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14219919#comment-14219919 ] Ravi Prakash commented on HDFS-4882: Hi Yongjun! sortedLeases is modified while iterating through the loop. So sortedLeases.first() is necessarily the oldest lease, and if leaseToCheck != sortedLeases.first(), it means we are looking to release a lease younger than the oldest. I thought about logging all the leases which couldn't be released, but considering that we expect this to be a rare occurrence, I didn't see the cost-benefit in extra code which will probably never run.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14219936#comment-14219936 ] Yongjun Zhang commented on HDFS-4882: - Hi Ravi, I agree that we don't really need to WARN about all the leases. What I meant by "it may not be the oldest" is: in each run of checkLeases() (which happens every 2 seconds), the first lease examined in the loop (the real oldest in this run) can be released and removed from sortedLeases, and then we move on to the next item in sortedLeases. In that sense, an unreleased lease may not be the oldest among all the leases examined in this loop. Thanks.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14220013#comment-14220013 ] Hadoop QA commented on HDFS-4882: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12682699/HDFS-4882.5.patch against trunk revision a9a0cc3. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8789//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8789//console This message is automatically generated.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14220198#comment-14220198 ] Yongjun Zhang commented on HDFS-4882: - Hi Ravi, many thanks for addressing Colin's and my comments quickly! Hi [~cmccabe], the latest patch looks good to me. Would you please take a further look? Thanks a lot.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14220255#comment-14220255 ] Hadoop QA commented on HDFS-4882: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12682760/HDFS-4882.6.patch against trunk revision eb4045e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The test build failed in hadoop-hdfs-project/hadoop-hdfs. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8796//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8796//console This message is automatically generated.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14220438#comment-14220438 ] Yongjun Zhang commented on HDFS-4882: - Sorry Ravi, after thinking a bit more, I think there is still a hole here. If sortedLeases holds a single lease and it cannot be released, then the code below is never reached, and the WARN won't be printed out:
{code}
470	if (leaseToCheck != sortedLeases.first()) {
471	  LOG.warn("Unable to release hard-limit expired lease: "
472	      + sortedLeases.first());
473	}
{code}
I think we should move this code to the very bottom of the method, just before it returns:
{code}
if (leaseToCheck != sortedLeases.first()) {
  LOG.warn("Unable to release hard-limit expired lease: "
      + sortedLeases.first());
}
return needSync;
{code}
If sortedLeases is empty, leaseToCheck will be null, so we are good; if sortedLeases has only one item which can't be released, leaseToCheck will be assigned null at the end of the loop, so we are good too. The other cases should be covered as well. Right? Thanks.
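The restructuring proposed in the comment above can be simulated outside HDFS. The sketch below is a toy model under stated assumptions (leases are plain strings, `canRelease` stands in for internalReleaseLease(), the cursor-advance logic is simplified, and the class name is invented), not the real LeaseManager code; it only shows why placing the warning after the loop also covers the single-stuck-lease case.

```java
import java.util.TreeSet;
import java.util.function.Predicate;

public class CheckLeasesSketch {
    /**
     * Toy model of checkLeases() with the proposed fix applied: the loop
     * advances a cursor through the expired leases, and the warning sits
     * AFTER the loop, so it fires even when sortedLeases holds exactly one
     * lease that cannot be released. Returns the warning text, or null.
     */
    public static String checkLeases(TreeSet<String> sortedLeases,
                                     Predicate<String> canRelease) {
        String leaseToCheck = sortedLeases.isEmpty() ? null : sortedLeases.first();
        while (leaseToCheck != null) {
            if (canRelease.test(leaseToCheck)) {
                sortedLeases.remove(leaseToCheck); // released: drop it
            }
            // advance past the lease we just examined, released or not
            leaseToCheck = sortedLeases.higher(leaseToCheck);
        }
        // Warning moved to the bottom of the method: anything left over here
        // was examined during this pass but could not be released.
        return sortedLeases.isEmpty()
            ? null
            : "Unable to release hard-limit expired lease: " + sortedLeases.first();
    }
}
```

With a single unreleasable lease, the cursor ends at null while the set stays non-empty, so the post-loop check still reports it, which is the hole the in-loop check missed.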
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14220473#comment-14220473 ] Yongjun Zhang commented on HDFS-4882: - And replace the {{return needSync;}} inside the loop with {{break;}}.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14220617#comment-14220617 ] Vinayakumar B commented on HDFS-4882: - Hi [~wuzesheng], [~raviprak] and [~yzhangal], According to description, Have you able to reproduce the infinite loop in checkLeases() ? either in real cluster or using debug points in test ? Because, when I tried to reproduce, I was able to get the BlockStates, COMMITTED and COMPLETED for penultimate and last blocks respectively, But not infinite loop of checkLeases() After client shutdown, when the checkLeases() triggers, it will start block recovery for the last block. This last block recovery will indeed Succeed, but fails in {{commitBlockSynchronization()}} while closing the file due to illegal state of penultimate block. {noformat}java.lang.IllegalStateException: Failed to finalize INodeFile hardLeaseRecoveryFuzzy since blocks[0] is non-complete, where blocks=[blk_1073741825_1003{UCState=COMMITTED, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-099f5e9f-cc0b-4c63-a823-7640471d08e2:NORMAL:127.0.0.1:57247|RBW]]}, blk_1073741828_1007]. 
at com.google.common.base.Preconditions.checkState(Preconditions.java:172) at org.apache.hadoop.hdfs.server.namenode.INodeFile.assertAllBlocksComplete(INodeFile.java:214) at org.apache.hadoop.hdfs.server.namenode.INodeFile.toCompleteFile(INodeFile.java:201) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.finalizeINodeFileUnderConstruction(FSNamesystem.java:4650) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.closeFileCommitBlocks(FSNamesystem.java:4865) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:4829) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.commitBlockSynchronization(NameNodeRpcServer.java:739){noformat} But, since {{commitBlockSynchronization()}} removes the lease before trying to close the file in {{finalizeINodeFileUnderConstruction()}}, {{checkLeases()}} will not go into infinite loop. May be it will loop till {{commitBlockSynchronization()}} call comes after recovery of last block. {code} private void finalizeINodeFileUnderConstruction(String src, INodeFile pendingFile, int latestSnapshot) throws IOException, UnresolvedLinkException { assert hasWriteLock(); FileUnderConstructionFeature uc = pendingFile.getFileUnderConstructionFeature(); Preconditions.checkArgument(uc != null); leaseManager.removeLease(uc.getClientName(), src);{code} Namenode LeaseManager checkLeases() runs into infinite loop --- Key: HDFS-4882 URL: https://issues.apache.org/jira/browse/HDFS-4882 Project: Hadoop HDFS Issue Type: Bug Components: hdfs-client, namenode Affects Versions: 2.0.0-alpha, 2.5.1 Reporter: Zesheng Wu Assignee: Ravi Prakash Priority: Critical Attachments: 4882.1.patch, 4882.patch, 4882.patch, HDFS-4882.1.patch, HDFS-4882.2.patch, HDFS-4882.3.patch, HDFS-4882.4.patch, HDFS-4882.5.patch, HDFS-4882.6.patch, HDFS-4882.patch Scenario: 1. cluster with 4 DNs 2. the size of the file to be written is a little more than one block 3. write the first block to 3 DNs, DN1-DN2-DN3 4. 
all the data packets of the first block are successfully acked and the client sets the pipeline stage to PIPELINE_CLOSE, but the last packet isn't sent out
5. DN2 and DN3 are down
6. client recovers the pipeline, but no new DN is added to the pipeline because the current pipeline stage is PIPELINE_CLOSE
7. client continues writing the last block, and tries to close the file after all the data is written
8. NN finds that the penultimate block doesn't have enough replicas (our dfs.namenode.replication.min=2), the client's close runs into an indefinite loop (HDFS-2936), and at the same time NN sets the last block's state to COMPLETE
9. shut down the client
10. the file's lease exceeds the hard limit
11. LeaseManager realizes that and begins lease recovery by calling fsnamesystem.internalReleaseLease()
12. but the last block's state is COMPLETE, and this triggers the lease manager's infinite loop, which prints massive logs like this:
{noformat}
2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease [Lease. Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1] has expired hard limit
2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease. Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1], src=/user/h_wuzesheng/test.dat
2013-06-05,17:42:25,695 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: File = /user/h_wuzesheng/test.dat, block blk_-7028017402720175688_1202597, lastBLockState=COMPLETE
2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: Started block recovery for file /user/h_wuzesheng/test.dat lease [Lease. Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1]
{noformat}
(the 3rd log line is a debug log added by us)
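For reference, the shape of the bug described above can be modeled in a few lines. The sketch below is a simplified stand-alone model, not the actual LeaseManager source: it shows how a checkLeases()-style loop that always re-reads the first lease by sort order spins forever once the release step can neither recover nor remove that lease. The iteration cap exists only so the demo terminates; the real buggy loop has no such cap.

```java
import java.util.TreeSet;

// Simplified model (not the HDFS source) of LeaseManager#checkLeases:
// the loop re-reads the first lease by sort order and assumes that
// releasing it removes it from the set. When internalReleaseLease()
// leaves the lease in place (e.g. the last block is already COMPLETE),
// the same lease is fetched forever while the namesystem write lock
// is held.
public class CheckLeasesModel {

    // Stand-in for FSNamesystem#internalReleaseLease: returns false to
    // model the COMMITTED-penultimate / COMPLETE-last case where the
    // lease can be neither recovered nor removed.
    static boolean internalReleaseLease(String lease) {
        return false;
    }

    // Returns how many iterations ran before hitting the cap; the real
    // buggy loop would never terminate on its own.
    static int buggyCheckLeases(TreeSet<String> sortedLeases, int cap) {
        int iterations = 0;
        while (!sortedLeases.isEmpty() && iterations < cap) {
            String oldest = sortedLeases.first();   // same lease every time
            if (internalReleaseLease(oldest)) {
                sortedLeases.remove(oldest);
            }
            // Bug: when release fails the lease stays, and the next pass
            // picks the very same lease again.
            iterations++;
        }
        return iterations;
    }

    public static void main(String[] args) {
        TreeSet<String> leases = new TreeSet<>();
        leases.add("DFSClient_NONMAPREDUCE_-1252656407_1");
        int n = buggyCheckLeases(leases, 1000);
        System.out.println("iterations=" + n + " leasesLeft=" + leases.size());
    }
}
```

Running the model hits the cap with the lease still in the set, which is exactly the "massive logs, no progress" behavior in the scenario.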
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14220630#comment-14220630 ]

Yongjun Zhang commented on HDFS-4882:
-------------------------------------

Hi [~vinayrpet],

Thanks a lot for looking into this. Ravi wrote a testcase in HDFS-7342 that reproduces the infinite loop, and I'm seeing the same kind of infinite loop in a case here too.

For your case, did you check whether your run exercised this code in {{FSNamesystem#internalReleaseLease()}}?

{code}
    switch(lastBlockState) {
    case COMPLETE:
      assert false : "Already checked that the last block is incomplete";
      break;
{code}

If LeaseManager finds that the hard limit for the file lease has expired, and the above code is exercised (meaning penultimateBlock is COMMITTED and lastBlock is COMPLETE), it causes an infinite loop in {{LeaseManager#checkLeases}}. Thanks.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
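The no-progress case Yongjun describes can be sketched as follows. This is a hypothetical stand-alone model, not the HDFS source: it shows why the quoted COMPLETE branch starves checkLeases(), since with JVM assertions disabled (the production default) the `assert false` is a no-op and the method returns without the lease being released or reassigned.

```java
// Simplified model (not the HDFS source) of the switch quoted from
// FSNamesystem#internalReleaseLease. The COMPLETE case corresponds to
// the branch that, in production, silently does nothing, which is the
// one state checkLeases() cannot make progress on.
public class ReleaseLeaseModel {
    enum BlockUCState { UNDER_CONSTRUCTION, UNDER_RECOVERY, COMMITTED, COMPLETE }

    // Returns true when the caller can make progress on the lease
    // (recovery started or lease handed off); false models starvation.
    static boolean internalReleaseLease(BlockUCState lastBlockState) {
        switch (lastBlockState) {
        case COMPLETE:
            // The real code has `assert false : "Already checked that the
            // last block is incomplete";` here. With assertions disabled
            // that is a no-op, so the method falls out having released
            // nothing.
            return false;
        case UNDER_CONSTRUCTION:
        case UNDER_RECOVERY:
        case COMMITTED:
        default:
            // Model: pretend block recovery was started successfully.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("COMPLETE -> progress=" + internalReleaseLease(BlockUCState.COMPLETE));
        System.out.println("COMMITTED -> progress=" + internalReleaseLease(BlockUCState.COMMITTED));
    }
}
```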
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14218090#comment-14218090 ]

Yongjun Zhang commented on HDFS-4882:
-------------------------------------

For other folks' info: since we are dedicating this jira to the infinite-loop issue, Ravi and I are continuing the discussion about the solution for the lease recovery issue in HDFS-7342. Thanks.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14219030#comment-14219030 ]

Colin Patrick McCabe commented on HDFS-4882:
--------------------------------------------

bq. For other folks' info, since we are dedicating this jira to address the infinite loop issue, Ravi and I are continuing the discussion about the solution for the lease recovery issue in HDFS-7342 now. Thanks.

That makes sense to me.

{code}
131    }
132    LOG.info("Number of blocks under construction: " + numUCBlocks);
133    return numUCBlocks;
{code}

The indentation is off here.

Can we somehow issue a warning if these leases are lingering? The current patch makes the {{LeaseManager}} silently accept the extra leases, which I don't think is quite what we want. Perhaps right before {{return needSync}} we could insert a check that the lease we were last considering was the first lease by sort order, and {{LOG.warn}} a message if it wasn't. That way we would know that something was messed up when leases lingered forever.
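Colin's suggestion above can be sketched roughly as below. This is a hypothetical sketch, not the committed patch: do a single bounded pass over a snapshot of the sorted leases (so one call can never spin), and right before returning, warn when the lease we last considered is still sitting in the set, i.e. it could neither be recovered nor removed and will linger until the next check. The names `tryRelease` and `lastWarning` are stand-ins for illustration.

```java
import java.util.TreeSet;

// Hypothetical sketch of a lingering-lease warning in a
// checkLeases()-style method; not the actual HDFS patch.
public class LingeringLeaseWarning {
    static String lastWarning = null;   // stand-in for a LOG.warn call

    // Stand-in for FSNamesystem#internalReleaseLease; false models the
    // COMMITTED-penultimate / COMPLETE-last case that cannot be released.
    static boolean tryRelease(String lease) {
        return false;
    }

    static boolean checkLeases(TreeSet<String> sortedLeases) {
        boolean needSync = false;
        String lastConsidered = null;
        // Iterating a snapshot guarantees each lease is visited at most
        // once per call, so the call itself can never spin forever.
        for (String lease : new TreeSet<>(sortedLeases)) {
            lastConsidered = lease;
            if (tryRelease(lease)) {
                sortedLeases.remove(lease);
                needSync = true;
            }
        }
        // Right before `return needSync`: if the lease we last considered
        // survived the pass, it is lingering, so make that visible.
        if (lastConsidered != null && sortedLeases.contains(lastConsidered)) {
            lastWarning = "Lease " + lastConsidered
                + " was not released; it will linger until the next check";
        }
        return needSync;
    }

    public static void main(String[] args) {
        TreeSet<String> leases = new TreeSet<>();
        leases.add("DFSClient_NONMAPREDUCE_-1252656407_1");
        boolean needSync = checkLeases(leases);
        System.out.println("needSync=" + needSync + " warned=" + (lastWarning != null));
    }
}
```

The key property is that silence is no longer possible: either the lease is released, or a warning records that it lingered.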
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14216573#comment-14216573 ]

Ravi Prakash commented on HDFS-4882:
------------------------------------

Hi Yongjun! Thanks for your detailed analysis. I am trying to create a unit test in which the penultimate block would be COMMITTED and the last block COMPLETE; I intend to upload that patch to HDFS-7342. I think we should eliminate the possibility of the infinite loop in LeaseManager regardless. Which JIRAs we use is now irrelevant, since this will likely not make 2.6.0.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14216738#comment-14216738 ]

Yongjun Zhang commented on HDFS-4882:
-------------------------------------

Hi Ravi,

I think it's ok to separate this into two issues. The latest patch for HDFS-4882 does resolve the infinite-loop one (except that the lease may still be hanging around). When you work on HDFS-7342, if you create a test case where the penultimate block is COMMITTED and the last block COMPLETE, I would encourage you to also create cases for the other state combinations listed in the table in my last comment. Appreciate it very much!

Hi [~cmccabe], thanks for reviewing Ravi's patch here; I think it's in good shape to solve the infinite loop. Would you please help look at it and commit it? Other folks are welcome to comment. Thanks.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14216811#comment-14216811 ]

Ravi Prakash commented on HDFS-4882:
------------------------------------

Thanks Yongjun! That's a great suggestion. I'll do that.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14217161#comment-14217161 ]

Ravi Prakash commented on HDFS-4882:
------------------------------------

I have uploaded a unit test to HDFS-7342.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14217198#comment-14217198 ]

Yongjun Zhang commented on HDFS-4882:
-------------------------------------

Many thanks Ravi! I will try it out.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14217289#comment-14217289 ]

Yongjun Zhang commented on HDFS-4882:
-------------------------------------

Hi Ravi,

Great work, I can see your test reproduced the issue! While trying to see whether my proposed fix at https://issues.apache.org/jira/browse/HDFS-4882?focusedCommentId=14215703page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14215703 solves the problem, I found that

{code}
finalizeINodeFileUnderConstruction(src, pendingFile, iip.getLatestSnapshotId());
{code}

does a precondition check that all blocks should be complete. I wonder whether this branch of the code has ever been exercised. Will poke a bit more.
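The precondition Yongjun mentions is the same one behind the IllegalStateException in Vinayakumar's stack trace earlier in the thread. The sketch below is a simplified stand-alone model, not the HDFS source: finalizing a file whose penultimate block is still COMMITTED trips the all-blocks-complete check before the file can be closed.

```java
// Simplified model (not the HDFS source) of the assertAllBlocksComplete /
// toCompleteFile precondition: every block must be COMPLETE before the
// under-construction file can be finalized.
public class FinalizePreconditionModel {
    enum BlockUCState { COMMITTED, COMPLETE }

    static void assertAllBlocksComplete(BlockUCState[] blocks) {
        for (int i = 0; i < blocks.length; i++) {
            if (blocks[i] != BlockUCState.COMPLETE) {
                // Mirrors the Preconditions.checkState failure in the
                // stack trace quoted earlier in this thread.
                throw new IllegalStateException(
                    "Failed to finalize INodeFile since blocks[" + i + "] is non-complete");
            }
        }
    }

    public static void main(String[] args) {
        // Penultimate block COMMITTED, last block COMPLETE: the case
        // discussed throughout this jira.
        BlockUCState[] blocks = { BlockUCState.COMMITTED, BlockUCState.COMPLETE };
        try {
            assertAllBlocksComplete(blocks);
            System.out.println("finalized");
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```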
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14214398#comment-14214398 ]

Colin Patrick McCabe commented on HDFS-4882:
--------------------------------------------

[~yzhangal]: backing up a little bit, the overall problem here seems to be that we are unwilling to recover an overdue lease when the replication is too low. Right? And then we get into this infinite loop, because {{checkLeases}} assumes that all expired leases will be recovered rather than lingering around.

Is there any reason we can't simply recover the lease anyway, even though the minimum replication has not been met? There are a lot of cases where we just can't get to the minimum replication (e.g. a 1-node cluster). I don't see a lot of value in letting these leases linger forever. Our lease expiry period is REALLY long, so if we can't replicate in that period, maybe it's time to throw in the towel. Am I missing something here? What do you guys think?
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214879#comment-14214879 ] Yongjun Zhang commented on HDFS-4882:

Hi [~cmccabe],

What I speculated was: if the last block does not have minimal replication (usually 1), and the lease is recovered, and then someone else tries to append to the same file, where does the appended data go?

As I commented earlier, I agree that if we keep the lease and the minimal replication is never reached, this is still an infinite loop of sorts: checkLeases runs every 2 seconds, and in each run the same lease cannot be recovered because minimal replication is not reached. But that loop is different from the infinite loop we are dealing with in this jira, which is a real infinite loop within a single checkLeases call, one that would hold FSNamesystem.writeLock() indefinitely and thus cause much bigger trouble (the former is not a real infinite loop in this sense).

As for the case where minimal replication can never be reached even when it is 1: that means data loss. Is this common? Thanks.
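The distinction drawn in the comment above can be made concrete with a toy sketch (hypothetical names and structure, not the real HDFS code): the benign loop reacquires the lock on each ~2-second monitor wakeup, while the bug in this jira never leaves a single checkLeases() call.

```java
import java.util.concurrent.locks.ReentrantLock;

// Toy contrast between the two loops discussed above (hypothetical names,
// not the real HDFS code).
public class LoopContrast {
  /** Models the monitor thread: one bounded pass per wakeup. Returns how
   *  many times the write lock was acquired across `wakeups` cycles. */
  public static int softRetry(int wakeups) {
    ReentrantLock writeLock = new ReentrantLock();
    int acquisitions = 0;
    for (int i = 0; i < wakeups; i++) {   // the monitor wakes every ~2s
      writeLock.lock();
      try {
        acquisitions++;
        // internalReleaseLease() fails here; the pass still returns, so the
        // lease simply stays around for the next wakeup.
      } finally {
        writeLock.unlock();               // other namenode RPCs can run now
      }
    }
    return acquisitions;
  }

  /** Models the bug: one call that keeps re-examining the same lease.
   *  Bounded here only so the sketch terminates; the real bug is unbounded. */
  public static boolean hardLoopMakesProgress(int bound) {
    boolean leaseRemoved = false;
    int spins = 0;
    while (!leaseRemoved && spins < bound) {
      spins++;                            // same first lease, every time
    }
    return leaseRemoved;                  // false: only the bound stopped us
  }
}
```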
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214952#comment-14214952 ] Yongjun Zhang commented on HDFS-4882:

Hi Ravi,

Just saw your comment:
{quote}
In scenario #2 (and in fact in every scenario I traced), shouldn't there be the warning message logged? I did NOT see this message
{quote}
I described scenario #2 in my earlier comment for completeness. In the infinite loop case reported here and the one I looked at, the root cause is the one described in my last comment. Hope that makes sense to you. Thanks.
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215071#comment-14215071 ] Ravi Prakash commented on HDFS-4882:

Aah! I see now. Because assertions are not enabled, I don't see the AssertionError. That makes sense. The log excerpt is here: https://issues.apache.org/jira/browse/HDFS-7342?focusedCommentId=14209022&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14209022
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215089#comment-14215089 ] Jing Zhao commented on HDFS-4882:

bq. Is there any reason we can't simply recover the lease anyway, even though the minimal replication has not been met? There are a lot of cases where we just can't get to the minimum replication (i.e. 1-node cluster, etc.). I don't see a lot of value in letting these leases linger forever.

I agree with [~cmccabe] here. I do not think we should block anything just because the penultimate block cannot reach the minimum replication. To me this is similar to the scenario where one block of the file is missing/corrupted. As for the append case: since we do not append data directly to the penultimate block, I think it should be ok.
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215345#comment-14215345 ] Yongjun Zhang commented on HDFS-4882:

Hi Jing,

Thanks for your comments. There are multiple issues here:
1. The bug in the loop of the checkLeases method: it doesn't move on to the next entry.
2. The case where the penultimate block is COMMITTED and the last block is COMPLETE.
3. The case where the penultimate block is COMPLETE and the last block is COMMITTED.

Observations about these issues:

Issue 1 is addressed by the patch uploaded so far.

Issue 3 is handled by the existing code by throwing an exception (scenario #2 in my earlier comments), after which the LeaseManager goes ahead and recovers the lease right away, even though the last block may not have a minimal replica. I raised a concern about this earlier: what happens if the minimal replication is 1, meaning the last block has no replica at all, and another client acquires the lease and tries to APPEND to the file? I mean, where (datanode, replica position) should the client write to?

For issue 2, we can move on and let the penultimate block finish by itself; if it finishes meeting the minimal replication requirement, that's nice, and if not, it is considered a corrupted block, as Colin/Jing suggested.

So it boils down to issue 3. Did I misunderstand it? Is this a real issue? Thanks.
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215445#comment-14215445 ] Yongjun Zhang commented on HDFS-4882:

Hi Ravi,

I looked at the log you posted at https://issues.apache.org/jira/browse/HDFS-7342?focusedCommentId=14209022&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14209022; the entries appear to be very much the same as what I observed in the case I was looking at:
{code}
2014-10-28 02:28:17,568 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: [Lease. Holder: DFSClient_attempt_X_Y_r_T_U_V_W, pendingcreates: 1] has expired hard limit
2014-10-28 02:28:17,569 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease. Holder: DFSClient_attempt_X_Y_r_T_U_V_W, pendingcreates: 1], src=FILENAME
2014-10-28 02:28:17,569 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: [Lease. Holder: DFSClient_attempt_X_Y_r_T_U_V_W, pendingcreates: 1] has expired hard limit
2014-10-28 02:28:17,569 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease. Holder: DFSClient_attempt_X_Y_r_T_U_V_W, pendingcreates: 1], src=FILENAME
..
{code}

Hi [~jingzhao],
{quote}
I do not think we should block anything just because the penultimate block cannot reach the minimum replication.
{quote}
I agree with the quoted statement for the penultimate block, because no one is going to write to the penultimate block. However, when I studied the implementation, I had the question raised as issue 3 in my last comment, about the last block. Basically, if the last block doesn't have a replica reported from a DN yet, and the lease is recovered immediately, how do we handle a subsequent APPEND write from another client to this file? It seems to me that we should at least wait some iterations for more block reports before recovering the lease, instead of recovering the lease immediately.
I'd appreciate it if you could comment on that. Thanks.
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215486#comment-14215486 ] Jing Zhao commented on HDFS-4882:

For scenario 3, I think we can keep the current behavior when handling append. Our current default replace-datanode-on-failure policy already considers data durability while appending. Also, the LeaseManager in the NN waits for the lease to expire before triggering the recovery.
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215540#comment-14215540 ] Yongjun Zhang commented on HDFS-4882:

Thanks Jing. Sure, we can keep the old behaviour for scenario #3. However, I am just not sure where the data will be written when a new client tries to append to a file whose last block doesn't have even one replica. That seems like a bug to me, though this is just my speculation; we could address it in a different jira.

That said, for the current patch, we still need a fix here:
{code}
switch (lastBlockState) {
case COMPLETE:
  assert false : "Already checked that the last block is incomplete";
  break;
{code}
The above code does {{assert false : "Already checked that the last block is incomplete"}}, which is wrong. It should not assert false, because this is a valid case when the penultimate block is COMMITTED and the last block is COMPLETE. Right? Thanks.
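The point about the {{assert false}} can be sketched as follows — a hypothetical, heavily simplified stand-in for the tail-block switch in FSNamesystem#internalReleaseLease (names and structure are illustrative, not the committed patch), in which COMPLETE simply shares the COMMITTED handling instead of asserting:

```java
// Hypothetical stand-in for the tail-block switch in
// FSNamesystem#internalReleaseLease; COMPLETE falls through to the
// COMMITTED handling because a COMMITTED penultimate block with a
// COMPLETE last block is a reachable state.
public class LastBlockHandling {
  public enum BlockUCState { COMPLETE, COMMITTED, UNDER_CONSTRUCTION }

  /** Returns true if the file can be closed (and the lease released) now. */
  public static boolean canCloseNow(BlockUCState lastBlockState,
                                    boolean committedBlocksMinReplicated) {
    switch (lastBlockState) {
      case COMPLETE:    // previously: assert false : "Already checked ...";
      case COMMITTED:
        // Close only once committed blocks are minimally replicated;
        // otherwise the caller retries on a later checkLeases pass.
        return committedBlocksMinReplicated;
      default:
        // UNDER_CONSTRUCTION etc.: block recovery is started instead.
        return false;
    }
  }
}
```

Note that with assertions disabled (the JVM default), the original `assert false` is a no-op and the `break` silently skips the lease release, which is what makes the state linger.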
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215703#comment-14215703 ] Yongjun Zhang commented on HDFS-4882: - HI Jing, I gave some more thoughts on the block of code I quoted in my last comment, I think some additional change is still needed to ensure the lease is recovered for the case when penultimate block is COMMITTED and last block is COMPLETE (call it caseX below), in addition to the {{assert false}} I talked about in my last comment. Commented below. Hi Ravi, I did a review of your latest patch (v4), and have the following comments. Most important one is, even though the infinite loop is removed by the patch, the lease for caseX is still not released, if the penultimate block stays in COMMITTED state. We should release the lease here. * The change in LeaseManager looks good to me. There is an option not to make this change it if we do below. * For caseX, to make sure the lease is released, we need to do something like below {code} boolean penultimateBlockMinReplication = penultimateBlock == null ? true : blockManager.checkMinReplication(penultimateBlock); BlockUCState penultimateBlockState = penultimateBlock == null ? BlockUCState.COMPLETE: penultimateBlock.getBlockUCState(); String blockStateStr; .. case COMPLETE: case COMMITTED: blockStateStr = + (penultimateBlockState= + penultimateBlockState + lastBlockState=+lastBlockState + ); // Close file if committed blocks are minimally replicated if(penultimateBlockMinReplication blockManager.checkMinReplication(lastBlock)) { finalizeINodeFileUnderConstruction(src, pendingFile, iip.getLatestSnapshotId()); NameNode.stateChangeLog.warn(BLOCK* + internalReleaseLease: Committed blocks are minimally replicated, + lease removed, file closed + blockStateStr); return true; // closed! } // Cannot close file right now, since some blocks // are not yet minimally replicated. 
// This may potentially cause infinite loop in lease recovery // if there are no valid replicas on data-nodes. String message = DIR* NameSystem.internalReleaseLease: + Failed to release lease for file + src + . Committed blocks are waiting to be minimally replicated + blockStateStr + , Try again later.; NameNode.stateChangeLog.warn(message); throw new AlreadyBeingCreatedException(message); {code} Basically I suggest to let the two cases to share the same code, and included both block's state in the message to distinguish, this handles the different scenarios like below. || || BlockState || BlockState || BlockState ||BlockState || BlockState || BlockState || |penultimateBlock | COMPLETE |COMMITTED|COMMITTED|COMPLETE |COMMITTED|COMMITTED| |lastBlock|COMMITTED|COMMITTED | COMPLETE|COMMITTED|COMMITTED | COMPLETE| |minReplicaSatisfied|Yes|Yes|Yes|No|No|No| | Solution|CloseFile+ReleaseLease|CloseFile+ReleaseLease|CloseFile+ReleaseLease|ReleaseLease|ReleaseLease|ReleaseLease| Do you and other folks think my proposal makes sense? Thanks a lot. Namenode LeaseManager checkLeases() runs into infinite loop --- Key: HDFS-4882 URL: https://issues.apache.org/jira/browse/HDFS-4882 Project: Hadoop HDFS Issue Type: Bug Components: hdfs-client, namenode Affects Versions: 2.0.0-alpha, 2.5.1 Reporter: Zesheng Wu Assignee: Ravi Prakash Priority: Critical Attachments: 4882.1.patch, 4882.patch, 4882.patch, HDFS-4882.1.patch, HDFS-4882.2.patch, HDFS-4882.3.patch, HDFS-4882.4.patch, HDFS-4882.patch Scenario: 1. cluster with 4 DNs 2. the size of the file to be written is a little more than one block 3. write the first block to 3 DNs, DN1-DN2-DN3 4. all the data packets of first block is successfully acked and the client sets the pipeline stage to PIPELINE_CLOSE, but the last packet isn't sent out 5. DN2 and DN3 are down 6. client recovers the pipeline, but no new DN is added to the pipeline because of the current pipeline stage is PIPELINE_CLOSE 7. 
client continuously writes the last block, and try to close the file after written all the data 8. NN finds that the penultimate block doesn't has enough replica(our dfs.namenode.replication.min=2), and the client's close runs into indefinite loop(HDFS-2936), and at the same time, NN makes the last block's state to COMPLETE 9. shutdown the client 10. the file's lease exceeds hard limit 11. LeaseManager realizes that and begin to do lease recovery by call fsnamesystem.internalReleaseLease() 12. but the last block's state is COMPLETE, and this triggers lease manager's infinite loop and prints
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213992#comment-14213992 ] Yongjun Zhang commented on HDFS-4882: - Hi, Thanks Zesheng for reporting the issue, Ravi for working on the solution, and other folks for reviewing. I was looking into an infinite loop case when doing checkLeases myself, and figured out that the logic in FSNamesystem#internalReleaseLease
{code}
switch(lastBlockState) {
case COMPLETE:
  assert false : "Already checked that the last block is incomplete";
  break;
{code}
doesn't take care of the case where the penultimate block is COMMITTED and the final block is COMPLETE, which causes the infinite loop. Looking at the history of this jira, I found [~jingzhao] suggested the same at https://issues.apache.org/jira/browse/HDFS-4882?focusedCommentId=14207202page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14207202 I did some analysis to share here (sorry for the long post). When the final block is COMMITTED, the current implementation does the following:
{code}
case COMMITTED:
  // Close file if committed blocks are minimally replicated    <=== scenario#1
  if (penultimateBlockMinReplication &&
      blockManager.checkMinReplication(lastBlock)) {
    finalizeINodeFileUnderConstruction(src, pendingFile,
        iip.getLatestSnapshotId());
    NameNode.stateChangeLog.warn("BLOCK*"
        + " internalReleaseLease: Committed blocks are minimally replicated,"
        + " lease removed, file closed.");
    return true;  // closed!
  }
  // Cannot close file right now, since some blocks             <=== scenario#2
  // are not yet minimally replicated.
  // This may potentially cause infinite loop in lease recovery
  // if there are no valid replicas on data-nodes.
  String message = "DIR* NameSystem.internalReleaseLease: "
      + "Failed to release lease for file " + src
      + ". Committed blocks are waiting to be minimally replicated."
      + " Try again later.";
  NameNode.stateChangeLog.warn(message);
  throw new AlreadyBeingCreatedException(message);
{code}
What it does: * For scenario#1, check minReplication for both the penultimate and last block; if satisfied, finalize the file (recover lease, close file). * For scenario#2, throw AlreadyBeingCreatedException, derived from IOException (the name of this exception appears to be a misnomer; maybe we should fix that later). To solve the case where the penultimate block is COMMITTED and the final block is COMPLETE, I'd suggest making some changes on top of the submitted patch (for further discussion): For scenario#1, we can do the same as when the last block is COMMITTED, as described above. For scenario#2, I think we have two options: # option A, drop the code in the existing code that handles scenario#2 (not throwing the exception), and let checkLeases check back again (2 seconds is the current interval), waiting for a block report to finish to change the minimal-replication situation, then recover the lease. The infinite loop could still happen if minimal replication never gets satisfied. But this would be rare, assuming minimal replication can be satisfied eventually. # option B, do similar logic as in the existing code (throwing AlreadyBeingCreatedException). There is an issue with this option too, described below. With option B, looking at the caller side (LeaseManager#checkLeases), whenever an IOException is caught, it just goes ahead and removes the lease. So the possible infinite loop described in the scenario#2 comment will not happen, because of the lease removal (lease recovered). But the problem with option B is that after the lease removal, the file may still have blocks not satisfying minimal replication (scenario#2), which would be a potential issue. This situation exists in the current implementation when handling the case where the last block is COMMITTED. I think we should wait for minimal replication to be satisfied before recovering the lease.
So it looks like option A is preferable. But the original code tries to recover the lease immediately; I'm not sure whether there is any catch here. Comments, thoughts? Thanks again.
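Option B's caller-side consequence can be modeled in isolation. The sketch below is a simplified stand-in for the checkLeases loop, not the real LeaseManager code (the {{Releaser}} interface and the {{TreeSet}} of expired leases are hypothetical), showing why a thrown IOException ends the loop at the cost of dropping the lease while blocks may still be under-replicated:

```java
import java.io.IOException;
import java.util.TreeSet;

public class CheckLeasesSketch {
  // Simplified model: internalRelease either recovers the lease (true),
  // keeps it pending (false), or throws. On a throw the caller removes
  // the lease anyway, which is why option B avoids the infinite loop but
  // may leave under-replicated blocks behind.
  interface Releaser { boolean internalRelease(long lease) throws IOException; }

  static int checkLeases(TreeSet<Long> expired, Releaser r) {
    int removed = 0;
    Long lease = expired.isEmpty() ? null : expired.first();
    while (lease != null) {
      try {
        if (r.internalRelease(lease)) {
          expired.remove(lease);
          removed++;
        }
      } catch (IOException e) {
        expired.remove(lease); // option B: lease dropped despite the failure
        removed++;
      }
      lease = expired.higher(lease); // advance without re-visiting
    }
    return removed;
  }

  public static void main(String[] args) {
    TreeSet<Long> leases = new TreeSet<>();
    leases.add(1L); leases.add(2L);
    int n = checkLeases(leases, l -> { throw new IOException("not replicated"); });
    System.out.println(n); // 2 -- both removed even though release failed
  }
}
```

With a releaser that simply returns false (option A's shape), the same loop removes nothing and just retries on the next check interval.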
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213999#comment-14213999 ] Yongjun Zhang commented on HDFS-4882: - BTW, the scenario reported in the jira description is just one special case. The infinite loop could happen in other scenarios too, whenever the penultimate block is COMMITTED and the final block is COMPLETE. That is, the LeaseManager#checkLeases method is holding FSNamesystem.writeLock() all the time because of the infinite loop, so all the threads in NN that process block reports will be blocked waiting for FSNamesystem.writeLock(). So even if a block report is ready to satisfy the minimal replication, it won't be processed.
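The write-lock starvation described above is easy to demonstrate with {{ReentrantReadWriteLock}} standing in for the FSN lock (block-report handling actually takes the write lock too, but a blocked reader shows the effect just as well; all names here are illustrative):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class WriteLockStarvationDemo {
  // Returns true if a second thread could NOT acquire the read lock while
  // the first thread held the write lock -- mirroring block-report threads
  // stuck behind a checkLeases loop that never releases the FSN lock.
  static boolean readerBlockedWhileWriterHolds() {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    lock.writeLock().lock();                   // "checkLeases" spinning forever
    final boolean[] acquired = {false};
    Thread reporter = new Thread(() -> {
      acquired[0] = lock.readLock().tryLock(); // fails: a writer holds the lock
      if (acquired[0]) {
        lock.readLock().unlock();
      }
    });
    reporter.start();
    try {
      reporter.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    lock.writeLock().unlock();
    return !acquired[0];
  }

  public static void main(String[] args) {
    System.out.println(readerBlockedWhileWriterHolds()); // true
  }
}
```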
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213109#comment-14213109 ] Ravi Prakash commented on HDFS-4882: These test errors are valid. They are happening because pollFirst() retrieves *and removes* the first element. Sorry for the oversight. Will upload a new patch soon.
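The {{pollFirst()}} pitfall is reproducible with a plain {{TreeSet}} (used here as a stand-in for {{sortedLeases}}):

```java
import java.util.TreeSet;

public class PollFirstDemo {
  // pollFirst() retrieves AND removes the head element, so using it merely
  // to peek at the next lease silently shrinks the set.
  static int sizeAfterPollFirst() {
    TreeSet<Integer> leases = new TreeSet<>();
    leases.add(1); leases.add(2); leases.add(3);
    leases.pollFirst();              // returns 1, but also removes it
    return leases.size();            // 2, not 3
  }

  // first() only peeks; the set is left untouched.
  static int sizeAfterFirst() {
    TreeSet<Integer> leases = new TreeSet<>();
    leases.add(1); leases.add(2); leases.add(3);
    leases.first();                  // returns 1, set unchanged
    return leases.size();            // still 3
  }

  public static void main(String[] args) {
    System.out.println(sizeAfterPollFirst()); // 2
    System.out.println(sizeAfterFirst());     // 3
  }
}
```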
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213324#comment-14213324 ] Hadoop QA commented on HDFS-4882: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681664/HDFS-4882.4.patch against trunk revision 4fb96db. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8745//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8745//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8745//console This message is automatically generated. 
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14211550#comment-14211550 ] Colin Patrick McCabe commented on HDFS-4882: The patch looks good. bq. Here's a patch which uses SortedSet.tailSet. However I still like the earlier patch more (because its a genuine case of two threads accessing the same data-structure). With tailSet we are just trying to build our own synchronization mechanism (which is likely more inefficient than the ConcurrentinternalReleaseLease). I have to admit that I find the locking to be confusing in {{LeaseManager}}. It seems like some methods which access the sets are synchronized, and some are not. And we are exposing the sets to outside code, which accesses them without synchronization. It's very inconsistent, and we should file a follow-up JIRA to look into this. I don't think there is necessarily a bug there, though... I think it's possible that the access is actually gated on the FSN lock. If that's true, we should document it and remove the synchronized blocks in {{LeaseManager}} (since in that case they would not be doing anything useful). Concurrent sets are not a magic bullet, since we still may have to deal with issues like Lease objects being deactivated while another caller holds on to them, etc. If we dig too far into locking, this patch will get a lot bigger and a lot messier. I think that we should keep this patch small and focused on the problem. One issue in your patch is that you are requiring the FSN lock to be held in {{LeaseManager#getNumUnderConstructionBlocks}}, but you don't document that anywhere. Can you please add that to a JavaDoc or other comment for this method?
{code}
+Lease leaseToCheck = null;
+try {
+  leaseToCheck = sortedLeases.first();
+} catch(NoSuchElementException e) {}
{code}
You can replace this with {{leaseToCheck = sortedLeases.pollFirst();}}
{code}
+      try {
+        leaseToCheck = sortedLeases.tailSet(leaseToCheck, false).first();
+      } catch(NoSuchElementException e) {
+        leaseToCheck = null;
+      }
{code}
You don't need this. Just use {{leaseToCheck = sortedLeases.higher(leaseToCheck)}}. bq. I'd also request for this to make it into 2.6.0 because of this issue's severity. I'm fine with the second version going into 2.6. [~acmurthy], others, what do you think?
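Both suggestions rely on navigating the sorted set without mutating it: {{higher(x)}} returns the least element strictly greater than x, and is equivalent to {{tailSet(x, false).first()}} when such an element exists. A small {{TreeSet}} sketch (identifiers illustrative, standing in for {{sortedLeases}}):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class HigherIterationDemo {
  // Walk the sorted set from smallest to largest using higher(), without
  // removing anything -- the shape of the suggested checkLeases loop.
  static List<Integer> walk(TreeSet<Integer> set) {
    List<Integer> visited = new ArrayList<>();
    Integer current = set.isEmpty() ? null : set.first();
    while (current != null) {
      visited.add(current);
      current = set.higher(current); // strictly-greater element, or null
    }
    return visited;
  }

  public static void main(String[] args) {
    TreeSet<Integer> leases = new TreeSet<>(List.of(30, 10, 20));
    System.out.println(walk(leases));  // [10, 20, 30]
    System.out.println(leases.size()); // 3 -- set left intact
    // tailSet(x, false).first() matches higher(x) when a successor exists:
    System.out.println(leases.tailSet(10, false).first().equals(leases.higher(10))); // true
  }
}
```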
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14211932#comment-14211932 ] Hadoop QA commented on HDFS-4882: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681481/HDFS-4882.3.patch against trunk revision d005404. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.TestCheckpoint org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA org.apache.hadoop.hdfs.TestFileCreation {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8735//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8735//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8735//console This message is automatically generated. 
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208478#comment-14208478 ] Ravi Prakash commented on HDFS-4882: s/ConcurrentinternalReleaseLease/ConcurrentSkipList/
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208622#comment-14208622 ] Hadoop QA commented on HDFS-4882: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681116/HDFS-4882.2.patch against trunk revision be7bf95. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerDynamicBehavior org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8723//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8723//console This message is automatically generated. 
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208657#comment-14208657 ] Ravi Prakash commented on HDFS-4882: These unit test failures are spurious and unrelated to the code changes in the patch.
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206930#comment-14206930 ] Ravi Prakash commented on HDFS-4882: I'll try to refactor the code to use SortedSet anyway.
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207202#comment-14207202 ] Jing Zhao commented on HDFS-4882: bq. In this case, for a very strange reason which I have as yet not been able to uncover, the FSNamesystem wasn't able to recover the lease.
According to [~wuzesheng]'s earlier analysis, this may happen when the penultimate block is in COMMITTED state while the last block is in COMPLETE state. The following code in FSNamesystem#internalReleaseLease may have an issue:
{code}
switch(lastBlockState) {
case COMPLETE:
  assert false : "Already checked that the last block is incomplete";
  break;
{code}
Since it is possible for the penultimate block to be in COMMITTED state while the last block is in COMPLETE state, the assertion above can be wrong. Instead, I think we can combine the COMPLETE and COMMITTED cases in the switch expression, so that the current replication factor of both the last and the penultimate blocks is checked.
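Jing's suggested combined-case handling can be sketched as follows. This is an illustrative model only, not the actual FSNamesystem code: BlockState, canCloseFile, and the replica-count parameters are hypothetical stand-ins for the real internalReleaseLease logic.

```java
// Hypothetical model of the combined COMMITTED/COMPLETE check: instead of
// asserting the last block is incomplete (which fails when the penultimate
// block is still COMMITTED), verify minimum replication of both the last
// and penultimate blocks before closing the file.
public class LeaseRecoveryCheck {
    enum BlockState { UNDER_CONSTRUCTION, COMMITTED, COMPLETE }

    /** Returns true when the file can simply be closed instead of looping. */
    static boolean canCloseFile(BlockState last,
                                int penultimateReplicas, int lastReplicas,
                                int minReplication) {
        switch (last) {
            case COMPLETE:
            case COMMITTED:
                // Combined case: check replication of both trailing blocks.
                return penultimateReplicas >= minReplication
                    && lastReplicas >= minReplication;
            default:
                return false; // still under construction: start block recovery
        }
    }

    public static void main(String[] args) {
        // Penultimate COMMITTED but fully replicated, last COMPLETE: closable.
        System.out.println(canCloseFile(BlockState.COMPLETE, 2, 2, 2));
        // Penultimate block under-replicated (the HDFS-4882 scenario,
        // dfs.namenode.replication.min=2): not closable yet.
        System.out.println(canCloseFile(BlockState.COMPLETE, 1, 2, 2));
    }
}
```

The point of the sketch is that neither trailing-block state combination ever reaches an unconditional `assert false`, so lease recovery cannot be wedged by a COMMITTED-then-COMPLETE pair.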
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207393#comment-14207393 ] Ravi Prakash commented on HDFS-4882: Thanks Jing! I was trying to create a unit test to replicate this failure. Please feel free to upload a patch on HDFS-7342 if you can.
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14205138#comment-14205138 ] Colin Patrick McCabe commented on HDFS-4882: If I understand this correctly, the issue here is that there is an expired lease which is not getting removed from {{sortedLeases}}, causing {{LeaseManager#checkLeases}} to loop forever. Maybe this is a dumb question, but shouldn't we fix the code so that this expired lease does get removed? Is this patch really fixing the root issue? It seems like if we let expired leases stick around forever we may have other problems. Also, this patch seems to replace a {{SortedSet}} with a {{ConcurrentSkipListSet}}. We don't need the overhead of a concurrent set here... the set is only modified while holding the lock. If you want to modify while iterating, you can simply use an {{Iterator}} for this purpose. Or, since the set is sorted, you can use {{SortedSet#tailSet}} to find the element after the previous element you were looking at.
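The two alternatives Colin mentions can be sketched as below. The Lease entries are stand-in integers, not the real LeaseManager types; only the collection idioms themselves are the point.

```java
import java.util.Iterator;
import java.util.SortedSet;
import java.util.TreeSet;

// Sketch of mutating a sorted set of leases while scanning it, without a
// concurrent collection.
public class SortedSetScan {
    public static void main(String[] args) {
        TreeSet<Integer> leases = new TreeSet<>();
        for (int i = 0; i < 10; i++) leases.add(i);

        // Option 1: Iterator#remove is the only safe way to delete from a
        // TreeSet mid-iteration (removing via the set itself would throw
        // ConcurrentModificationException on the next hasNext()/next()).
        for (Iterator<Integer> it = leases.iterator(); it.hasNext(); ) {
            if (it.next() < 3) it.remove();   // drop "expired" leases 0..2
        }

        // Option 2: since the set is sorted, tailSet(prev, false) views the
        // elements strictly after the last one examined, so a scan can be
        // resumed after arbitrary modifications made between steps.
        Integer prev = leases.first();
        SortedSet<Integer> rest = leases.tailSet(prev, false);
        System.out.println(leases.first());      // 3
        System.out.println(rest.first());        // 4
    }
}
```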
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14205669#comment-14205669 ] Ravi Prakash commented on HDFS-4882: Thanks for your review Colin! Your understanding is correct. In this case, for a very strange reason which I have as yet not been able to uncover, the FSNamesystem wasn't able to recover the lease. I am investigating this root issue in HDFS-7342. In the meantime, however, I'd argue that the Namenode should never enter an infinite loop for any reason: instead of assuming we have fixed all possible reasons why a lease couldn't be recovered, we should relinquish the lock regularly. We should also display on the webUI how many files are open for writing and allow ops to forcibly close open files (HDFS-7307). The way in which this error manifests (the NN suddenly stops working) is egregious. sortedLeases is also used externally in FSNamesystem.getCompleteBlocksTotal(), and we were actively modifying it in checkLeases. I'm sure we can move things around to keep using SortedSets, but I don't know if this collection will ever really become big enough for the performance difference to matter. What do you think?
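The defensive shape being argued for (guarantee forward progress even when a lease survives a recovery attempt, and warn instead of spinning) might be sketched like this. It is an illustration of the pattern only, not the committed patch; checkLeases, tryRecover, and the String-keyed set are hypothetical simplifications of the real LeaseManager.

```java
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.function.Predicate;

// Sketch of a checkLeases-style loop that cannot spin forever: if the
// oldest lease is the same one we just tried to recover, log a warning
// and bail out rather than re-examining it indefinitely.
public class CheckLeasesSketch {
    /** Returns the number of leases examined before the loop finished. */
    static int checkLeases(SortedSet<String> sortedLeases,
                           Predicate<String> tryRecover) {
        int examined = 0;
        String previous = null;
        while (!sortedLeases.isEmpty()) {
            String oldest = sortedLeases.first();
            if (oldest.equals(previous)) {
                // Recovery did not remove the lease; warn and stop instead
                // of looping on the same entry.
                System.err.println("WARN: unable to release lease " + oldest);
                break;
            }
            previous = oldest;
            examined++;
            if (tryRecover.test(oldest)) {
                sortedLeases.remove(oldest);  // normal path: lease released
            }
        }
        return examined;
    }

    public static void main(String[] args) {
        SortedSet<String> leases =
            new TreeSet<>(java.util.Arrays.asList("a", "b", "c"));
        // "b" is a stuck lease (recovery never succeeds); the loop must
        // still terminate.
        int n = checkLeases(leases, l -> !l.equals("b"));
        System.out.println(n);  // examined "a" and "b", then gave up
    }
}
```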
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195227#comment-14195227 ] Hadoop QA commented on HDFS-4882:
{color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12678992/HDFS-4882.1.patch against trunk revision 67f13b5.
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8628//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8628//console
This message is automatically generated.
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195675#comment-14195675 ] Ravi Prakash commented on HDFS-4882: Can someone please review and commit this JIRA?
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195678#comment-14195678 ] Ravi Prakash commented on HDFS-4882: I've filed HDFS-7342 to investigate why the lease wasn't recovered. [~wuzesheng] and [~umamaheswararao], were you ever able to consistently produce the situation in which leases weren't recovered? Perhaps as a unit test?
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14191183#comment-14191183 ] Hadoop QA commented on HDFS-4882:
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12678329/HDFS-4882.patch against trunk revision 5e3f428.
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.TestEditLog
The following test timeouts occurred in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.TestDFSClientRetries org.apache.hadoop.hdfs.TestLeaseRecovery2
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8605//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8605//console
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189396#comment-14189396 ] Hadoop QA commented on HDFS-4882: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12586700/4882.patch against trunk revision d33e07d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes The following test timeouts occurred in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.TestDeleteRace {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8589//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8589//console This message is automatically generated. 
Namenode LeaseManager checkLeases() runs into infinite loop
---
Key: HDFS-4882
URL: https://issues.apache.org/jira/browse/HDFS-4882
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs-client, namenode
Affects Versions: 2.0.0-alpha
Reporter: Zesheng Wu
Attachments: 4882.1.patch, 4882.patch, 4882.patch

Scenario:
1. Cluster with 4 DNs.
2. The size of the file to be written is a little more than one block.
3. Write the first block to 3 DNs: DN1-DN2-DN3.
4. All the data packets of the first block are successfully acked and the client sets the pipeline stage to PIPELINE_CLOSE, but the last packet isn't sent out.
5. DN2 and DN3 go down.
6. The client recovers the pipeline, but no new DN is added to the pipeline because the current pipeline stage is PIPELINE_CLOSE.
7. The client continues writing the last block, and tries to close the file after writing all the data.
8. The NN finds that the penultimate block doesn't have enough replicas (our dfs.namenode.replication.min=2), so the client's close() runs into an indefinite loop (HDFS-2936); at the same time, the NN moves the last block's state to COMPLETE.
9. Shut down the client.
10. The file's lease exceeds the hard limit.
11. The LeaseManager notices this and begins lease recovery by calling fsnamesystem.internalReleaseLease().
12. But the last block's state is COMPLETE, which triggers the lease manager's infinite loop and prints massive logs like this:
{noformat}
2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease [Lease. Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1] has expired hard limit
2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease. Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1], src=/user/h_wuzesheng/test.dat
2013-06-05,17:42:25,695 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: File = /user/h_wuzesheng/test.dat, block blk_-7028017402720175688_1202597, lastBLockState=COMPLETE
2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: Started block recovery for file /user/h_wuzesheng/test.dat lease [Lease. Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1]
{noformat}
(the 3rd log line is a debug log added by us)
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
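Editor's note: the fix discussed in this thread makes checkLeases() bail out (with a warning) when internalReleaseLease() fails to remove the expired lease, instead of spinning forever. Below is a minimal, hypothetical sketch of that progress guard; all class and method names here are illustrative stand-ins, not the real LeaseManager API.

```java
import java.util.Queue;

/**
 * Simplified model of the HDFS-4882 situation: checkLeases() must make
 * progress on every iteration, or log and stop, instead of looping forever
 * when lease recovery leaves the expired lease in place.
 */
public class LeaseCheck {

  /** Illustrative lease: "removableByRecovery" models whether recovery succeeds. */
  static class Lease {
    final String holder;
    final boolean removableByRecovery;
    Lease(String holder, boolean removable) {
      this.holder = holder;
      this.removableByRecovery = removable;
    }
  }

  /** Returns the number of leases actually released. */
  static int checkLeases(Queue<Lease> expiredLeases) {
    int released = 0;
    while (!expiredLeases.isEmpty()) {
      Lease oldest = expiredLeases.peek();
      if (internalReleaseLease(oldest)) {
        expiredLeases.remove();
        released++;
      } else {
        // Progress guard: without this branch the loop would spin forever on
        // a lease that recovery cannot remove (the bug reported in this issue).
        System.err.println("WARN: could not release lease for " + oldest.holder
            + "; stopping checkLeases to avoid an infinite loop");
        break;
      }
    }
    return released;
  }

  /** Stand-in for FSNamesystem.internalReleaseLease(). */
  static boolean internalReleaseLease(Lease lease) {
    return lease.removableByRecovery;
  }
}
```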
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13719847#comment-13719847 ] Shrijeet Paliwal commented on HDFS-4882:
----------------------------------------
Just wanted to report that we have been hit by this (or an issue that produces exactly the same symptoms) twice in 3 days.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697743#comment-13697743 ] Uma Maheswara Rao G commented on HDFS-4882:
-------------------------------------------
{quote}
When a DN in the pipeline is down and the pipeline stage is PIPELINE_CLOSE, the client triggers the data replication rather than waiting for the NN to do it (the NN needs the file to be finalized to do the replication, but finalizing needs all the blocks to have at least dfs.namenode.replication.min (=2) replicas; these two conditions contradict each other).
{quote}
What do you mean by 'the client triggers the data replication'? The file will not be finalized until all of its blocks reach at least min replication. But block replication can start once the block is committed on the DN (each DN will finalize its block and report it to the NN). On reaching min replication the NN will complete that block. If all blocks in the NN are in COMPLETE state, then the file can be closed normally.
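Editor's note: the min-replication rule described above (a block completes once enough finalized replicas are reported; a file closes only when every block is COMPLETE) can be sketched as follows. This is an illustrative model only; the names and the int-array representation are assumptions, not the real NameNode code.

```java
/**
 * Illustrative model of the rule: a block becomes COMPLETE once the NameNode
 * has received at least dfs.namenode.replication.min finalized replica
 * reports, and a file can only be closed once every block is COMPLETE.
 */
public class MinReplicationRule {

  static final int MIN_REPLICATION = 2;  // dfs.namenode.replication.min

  /** A committed block completes when enough DNs have reported finalized replicas. */
  static boolean canCompleteBlock(int reportedFinalizedReplicas) {
    return reportedFinalizedReplicas >= MIN_REPLICATION;
  }

  /** close() succeeds only when all blocks of the file are COMPLETE. */
  static boolean canCloseFile(int[] replicasPerBlock) {
    for (int replicas : replicasPerBlock) {
      if (!canCompleteBlock(replicas)) {
        return false;  // e.g. the under-replicated penultimate block in step 8
      }
    }
    return true;
  }
}
```

In the reported scenario, the penultimate block has only one live replica, so canCloseFile() stays false and the client's close() cannot succeed.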
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697784#comment-13697784 ] Zesheng Wu commented on HDFS-4882:
----------------------------------
1. {quote}What you mean by 'the client triggers the data replication'?{quote}
I mean letting the client go through the following replica-transfer process:
{code}
// transfer replica
final DatanodeInfo src = d == 0 ? nodes[1] : nodes[d - 1];
final DatanodeInfo[] targets = {nodes[d]};
transfer(src, targets, lb.getBlockToken());
{code}
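Editor's note: the source-selection logic in the snippet above can be restated self-contained. This sketch uses plain strings instead of DatanodeInfo, purely for illustration: when pipeline slot d needs a replica, copy it from a neighbour (the next node if d is first, otherwise the previous node).

```java
/**
 * Illustrative restatement of the replica-transfer snippet: pick which
 * existing pipeline node should serve as the copy source for nodes[d].
 */
public class ReplicaTransfer {

  /** Pick the source node for transferring a replica to nodes[d]. */
  static String pickTransferSource(String[] nodes, int d) {
    // First node has no predecessor, so borrow from its successor.
    return d == 0 ? nodes[1] : nodes[d - 1];
  }
}
```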
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697787#comment-13697787 ] Zesheng Wu commented on HDFS-4882:
----------------------------------
{quote}
File will not be finalized until it reached at least min replication blocks. But block replication can be started once it is committed to DN (here each DN will finalize its block and report NN). On reaching min replication NN will complete that block. If all Blocks in NN are in complete state then file can be closed normally.
{quote}
Yes, I think you are right. But in our case the block was never committed to the DNs, so its replication never started.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13678330#comment-13678330 ] Hadoop QA commented on HDFS-4882:
---------------------------------
{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12586700/4882.patch
against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4497//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4497//console
This message is automatically generated.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13676941#comment-13676941 ] Hadoop QA commented on HDFS-4882:
---------------------------------
{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12586483/4882.1.patch
against trunk revision .
{color:red}-1 patch{color}. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4485//console
This message is automatically generated.