[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222629#comment-14222629 ]

Colin Patrick McCabe commented on HDFS-4882:

Hi [~vinayrpet],

This change makes the code more robust because it avoids going into an infinite loop in the case where the lease is not removed from {{LeaseManager#leases}} during the loop body. The change doesn't harm anything: things are just as efficient as before, and in the unlikely case that we can't remove the lease, we log a warning message so we are aware of the problem.

Why don't we continue the discussion about the sequence of operations that could trigger this over on HDFS-7342, and commit this in the meantime to fix the immediate problem for [~raviprak]? I am +1; any objections to committing this tomorrow?

Also, [~yzhangal], can you comment on whether you have also observed this bug? Vinayakumar seems to be questioning whether this loop can occur, but I thought you had seen the LeaseManager thread loop in the field... I apologize if I'm putting words in your mouth, though.

Namenode LeaseManager checkLeases() runs into infinite loop
---
Key: HDFS-4882
URL: https://issues.apache.org/jira/browse/HDFS-4882
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs-client, namenode
Affects Versions: 2.0.0-alpha, 2.5.1
Reporter: Zesheng Wu
Assignee: Ravi Prakash
Priority: Critical
Attachments: 4882.1.patch, 4882.patch, 4882.patch, HDFS-4882.1.patch, HDFS-4882.2.patch, HDFS-4882.3.patch, HDFS-4882.4.patch, HDFS-4882.5.patch, HDFS-4882.6.patch, HDFS-4882.7.patch, HDFS-4882.patch

Scenario:
1. Cluster with 4 DNs.
2. The size of the file to be written is a little more than one block.
3. Write the first block to 3 DNs: DN1-DN2-DN3.
4. All the data packets of the first block are successfully acked and the client sets the pipeline stage to PIPELINE_CLOSE, but the last packet isn't sent out.
5. DN2 and DN3 go down.
6. The client recovers the pipeline, but no new DN is added to the pipeline because the current pipeline stage is PIPELINE_CLOSE.
7. The client continues writing the last block and tries to close the file after writing all the data.
8. The NN finds that the penultimate block doesn't have enough replicas (our dfs.namenode.replication.min=2), so the client's close runs into an indefinite loop (HDFS-2936); at the same time, the NN sets the last block's state to COMPLETE.
9. Shut down the client.
10. The file's lease exceeds the hard limit.
11. The LeaseManager notices this and begins lease recovery by calling fsnamesystem.internalReleaseLease().
12. But the last block's state is COMPLETE, which triggers the LeaseManager's infinite loop and prints massive logs like this:
{noformat}
2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease [Lease. Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1] has expired hard limit
2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease. Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1], src=/user/h_wuzesheng/test.dat
2013-06-05,17:42:25,695 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: File = /user/h_wuzesheng/test.dat, block blk_-7028017402720175688_1202597, lastBLockState=COMPLETE
2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: Started block recovery for file /user/h_wuzesheng/test.dat lease [Lease. Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1]
{noformat}
(The 3rd log line is a debug log added by us.)

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
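The hazard in steps 11-12 can be modeled with a minimal, hypothetical sketch (class and method names here are illustrative, not the actual HDFS code, which calls FSNamesystem#internalReleaseLease under the namesystem write lock): a monitor loop keyed on the first element of a sorted lease set spins forever if releasing that lease never removes it, unless it checks for progress.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Simplified, hypothetical model of the checkLeases() hazard discussed
// in this issue. If releasing the head lease fails to remove it, a loop
// keyed on sortedLeases.first() would re-examine the same lease forever;
// the progress check below breaks out instead of spinning.
public class CheckLeasesSketch {

    // Returns the number of iterations performed before the loop finished
    // or gave up. 'releaseSucceeds' simulates whether lease release
    // actually removes the lease from the set.
    public static int checkLeases(TreeSet<String> sortedLeases, boolean releaseSucceeds) {
        int iterations = 0;
        while (!sortedLeases.isEmpty()) {
            String leaseToCheck = sortedLeases.first();
            iterations++;
            if (releaseSucceeds) {
                sortedLeases.remove(leaseToCheck); // normal path: progress made
            }
            // Guard: if the lease we just tried to release is still the
            // first element, no progress was made -- warn and break.
            if (!sortedLeases.isEmpty() && sortedLeases.first().equals(leaseToCheck)) {
                System.err.println("Unable to release lease: " + leaseToCheck);
                break;
            }
        }
        return iterations;
    }

    public static void main(String[] args) {
        TreeSet<String> ok = new TreeSet<>(Arrays.asList("lease1", "lease2"));
        assert checkLeases(ok, true) == 2;     // both leases released normally

        TreeSet<String> stuck = new TreeSet<>(Arrays.asList("lease1", "lease2"));
        assert checkLeases(stuck, false) == 1; // gives up after one attempt
    }
}
```

Without the guard, the second call would never return, which is exactly the behavior reported in this jira.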
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222680#comment-14222680 ]

Vinayakumar B commented on HDFS-4882:

bq. This change makes the code more robust because it avoids going into an infinite loop in the case where the lease is not removed from LeaseManager#leases during the loop body. The change doesn't harm anything... things are just as efficient as before, and in the unlikely case that we can't remove the lease, we log a warning message so we are aware of the problem

Yes, you are right. Even though I don't see the possibility of an infinite loop in the existing trunk code, the changes made in the patch look good. I am +1 for the change.

{quote}Why don't we continue the discussion about the sequence of operations that could trigger this over on HDFS-7342? And commit this in the meantime to fix the immediate problem for Ravi Prakash. I am +1, any objections to committing this tomorrow?{quote}

Yes, let's continue this discussion in HDFS-7342.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222737#comment-14222737 ]

Yongjun Zhang commented on HDFS-4882:

Hi [~cmccabe] and [~vinayrpet],

Thanks for your comments.

{quote}can you comment on whether you have also observed this bug?{quote}

Yes, I did observe a similar infinite loop, and by studying the code, I concluded that the case I was looking at has exactly the same root cause as the one reported here. Please see the details described in my earlier comment at https://issues.apache.org/jira/browse/HDFS-4882?focusedCommentId=14213992&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14213992 and the several comments after it. In short, when the penultimate block is COMMITTED and the last block is COMPLETE, the following block of code is executed:

{code}
switch(lastBlockState) {
case COMPLETE:
  assert false : "Already checked that the last block is incomplete";
  break;
{code}

which returns to the LeaseManager without releasing the corresponding lease, which stays as the first element in {{sortedLeases}}. The LeaseManager keeps examining the first entry in {{sortedLeases}} again and again, while holding the FSNamesystem#writeLock, thus causing the infinite loop.

{quote}Yes, you are right. Even though I don't see the possibility of infinite loop by considering existing code in trunk, changes made in the patch looks pretty cool.{quote}

See above for the explanation of the infinite loop in the existing code.

{quote}Yes, lets continue this discussion in HDFS-7342{quote}

In HDFS-7342, Ravi worked out a test case to demonstrate the problem and I suggested a solution. Thanks in advance for your review and comments there; I hope we can converge on a solution soon. Avoiding the infinite loop is just part of the complete solution; the other part is getting the lease released, which is what HDFS-7342 tries to address. Thanks.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220680#comment-14220680 ]

Colin Patrick McCabe commented on HDFS-4882:

[~yzhangal]: I agree, we should fix the {{LOG.warn("Unable to release hard-limit expired lease ...")}} code a bit. I agree with the idea of replacing the {{return}} in the loop body with a {{break}}, and putting the LOG at the end. One way of looking at this is that we're trying to enforce the invariant that if there are any leases left at the end of the function, they should be unexpired leases.
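The invariant Colin describes can be sketched as follows. This is a hypothetical, simplified model, not the actual patch: lease state is reduced to a record, and the real code operates on LeaseManager.Lease objects under the namesystem lock. The point is the control flow: break instead of return, so the warning at the end always runs and the only leases left behind are unexpired ones.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the suggested structure: on a stuck expired
// lease, break out of the loop (rather than return) and warn once at
// the end, enforcing the invariant that any lease remaining in the
// queue after this pass is unexpired.
public class LeaseInvariantSketch {

    public record Lease(String holder, boolean expired, boolean releasable) {}

    // Releases expired leases at the head of the queue. Returns the
    // holder of a stuck lease if the invariant could not be enforced,
    // or null if every expired lease was released.
    public static String releaseExpired(Deque<Lease> sortedLeases) {
        String stuck = null;
        while (!sortedLeases.isEmpty() && sortedLeases.peekFirst().expired()) {
            Lease lease = sortedLeases.peekFirst();
            if (lease.releasable()) {
                sortedLeases.removeFirst(); // released: progress made
            } else {
                stuck = lease.holder();     // break, don't return, so the
                break;                      // warning below still executes
            }
        }
        if (stuck != null) {
            System.err.println("Unable to release hard-limit expired lease: " + stuck);
        }
        return stuck;
    }
}
```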
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220705#comment-14220705 ]

Vinayakumar B commented on HDFS-4882:

The test added in HDFS-7342 is not a real-world test: the block states are set from the test code itself, not produced by actual operations. {{lm.setLeasePeriod(0, 0);}} is what leads to the infinite loop in that test code.

Here is the sequence of operations that happens on lease recovery when the hard limit expires:
1. The last block will be in *UNDER_CONSTRUCTION or UNDER_RECOVERY* state initially, since the block is not full and the file is not closed yet.
2. Check the block states; if the last block is in UNDER_CONSTRUCTION/UNDER_RECOVERY state, then *re-assign* the lease to the HDFS_NameNode holder. Here the original expired lease is removed and a fresh HDFS_NameNode lease is added. Note that *this lease is not expired*.
3. Queue the block to be recovered on the primary datanode.
4. In the LeaseManager, {{sortedLeases}} will contain the HDFS_NameNode lease instead of the original expired lease. *So checkLeases() will not enter an infinite loop.*
5. Now {{commitBlockSynchronization()}}, after recovery of the last block, *will make the last block COMPLETE and remove the lease*, but it fails to close the file because the penultimate block does not meet min replication. Since the lease is already removed, checkLeases() will not enter an infinite loop.

Interestingly, I tried a restart of the Namenode after the above steps; then, while loading edits, all blocks except the last block are treated as COMPLETE. So the penultimate block was in COMPLETE state, and hence lease recovery was successful for the same file.
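The re-assignment in step 2 above can be sketched with a minimal, hypothetical model (the real code is in FSNamesystem, and the lease bookkeeping here is reduced to a path-to-holder map for illustration): replacing the expired client lease with a fresh NameNode-held lease is what lets checkLeases() make progress on this path.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of lease re-assignment during recovery: the
// expired client lease is replaced by a fresh lease held by the
// NameNode itself, so the expired entry no longer sits at the head
// of the lease set. The holder string mirrors the one the NameNode
// uses for itself.
public class LeaseReassignSketch {
    public static final String NN_HOLDER = "HDFS_NameNode";

    // path -> current lease holder
    private final Map<String, String> leaseByPath = new HashMap<>();

    public void addLease(String holder, String path) {
        leaseByPath.put(path, holder);
    }

    // Re-assign the lease on 'path' to the NameNode for recovery;
    // returns the previous (expired) holder, e.g. for logging.
    public String reassignLease(String path) {
        return leaseByPath.put(path, NN_HOLDER);
    }

    public String holderOf(String path) {
        return leaseByPath.get(path);
    }
}
```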
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221046#comment-14221046 ]

Yongjun Zhang commented on HDFS-4882:

Thanks [~cmccabe] and [~vinayrpet] for the review and further look.

Hi Vinayakumar,

Yes, the test case Ravi created in HDFS-7342 is a bit artificial, but it helps reproduce the infinite-loop case we are dealing with here. Setting the leasePeriod to 0 is just to trigger the lease recovery, and thus the infinite loop, in this test.

The steps you described appear to be a different scenario than what we are addressing in this jira. The case here is that the last block is already in COMPLETE state when the hard lease limit expires; your case is that the last block is not yet in COMPLETE state. When the last block is in *UNDER_CONSTRUCTION or UNDER_RECOVERY* state, it's handled differently in {{FSNamesystem#internalReleaseLease}}, as you described. I believe that's why you don't see the infinite loop.

What you described is an interesting case too, though. For the failure in step 5 you described, would you please create a different jira for your case? Thanks.

Hi [~raviprak], would you please help address my earlier comments that Colin agreed upon? Thanks.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221393#comment-14221393 ]

Hadoop QA commented on HDFS-4882:

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12682913/HDFS-4882.7.patch
against trunk revision c298a9a.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8801//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8801//console

This message is automatically generated.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221458#comment-14221458 ]

Yongjun Zhang commented on HDFS-4882:

Hi Ravi, many thanks for the new rev! It looks good to me.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219599#comment-14219599 ]

Yongjun Zhang commented on HDFS-4882:

Thanks a lot [~cmccabe], good comments!

Hi [~raviprak], many thanks for your earlier work. Without adding the warning message Colin suggested, there will still be a message reported every 2 seconds like "Recovering lease ..." when the lease can't be released, but adding the warning message would help us identify a similar issue more easily. This is a quite urgent issue for the case I was looking at; I'd really appreciate it if you could address Colin's comments at your earliest convenience. Thanks!
LeaseManager realizes that and begin to do lease recovery by call fsnamesystem.internalReleaseLease() 12. but the last block's state is COMPLETE, and this triggers lease manager's infinite loop and prints massive logs like this: {noformat} 2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease [Lease. Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1] has expired hard limit 2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease. Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1], src= /user/h_wuzesheng/test.dat 2013-06-05,17:42:25,695 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: File = /user/h_wuzesheng/test.dat, block blk_-7028017402720175688_1202597, lastBLockState=COMPLETE 2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: Started block recovery for file /user/h_wuzesheng/test.dat lease [Lease. Holder: DFSClient_NONM APREDUCE_-1252656407_1, pendingcreates: 1] {noformat} (the 3rd line log is a debug log added by us) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14219746#comment-14219746 ] Yongjun Zhang commented on HDFS-4882: - Hi Ravi, Thanks for responding quickly! I looked at the change; a minor suggestion:
{code}
470	if (leaseToCheck != sortedLeases.first()) {
471	  LOG.warn("Unable to release the oldest lease: " + sortedLeases.first());
472	}
{code}
the lease that cannot be released may not be the oldest in the loop. Maybe we can change the message to "Unable to release hard-limit expired lease:" to be more accurate. On the other hand, since all older leases are released already in this checkLeases() run, this unreleased lease will be the oldest lease 2 seconds later :-) Another thought: in each checkLeases() run there might be multiple leases that cannot be released, and with your latest change we only WARN about one of them here. If we replace the above code with a loop that checks not only the first, but also the second, the third, and so on until it hits leaseToCheck, then we are able to report all leases that were not released. Just a thought, not necessarily something we have to do in this jira, because having one is already a good indicator, and we do have the report every 2 seconds for all leases we attempt to release. Hi Colin: further comments? Thanks.
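Yongjun's "report all of them" idea can be sketched in isolation. The following is a hypothetical, simplified model, not the real LeaseManager API (leases are plain strings, and the class and method names are invented for illustration): any lease still sitting in sortedLeases ahead of the loop cursor leaseToCheck was examined but could not be released, so we can collect all of them instead of warning about only the first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class WarnAllStuckLeases {
    /**
     * Collect every lease that a checkLeases() pass already examined but
     * could not release: those are exactly the entries that still sit in
     * sortedLeases ahead of the loop cursor leaseToCheck.
     */
    public static List<String> stuckLeases(TreeSet<String> sortedLeases,
                                           String leaseToCheck) {
        List<String> stuck = new ArrayList<>();
        for (String lease : sortedLeases) {
            if (lease.equals(leaseToCheck)) {
                break; // leases from here on have not been examined yet
            }
            stuck.add(lease); // examined, but still present => not released
        }
        return stuck;
    }
}
```

With leases sorted oldest-first, everything between first() and the cursor would be reported in one pass.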
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14219919#comment-14219919 ] Ravi Prakash commented on HDFS-4882: Hi Yongjun! sortedLeases is modified while iterating through the loop. So sortedLeases.first() is necessarily the oldest lease, and if leaseToCheck != sortedLeases.first(), it means we are looking to release a lease younger than the oldest. I thought about logging all the leases which couldn't be released, but considering that we expect this to be a rare occurrence, I didn't see the cost-benefit in extra code which will probably never run.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14219936#comment-14219936 ] Yongjun Zhang commented on HDFS-4882: - Hi Ravi, I agree that we don't really need to WARN about all the leases. What I meant by "it may not be the oldest" is: in each run of checkLeases() (which happens every 2 seconds), the first lease examined in the loop (the real oldest in this run) can be released and removed from sortedLeases, and then we move on to the next item in sortedLeases. In that sense, an unreleased lease may not be the oldest among all the leases examined in this loop. Thanks.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14220013#comment-14220013 ] Hadoop QA commented on HDFS-4882: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12682699/HDFS-4882.5.patch against trunk revision a9a0cc3. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8789//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8789//console This message is automatically generated.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14220198#comment-14220198 ] Yongjun Zhang commented on HDFS-4882: - Hi Ravi, many thanks for addressing Colin's and my comments quickly! Hi [~cmccabe], the latest patch looks good to me. Would you please take a further look? Thanks a lot.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14220255#comment-14220255 ] Hadoop QA commented on HDFS-4882: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12682760/HDFS-4882.6.patch against trunk revision eb4045e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The test build failed in hadoop-hdfs-project/hadoop-hdfs. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8796//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8796//console This message is automatically generated.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14220438#comment-14220438 ] Yongjun Zhang commented on HDFS-4882: - Sorry Ravi, after thinking a bit more, I think there is still a hole here. If sortedLeases holds a single lease and it cannot be released, then the code below is never reached, and the WARN won't be printed out:
{code}
470	if (leaseToCheck != sortedLeases.first()) {
471	  LOG.warn("Unable to release hard-limit expired lease: "
472	      + sortedLeases.first());
473	}
{code}
I think we should move this code to the very bottom of the method, just before it returns:
{code}
if (leaseToCheck != sortedLeases.first()) {
  LOG.warn("Unable to release hard-limit expired lease: "
      + sortedLeases.first());
}
return needSync;
{code}
If sortedLeases is empty, leaseToCheck will be null, so we are good; if sortedLeases has only one item which can't be released, leaseToCheck will be assigned null at the end of the loop, so we are good too. The other cases should be covered as well. Right? Thanks.
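The restructuring proposed in the comment above can be simulated outside HDFS. The sketch below is a toy model under stated assumptions (leases are plain strings, `canRelease` stands in for internalReleaseLease(), the cursor-advance logic is simplified, and the class name is invented), not the real LeaseManager code; it only shows why placing the warning after the loop also covers the single-stuck-lease case.

```java
import java.util.TreeSet;
import java.util.function.Predicate;

public class CheckLeasesSketch {
    /**
     * Toy model of checkLeases() with the proposed fix applied: the loop
     * advances a cursor through the expired leases, and the warning sits
     * AFTER the loop, so it fires even when sortedLeases holds exactly one
     * lease that cannot be released. Returns the warning text, or null.
     */
    public static String checkLeases(TreeSet<String> sortedLeases,
                                     Predicate<String> canRelease) {
        String leaseToCheck = sortedLeases.isEmpty() ? null : sortedLeases.first();
        while (leaseToCheck != null) {
            if (canRelease.test(leaseToCheck)) {
                sortedLeases.remove(leaseToCheck); // released: drop it
            }
            // advance past the lease we just examined, released or not
            leaseToCheck = sortedLeases.higher(leaseToCheck);
        }
        // Warning moved to the bottom of the method: anything left over here
        // was examined during this pass but could not be released.
        return sortedLeases.isEmpty()
            ? null
            : "Unable to release hard-limit expired lease: " + sortedLeases.first();
    }
}
```

With a single unreleasable lease, the cursor ends at null while the set stays non-empty, so the post-loop check still reports it, which is the hole the in-loop check missed.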
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14220473#comment-14220473 ] Yongjun Zhang commented on HDFS-4882: - And replace the {{return needSync;}} inside the loop with {{break;}}.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14220617#comment-14220617 ] Vinayakumar B commented on HDFS-4882: - Hi [~wuzesheng], [~raviprak] and [~yzhangal], According to description, Have you able to reproduce the infinite loop in checkLeases() ? either in real cluster or using debug points in test ? Because, when I tried to reproduce, I was able to get the BlockStates, COMMITTED and COMPLETED for penultimate and last blocks respectively, But not infinite loop of checkLeases() After client shutdown, when the checkLeases() triggers, it will start block recovery for the last block. This last block recovery will indeed Succeed, but fails in {{commitBlockSynchronization()}} while closing the file due to illegal state of penultimate block. {noformat}java.lang.IllegalStateException: Failed to finalize INodeFile hardLeaseRecoveryFuzzy since blocks[0] is non-complete, where blocks=[blk_1073741825_1003{UCState=COMMITTED, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-099f5e9f-cc0b-4c63-a823-7640471d08e2:NORMAL:127.0.0.1:57247|RBW]]}, blk_1073741828_1007]. 
at com.google.common.base.Preconditions.checkState(Preconditions.java:172) at org.apache.hadoop.hdfs.server.namenode.INodeFile.assertAllBlocksComplete(INodeFile.java:214) at org.apache.hadoop.hdfs.server.namenode.INodeFile.toCompleteFile(INodeFile.java:201) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.finalizeINodeFileUnderConstruction(FSNamesystem.java:4650) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.closeFileCommitBlocks(FSNamesystem.java:4865) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:4829) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.commitBlockSynchronization(NameNodeRpcServer.java:739){noformat} But, since {{commitBlockSynchronization()}} removes the lease before trying to close the file in {{finalizeINodeFileUnderConstruction()}}, {{checkLeases()}} will not go into infinite loop. May be it will loop till {{commitBlockSynchronization()}} call comes after recovery of last block. {code} private void finalizeINodeFileUnderConstruction(String src, INodeFile pendingFile, int latestSnapshot) throws IOException, UnresolvedLinkException { assert hasWriteLock(); FileUnderConstructionFeature uc = pendingFile.getFileUnderConstructionFeature(); Preconditions.checkArgument(uc != null); leaseManager.removeLease(uc.getClientName(), src);{code} Namenode LeaseManager checkLeases() runs into infinite loop --- Key: HDFS-4882 URL: https://issues.apache.org/jira/browse/HDFS-4882 Project: Hadoop HDFS Issue Type: Bug Components: hdfs-client, namenode Affects Versions: 2.0.0-alpha, 2.5.1 Reporter: Zesheng Wu Assignee: Ravi Prakash Priority: Critical Attachments: 4882.1.patch, 4882.patch, 4882.patch, HDFS-4882.1.patch, HDFS-4882.2.patch, HDFS-4882.3.patch, HDFS-4882.4.patch, HDFS-4882.5.patch, HDFS-4882.6.patch, HDFS-4882.patch Scenario: 1. cluster with 4 DNs 2. the size of the file to be written is a little more than one block 3. write the first block to 3 DNs, DN1-DN2-DN3 4. 
all the data packets of the first block are successfully acked and the client sets the pipeline stage to PIPELINE_CLOSE, but the last packet isn't sent out
5. DN2 and DN3 are down
6. client recovers the pipeline, but no new DN is added to the pipeline because the current pipeline stage is PIPELINE_CLOSE
7. client continues writing the last block, and tries to close the file after all the data is written
8. NN finds that the penultimate block doesn't have enough replicas (our dfs.namenode.replication.min=2), the client's close runs into an indefinite loop (HDFS-2936), and at the same time NN sets the last block's state to COMPLETE
9. shut down the client
10. the file's lease exceeds the hard limit
11. LeaseManager realizes that and begins lease recovery by calling fsnamesystem.internalReleaseLease()
12. but the last block's state is COMPLETE, and this triggers the lease manager's infinite loop, which prints massive logs like this:
{noformat}
2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease [Lease. Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1] has expired hard limit
2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease. Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1], src=/user/h_wuzesheng/test.dat
2013-06-05,17:42:25,695 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: File = /user/h_wuzesheng/test.dat, block blk_-7028017402720175688_1202597, lastBLockState=COMPLETE
2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: Started block recovery for file /user/h_wuzesheng/test.dat lease [Lease. Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1]
{noformat}
(the 3rd log line is a debug log added by us)
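For reference, the shape of the bug described above can be modeled in a few lines. The sketch below is a simplified stand-alone model, not the actual LeaseManager source: it shows how a checkLeases()-style loop that always re-reads the first lease by sort order spins forever once the release step can neither recover nor remove that lease. The iteration cap exists only so the demo terminates; the real buggy loop has no such cap.

```java
import java.util.TreeSet;

// Simplified model (not the HDFS source) of LeaseManager#checkLeases:
// the loop re-reads the first lease by sort order and assumes that
// releasing it removes it from the set. When internalReleaseLease()
// leaves the lease in place (e.g. the last block is already COMPLETE),
// the same lease is fetched forever while the namesystem write lock
// is held.
public class CheckLeasesModel {

    // Stand-in for FSNamesystem#internalReleaseLease: returns false to
    // model the COMMITTED-penultimate / COMPLETE-last case where the
    // lease can be neither recovered nor removed.
    static boolean internalReleaseLease(String lease) {
        return false;
    }

    // Returns how many iterations ran before hitting the cap; the real
    // buggy loop would never terminate on its own.
    static int buggyCheckLeases(TreeSet<String> sortedLeases, int cap) {
        int iterations = 0;
        while (!sortedLeases.isEmpty() && iterations < cap) {
            String oldest = sortedLeases.first();   // same lease every time
            if (internalReleaseLease(oldest)) {
                sortedLeases.remove(oldest);
            }
            // Bug: when release fails the lease stays, and the next pass
            // picks the very same lease again.
            iterations++;
        }
        return iterations;
    }

    public static void main(String[] args) {
        TreeSet<String> leases = new TreeSet<>();
        leases.add("DFSClient_NONMAPREDUCE_-1252656407_1");
        int n = buggyCheckLeases(leases, 1000);
        System.out.println("iterations=" + n + " leasesLeft=" + leases.size());
    }
}
```

Running the model hits the cap with the lease still in the set, which is exactly the "massive logs, no progress" behavior in the scenario.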
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14220630#comment-14220630 ]

Yongjun Zhang commented on HDFS-4882:
-------------------------------------

Hi [~vinayrpet],

Thanks a lot for looking into this. Ravi wrote a testcase in HDFS-7342 that reproduces the infinite loop, and I'm seeing the same kind of infinite loop in a case here too.

For your case, did you check whether your run exercised this code in {{FSNamesystem#internalReleaseLease()}}?

{code}
    switch(lastBlockState) {
    case COMPLETE:
      assert false : "Already checked that the last block is incomplete";
      break;
{code}

If LeaseManager finds that the hard limit for the file lease has expired, and the above code is exercised (meaning penultimateBlock is COMMITTED and lastBlock is COMPLETE), it causes an infinite loop in {{LeaseManager#checkLeases}}. Thanks.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
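The no-progress case Yongjun describes can be sketched as follows. This is a hypothetical stand-alone model, not the HDFS source: it shows why the quoted COMPLETE branch starves checkLeases(), since with JVM assertions disabled (the production default) the `assert false` is a no-op and the method returns without the lease being released or reassigned.

```java
// Simplified model (not the HDFS source) of the switch quoted from
// FSNamesystem#internalReleaseLease. The COMPLETE case corresponds to
// the branch that, in production, silently does nothing, which is the
// one state checkLeases() cannot make progress on.
public class ReleaseLeaseModel {
    enum BlockUCState { UNDER_CONSTRUCTION, UNDER_RECOVERY, COMMITTED, COMPLETE }

    // Returns true when the caller can make progress on the lease
    // (recovery started or lease handed off); false models starvation.
    static boolean internalReleaseLease(BlockUCState lastBlockState) {
        switch (lastBlockState) {
        case COMPLETE:
            // The real code has `assert false : "Already checked that the
            // last block is incomplete";` here. With assertions disabled
            // that is a no-op, so the method falls out having released
            // nothing.
            return false;
        case UNDER_CONSTRUCTION:
        case UNDER_RECOVERY:
        case COMMITTED:
        default:
            // Model: pretend block recovery was started successfully.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("COMPLETE -> progress=" + internalReleaseLease(BlockUCState.COMPLETE));
        System.out.println("COMMITTED -> progress=" + internalReleaseLease(BlockUCState.COMMITTED));
    }
}
```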
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14218090#comment-14218090 ]

Yongjun Zhang commented on HDFS-4882:
-------------------------------------

For other folks' info: since we are dedicating this jira to the infinite-loop issue, Ravi and I are continuing the discussion about the solution for the lease recovery issue in HDFS-7342. Thanks.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14219030#comment-14219030 ]

Colin Patrick McCabe commented on HDFS-4882:
--------------------------------------------

bq. For other folks' info, since we are dedicating this jira to address the infinite loop issue, Ravi and I are continuing the discussion about the solution for the lease recovery issue in HDFS-7342 now. Thanks.

That makes sense to me.

{code}
131    }
132    LOG.info("Number of blocks under construction: " + numUCBlocks);
133    return numUCBlocks;
{code}

The indentation is off here.

Can we somehow issue a warning if these leases are lingering? The current patch makes the {{LeaseManager}} silently accept the extra leases, which I don't think is quite what we want. Perhaps right before {{return needSync}} we could insert a check that the lease we were last considering was the first lease by sort order, and {{LOG.warn}} a message if it wasn't. That way we would know that something was messed up when leases lingered forever.
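Colin's suggestion above can be sketched roughly as below. This is a hypothetical sketch, not the committed patch: do a single bounded pass over a snapshot of the sorted leases (so one call can never spin), and right before returning, warn when the lease we last considered is still sitting in the set, i.e. it could neither be recovered nor removed and will linger until the next check. The names `tryRelease` and `lastWarning` are stand-ins for illustration.

```java
import java.util.TreeSet;

// Hypothetical sketch of a lingering-lease warning in a
// checkLeases()-style method; not the actual HDFS patch.
public class LingeringLeaseWarning {
    static String lastWarning = null;   // stand-in for a LOG.warn call

    // Stand-in for FSNamesystem#internalReleaseLease; false models the
    // COMMITTED-penultimate / COMPLETE-last case that cannot be released.
    static boolean tryRelease(String lease) {
        return false;
    }

    static boolean checkLeases(TreeSet<String> sortedLeases) {
        boolean needSync = false;
        String lastConsidered = null;
        // Iterating a snapshot guarantees each lease is visited at most
        // once per call, so the call itself can never spin forever.
        for (String lease : new TreeSet<>(sortedLeases)) {
            lastConsidered = lease;
            if (tryRelease(lease)) {
                sortedLeases.remove(lease);
                needSync = true;
            }
        }
        // Right before `return needSync`: if the lease we last considered
        // survived the pass, it is lingering, so make that visible.
        if (lastConsidered != null && sortedLeases.contains(lastConsidered)) {
            lastWarning = "Lease " + lastConsidered
                + " was not released; it will linger until the next check";
        }
        return needSync;
    }

    public static void main(String[] args) {
        TreeSet<String> leases = new TreeSet<>();
        leases.add("DFSClient_NONMAPREDUCE_-1252656407_1");
        boolean needSync = checkLeases(leases);
        System.out.println("needSync=" + needSync + " warned=" + (lastWarning != null));
    }
}
```

The key property is that silence is no longer possible: either the lease is released, or a warning records that it lingered.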
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14216573#comment-14216573 ]

Ravi Prakash commented on HDFS-4882:
------------------------------------

Hi Yongjun! Thanks for your detailed analysis. I am trying to create a unit test in which the penultimate block would be COMMITTED and the last block COMPLETE; I intend to upload that patch to HDFS-7342. I think we should eliminate the possibility of the infinite loop in LeaseManager regardless. Which JIRAs we use is now irrelevant, since this will likely not make 2.6.0.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14216738#comment-14216738 ]

Yongjun Zhang commented on HDFS-4882:
-------------------------------------

Hi Ravi,

I think it's ok to separate this into two issues. The latest patch for HDFS-4882 does resolve the infinite-loop one (except that the lease may still be hanging around). When you work on HDFS-7342, if you create a test case where the penultimate block is COMMITTED and the last block COMPLETE, I would encourage you to also create cases for the other state combinations listed in the table in my last comment. Appreciate it very much!

Hi [~cmccabe], thanks for reviewing Ravi's patch here; I think it's in good shape to solve the infinite loop. Would you please help look at it and commit it? Other folks are welcome to comment. Thanks.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14216811#comment-14216811 ]

Ravi Prakash commented on HDFS-4882:
------------------------------------

Thanks Yongjun! That's a great suggestion. I'll do that.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14217161#comment-14217161 ]

Ravi Prakash commented on HDFS-4882:
------------------------------------

I have uploaded a unit test to HDFS-7342.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14217198#comment-14217198 ]

Yongjun Zhang commented on HDFS-4882:
-------------------------------------

Many thanks Ravi! I will try it out.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14217289#comment-14217289 ]

Yongjun Zhang commented on HDFS-4882:
-------------------------------------

Hi Ravi,

Great work, I can see your test reproduced the issue! While trying to see whether my proposed fix at https://issues.apache.org/jira/browse/HDFS-4882?focusedCommentId=14215703page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14215703 solves the problem, I found that

{code}
finalizeINodeFileUnderConstruction(src, pendingFile, iip.getLatestSnapshotId());
{code}

does a precondition check that all blocks should be complete. I wonder whether this branch of the code has ever been exercised. Will poke a bit more.
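The precondition Yongjun mentions is the same one behind the IllegalStateException in Vinayakumar's stack trace earlier in the thread. The sketch below is a simplified stand-alone model, not the HDFS source: finalizing a file whose penultimate block is still COMMITTED trips the all-blocks-complete check before the file can be closed.

```java
// Simplified model (not the HDFS source) of the assertAllBlocksComplete /
// toCompleteFile precondition: every block must be COMPLETE before the
// under-construction file can be finalized.
public class FinalizePreconditionModel {
    enum BlockUCState { COMMITTED, COMPLETE }

    static void assertAllBlocksComplete(BlockUCState[] blocks) {
        for (int i = 0; i < blocks.length; i++) {
            if (blocks[i] != BlockUCState.COMPLETE) {
                // Mirrors the Preconditions.checkState failure in the
                // stack trace quoted earlier in this thread.
                throw new IllegalStateException(
                    "Failed to finalize INodeFile since blocks[" + i + "] is non-complete");
            }
        }
    }

    public static void main(String[] args) {
        // Penultimate block COMMITTED, last block COMPLETE: the case
        // discussed throughout this jira.
        BlockUCState[] blocks = { BlockUCState.COMMITTED, BlockUCState.COMPLETE };
        try {
            assertAllBlocksComplete(blocks);
            System.out.println("finalized");
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```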
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14214398#comment-14214398 ]

Colin Patrick McCabe commented on HDFS-4882:
--------------------------------------------

[~yzhangal]: backing up a little bit, the overall problem here seems to be that we are unwilling to recover an overdue lease when the replication is too low. Right? And then we get into this infinite loop, because {{checkLeases}} assumes that all expired leases will be recovered rather than lingering around.

Is there any reason we can't simply recover the lease anyway, even though the minimum replication has not been met? There are a lot of cases where we just can't get to the minimum replication (e.g. a 1-node cluster). I don't see a lot of value in letting these leases linger forever. Our lease expiry period is REALLY long, so if we can't replicate in that period, maybe it's time to throw in the towel. Am I missing something here? What do you guys think?
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214879#comment-14214879 ] Yongjun Zhang commented on HDFS-4882:

Hi [~cmccabe],

What I speculated was: if the last block does not have minimal replication (usually 1), and the lease is recovered, and then someone else tries to append to the same file, where does the appended data go?

As I commented earlier, I agree that if we keep the lease and the minimal replication is never reached, this is still an infinite loop of sorts: checkLeases runs every 2 seconds, and in each run the same lease cannot be recovered because minimal replication is not reached. But that loop is different from the infinite loop we are dealing with in this jira, which is a real infinite loop within a single checkLeases call, one that would hold FSNamesystem.writeLock() indefinitely and thus cause much bigger trouble (the former is not a real infinite loop in this sense).

As for the case where minimal replication can never be reached even when it is 1: that means data loss. Is this common? Thanks.
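The distinction drawn in the comment above can be made concrete with a toy sketch (hypothetical names and structure, not the real HDFS code): the benign loop reacquires the lock on each ~2-second monitor wakeup, while the bug in this jira never leaves a single checkLeases() call.

```java
import java.util.concurrent.locks.ReentrantLock;

// Toy contrast between the two loops discussed above (hypothetical names,
// not the real HDFS code).
public class LoopContrast {
  /** Models the monitor thread: one bounded pass per wakeup. Returns how
   *  many times the write lock was acquired across `wakeups` cycles. */
  public static int softRetry(int wakeups) {
    ReentrantLock writeLock = new ReentrantLock();
    int acquisitions = 0;
    for (int i = 0; i < wakeups; i++) {   // the monitor wakes every ~2s
      writeLock.lock();
      try {
        acquisitions++;
        // internalReleaseLease() fails here; the pass still returns, so the
        // lease simply stays around for the next wakeup.
      } finally {
        writeLock.unlock();               // other namenode RPCs can run now
      }
    }
    return acquisitions;
  }

  /** Models the bug: one call that keeps re-examining the same lease.
   *  Bounded here only so the sketch terminates; the real bug is unbounded. */
  public static boolean hardLoopMakesProgress(int bound) {
    boolean leaseRemoved = false;
    int spins = 0;
    while (!leaseRemoved && spins < bound) {
      spins++;                            // same first lease, every time
    }
    return leaseRemoved;                  // false: only the bound stopped us
  }
}
```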
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214952#comment-14214952 ] Yongjun Zhang commented on HDFS-4882:

Hi Ravi,

Just saw your comment:
{quote}
In scenario #2 (and in fact in every scenario I traced), shouldn't there be the warning message logged? I did NOT see this message
{quote}
I described scenario #2 in my earlier comment for completeness. In the infinite loop case reported here and the one I looked at, the root cause is the one described in my last comment. Hope that makes sense to you. Thanks.
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215071#comment-14215071 ] Ravi Prakash commented on HDFS-4882:

Aah! I see now. Because assertions are not enabled, I don't see the AssertionError. That makes sense. The log excerpt is here: https://issues.apache.org/jira/browse/HDFS-7342?focusedCommentId=14209022&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14209022
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215089#comment-14215089 ] Jing Zhao commented on HDFS-4882:

bq. Is there any reason we can't simply recover the lease anyway, even though the minimal replication has not been met? There are a lot of cases where we just can't get to the minimum replication (i.e. 1-node cluster, etc.). I don't see a lot of value in letting these leases linger forever.

I agree with [~cmccabe] here. I do not think we should block anything just because the penultimate block cannot reach the minimum replication. To me this is similar to the scenario where one block of the file is missing/corrupted. As for the append case: since we do not append data directly to the penultimate block, I think it should be ok.
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215345#comment-14215345 ] Yongjun Zhang commented on HDFS-4882:

Hi Jing,

Thanks for your comments. There are multiple issues here:
1. The bug in the loop of the checkLeases method: it doesn't move on to the next entry.
2. The case where the penultimate block is COMMITTED and the last block is COMPLETE.
3. The case where the penultimate block is COMPLETE and the last block is COMMITTED.

Observations about these issues:

Issue 1 is addressed by the patch uploaded so far.

Issue 3 is handled by the existing code by throwing an exception (scenario #2 in my earlier comments), after which the LeaseManager goes ahead and recovers the lease right away, even though the last block may not have a minimal replica. I raised a concern about this earlier: what happens if the minimal replication is 1, meaning the last block has no replica at all, and another client acquires the lease and tries to APPEND to the file? I mean, where (datanode, replica position) should the client write to?

For issue 2, we can move on and let the penultimate block finish by itself; if it finishes meeting the minimal replication requirement, that's nice, and if not, it is considered a corrupted block, as Colin/Jing suggested.

So it boils down to issue 3. Did I misunderstand it? Is this a real issue? Thanks.
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215445#comment-14215445 ] Yongjun Zhang commented on HDFS-4882:

Hi Ravi,

I looked at the log you posted at https://issues.apache.org/jira/browse/HDFS-7342?focusedCommentId=14209022&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14209022; the entries appear to be very much the same as what I observed in the case I was looking at:
{code}
2014-10-28 02:28:17,568 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: [Lease. Holder: DFSClient_attempt_X_Y_r_T_U_V_W, pendingcreates: 1] has expired hard limit
2014-10-28 02:28:17,569 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease. Holder: DFSClient_attempt_X_Y_r_T_U_V_W, pendingcreates: 1], src=FILENAME
2014-10-28 02:28:17,569 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: [Lease. Holder: DFSClient_attempt_X_Y_r_T_U_V_W, pendingcreates: 1] has expired hard limit
2014-10-28 02:28:17,569 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease. Holder: DFSClient_attempt_X_Y_r_T_U_V_W, pendingcreates: 1], src=FILENAME
..
{code}

Hi [~jingzhao],
{quote}
I do not think we should block anything just because the penultimate block cannot reach the minimum replication.
{quote}
I agree with the quoted statement for the penultimate block, because no one is going to write to the penultimate block. However, when I studied the implementation, I had the question raised as issue 3 in my last comment, about the last block. Basically, if the last block doesn't have a replica reported from a DN yet, and the lease is recovered immediately, how do we handle a subsequent APPEND write from another client to this file? It seems to me that we should at least wait some iterations for more block reports before recovering the lease, instead of recovering the lease immediately.
I'd appreciate it if you could comment on that. Thanks.
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215486#comment-14215486 ] Jing Zhao commented on HDFS-4882:

For scenario 3, I think we can keep the current behavior when handling append. Our current default replace-datanode-on-failure policy already considers data durability while appending. Also, the LeaseManager in the NN waits for the lease to expire before triggering the recovery.
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215540#comment-14215540 ] Yongjun Zhang commented on HDFS-4882:

Thanks Jing. Sure, we can keep the old behaviour for scenario #3. However, I am just not sure where the data will be written when a new client tries to append to a file whose last block doesn't have even one replica. That seems like a bug to me, though this is just my speculation; we could address it in a different jira.

That said, for the current patch, we still need a fix here:
{code}
switch (lastBlockState) {
case COMPLETE:
  assert false : "Already checked that the last block is incomplete";
  break;
{code}
The above code does {{assert false : "Already checked that the last block is incomplete"}}, which is wrong. It should not assert false, because this is a valid case when the penultimate block is COMMITTED and the last block is COMPLETE. Right? Thanks.
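The point about the {{assert false}} can be sketched as follows — a hypothetical, heavily simplified stand-in for the tail-block switch in FSNamesystem#internalReleaseLease (names and structure are illustrative, not the committed patch), in which COMPLETE simply shares the COMMITTED handling instead of asserting:

```java
// Hypothetical stand-in for the tail-block switch in
// FSNamesystem#internalReleaseLease; COMPLETE falls through to the
// COMMITTED handling because a COMMITTED penultimate block with a
// COMPLETE last block is a reachable state.
public class LastBlockHandling {
  public enum BlockUCState { COMPLETE, COMMITTED, UNDER_CONSTRUCTION }

  /** Returns true if the file can be closed (and the lease released) now. */
  public static boolean canCloseNow(BlockUCState lastBlockState,
                                    boolean committedBlocksMinReplicated) {
    switch (lastBlockState) {
      case COMPLETE:    // previously: assert false : "Already checked ...";
      case COMMITTED:
        // Close only once committed blocks are minimally replicated;
        // otherwise the caller retries on a later checkLeases pass.
        return committedBlocksMinReplicated;
      default:
        // UNDER_CONSTRUCTION etc.: block recovery is started instead.
        return false;
    }
  }
}
```

Note that with assertions disabled (the JVM default), the original `assert false` is a no-op and the `break` silently skips the lease release, which is what makes the state linger.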
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215703#comment-14215703 ] Yongjun Zhang commented on HDFS-4882: - HI Jing, I gave some more thoughts on the block of code I quoted in my last comment, I think some additional change is still needed to ensure the lease is recovered for the case when penultimate block is COMMITTED and last block is COMPLETE (call it caseX below), in addition to the {{assert false}} I talked about in my last comment. Commented below. Hi Ravi, I did a review of your latest patch (v4), and have the following comments. Most important one is, even though the infinite loop is removed by the patch, the lease for caseX is still not released, if the penultimate block stays in COMMITTED state. We should release the lease here. * The change in LeaseManager looks good to me. There is an option not to make this change it if we do below. * For caseX, to make sure the lease is released, we need to do something like below {code} boolean penultimateBlockMinReplication = penultimateBlock == null ? true : blockManager.checkMinReplication(penultimateBlock); BlockUCState penultimateBlockState = penultimateBlock == null ? BlockUCState.COMPLETE: penultimateBlock.getBlockUCState(); String blockStateStr; .. case COMPLETE: case COMMITTED: blockStateStr = + (penultimateBlockState= + penultimateBlockState + lastBlockState=+lastBlockState + ); // Close file if committed blocks are minimally replicated if(penultimateBlockMinReplication blockManager.checkMinReplication(lastBlock)) { finalizeINodeFileUnderConstruction(src, pendingFile, iip.getLatestSnapshotId()); NameNode.stateChangeLog.warn(BLOCK* + internalReleaseLease: Committed blocks are minimally replicated, + lease removed, file closed + blockStateStr); return true; // closed! } // Cannot close file right now, since some blocks // are not yet minimally replicated. 
// This may potentially cause infinite loop in lease recovery // if there are no valid replicas on data-nodes. String message = DIR* NameSystem.internalReleaseLease: + Failed to release lease for file + src + . Committed blocks are waiting to be minimally replicated + blockStateStr + , Try again later.; NameNode.stateChangeLog.warn(message); throw new AlreadyBeingCreatedException(message); {code} Basically I suggest to let the two cases to share the same code, and included both block's state in the message to distinguish, this handles the different scenarios like below. || || BlockState || BlockState || BlockState ||BlockState || BlockState || BlockState || |penultimateBlock | COMPLETE |COMMITTED|COMMITTED|COMPLETE |COMMITTED|COMMITTED| |lastBlock|COMMITTED|COMMITTED | COMPLETE|COMMITTED|COMMITTED | COMPLETE| |minReplicaSatisfied|Yes|Yes|Yes|No|No|No| | Solution|CloseFile+ReleaseLease|CloseFile+ReleaseLease|CloseFile+ReleaseLease|ReleaseLease|ReleaseLease|ReleaseLease| Do you and other folks think my proposal makes sense? Thanks a lot. Namenode LeaseManager checkLeases() runs into infinite loop --- Key: HDFS-4882 URL: https://issues.apache.org/jira/browse/HDFS-4882 Project: Hadoop HDFS Issue Type: Bug Components: hdfs-client, namenode Affects Versions: 2.0.0-alpha, 2.5.1 Reporter: Zesheng Wu Assignee: Ravi Prakash Priority: Critical Attachments: 4882.1.patch, 4882.patch, 4882.patch, HDFS-4882.1.patch, HDFS-4882.2.patch, HDFS-4882.3.patch, HDFS-4882.4.patch, HDFS-4882.patch Scenario: 1. cluster with 4 DNs 2. the size of the file to be written is a little more than one block 3. write the first block to 3 DNs, DN1-DN2-DN3 4. all the data packets of first block is successfully acked and the client sets the pipeline stage to PIPELINE_CLOSE, but the last packet isn't sent out 5. DN2 and DN3 are down 6. client recovers the pipeline, but no new DN is added to the pipeline because of the current pipeline stage is PIPELINE_CLOSE 7. 
client continuously writes the last block, and try to close the file after written all the data 8. NN finds that the penultimate block doesn't has enough replica(our dfs.namenode.replication.min=2), and the client's close runs into indefinite loop(HDFS-2936), and at the same time, NN makes the last block's state to COMPLETE 9. shutdown the client 10. the file's lease exceeds hard limit 11. LeaseManager realizes that and begin to do lease recovery by call fsnamesystem.internalReleaseLease() 12. but the last block's state is COMPLETE, and this triggers lease manager's infinite loop and prints
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213992#comment-14213992 ] Yongjun Zhang commented on HDFS-4882: - Hi, Thanks Zesheng for reporting the issue, Ravi for working on the solution, and other folks for reviewing. I was looking into an infinite loop case when doing checkLeases myself, and figured out that the logic in FSNamesystem#internalReleaseLease
{code}
switch(lastBlockState) {
case COMPLETE:
  assert false : "Already checked that the last block is incomplete";
  break;
{code}
doesn't take care of the case where the penultimate block is COMMITTED and the final block is COMPLETE, which causes the infinite loop. Looking at the history of this jira, I found [~jingzhao] suggested the same at https://issues.apache.org/jira/browse/HDFS-4882?focusedCommentId=14207202page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14207202 I did some analysis to share here (sorry for the long post). When the final block is COMMITTED, the current implementation does the following:
{code}
case COMMITTED:
  // Close file if committed blocks are minimally replicated    <=== scenario#1
  if (penultimateBlockMinReplication &&
      blockManager.checkMinReplication(lastBlock)) {
    finalizeINodeFileUnderConstruction(src, pendingFile,
        iip.getLatestSnapshotId());
    NameNode.stateChangeLog.warn("BLOCK*"
        + " internalReleaseLease: Committed blocks are minimally replicated,"
        + " lease removed, file closed.");
    return true;  // closed!
  }
  // Cannot close file right now, since some blocks             <=== scenario#2
  // are not yet minimally replicated.
  // This may potentially cause infinite loop in lease recovery
  // if there are no valid replicas on data-nodes.
  String message = "DIR* NameSystem.internalReleaseLease: "
      + "Failed to release lease for file " + src
      + ". Committed blocks are waiting to be minimally replicated."
      + " Try again later.";
  NameNode.stateChangeLog.warn(message);
  throw new AlreadyBeingCreatedException(message);
{code}
What it does: * For scenario#1, check minReplication for both the penultimate and last block; if satisfied, finalize the file (recover lease, close file). * For scenario#2, throw AlreadyBeingCreatedException, derived from IOException (the name of this exception appears to be a misnomer; maybe we should fix that later). To solve the case where the penultimate block is COMMITTED and the final block is COMPLETE, I'd suggest making some changes on top of the submitted patch (for further discussion): For scenario#1, we can do the same as when the last block is COMMITTED, as described above. For scenario#2, I think we have two options: # option A, drop the code in the existing code that handles scenario#2 (not throwing the exception), and let checkLeases check back again (2 seconds is the current interval), waiting for a block report to finish to change the minimal-replication situation, then recover the lease. The infinite loop could still happen if minimal replication never gets satisfied. But this would be rare, assuming minimal replication can be satisfied eventually. # option B, do similar logic as in the existing code (throwing AlreadyBeingCreatedException). There is an issue with this option too, described below. With option B, looking at the caller side (LeaseManager#checkLeases), whenever an IOException is caught, it just goes ahead and removes the lease. So the possible infinite loop described in the scenario#2 comment will not happen, because of the lease removal (lease recovered). But the problem with option B is that after the lease removal, the file may still have blocks not satisfying minimal replication (scenario#2), which would be a potential issue. This situation exists in the current implementation when handling the case where the last block is COMMITTED. I think we should wait for minimal replication to be satisfied before recovering the lease.
So it looks like option A is preferable. But the original code tries to recover the lease immediately; I'm not sure whether there is any catch here. Comments, thoughts? Thanks again.
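Option B's caller-side consequence can be modeled in isolation. The sketch below is a simplified stand-in for the checkLeases loop, not the real LeaseManager code (the {{Releaser}} interface and the {{TreeSet}} of expired leases are hypothetical), showing why a thrown IOException ends the loop at the cost of dropping the lease while blocks may still be under-replicated:

```java
import java.io.IOException;
import java.util.TreeSet;

public class CheckLeasesSketch {
  // Simplified model: internalRelease either recovers the lease (true),
  // keeps it pending (false), or throws. On a throw the caller removes
  // the lease anyway, which is why option B avoids the infinite loop but
  // may leave under-replicated blocks behind.
  interface Releaser { boolean internalRelease(long lease) throws IOException; }

  static int checkLeases(TreeSet<Long> expired, Releaser r) {
    int removed = 0;
    Long lease = expired.isEmpty() ? null : expired.first();
    while (lease != null) {
      try {
        if (r.internalRelease(lease)) {
          expired.remove(lease);
          removed++;
        }
      } catch (IOException e) {
        expired.remove(lease); // option B: lease dropped despite the failure
        removed++;
      }
      lease = expired.higher(lease); // advance without re-visiting
    }
    return removed;
  }

  public static void main(String[] args) {
    TreeSet<Long> leases = new TreeSet<>();
    leases.add(1L); leases.add(2L);
    int n = checkLeases(leases, l -> { throw new IOException("not replicated"); });
    System.out.println(n); // 2 -- both removed even though release failed
  }
}
```

With a releaser that simply returns false (option A's shape), the same loop removes nothing and just retries on the next check interval.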
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213999#comment-14213999 ] Yongjun Zhang commented on HDFS-4882: - BTW, the scenario reported in the jira description is just one special case. The infinite loop could happen in other scenarios too, whenever the penultimate block is COMMITTED and the final block is COMPLETE. That is, the LeaseManager#checkLeases method is holding FSNamesystem.writeLock() all the time because of the infinite loop, so all the threads in NN that process block reports will be blocked waiting for FSNamesystem.writeLock(). So even if a block report is ready to satisfy the minimal replication, it won't be processed.
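The write-lock starvation described above is easy to demonstrate with {{ReentrantReadWriteLock}} standing in for the FSN lock (block-report handling actually takes the write lock too, but a blocked reader shows the effect just as well; all names here are illustrative):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class WriteLockStarvationDemo {
  // Returns true if a second thread could NOT acquire the read lock while
  // the first thread held the write lock -- mirroring block-report threads
  // stuck behind a checkLeases loop that never releases the FSN lock.
  static boolean readerBlockedWhileWriterHolds() {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    lock.writeLock().lock();                   // "checkLeases" spinning forever
    final boolean[] acquired = {false};
    Thread reporter = new Thread(() -> {
      acquired[0] = lock.readLock().tryLock(); // fails: a writer holds the lock
      if (acquired[0]) {
        lock.readLock().unlock();
      }
    });
    reporter.start();
    try {
      reporter.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    lock.writeLock().unlock();
    return !acquired[0];
  }

  public static void main(String[] args) {
    System.out.println(readerBlockedWhileWriterHolds()); // true
  }
}
```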
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213109#comment-14213109 ] Ravi Prakash commented on HDFS-4882: These test errors are valid. They are happening because pollFirst() retrieves *and removes* the first element. Sorry for the oversight. Will upload a new patch soon.
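The {{pollFirst()}} pitfall is reproducible with a plain {{TreeSet}} (used here as a stand-in for {{sortedLeases}}):

```java
import java.util.TreeSet;

public class PollFirstDemo {
  // pollFirst() retrieves AND removes the head element, so using it merely
  // to peek at the next lease silently shrinks the set.
  static int sizeAfterPollFirst() {
    TreeSet<Integer> leases = new TreeSet<>();
    leases.add(1); leases.add(2); leases.add(3);
    leases.pollFirst();              // returns 1, but also removes it
    return leases.size();            // 2, not 3
  }

  // first() only peeks; the set is left untouched.
  static int sizeAfterFirst() {
    TreeSet<Integer> leases = new TreeSet<>();
    leases.add(1); leases.add(2); leases.add(3);
    leases.first();                  // returns 1, set unchanged
    return leases.size();            // still 3
  }

  public static void main(String[] args) {
    System.out.println(sizeAfterPollFirst()); // 2
    System.out.println(sizeAfterFirst());     // 3
  }
}
```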
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213324#comment-14213324 ] Hadoop QA commented on HDFS-4882: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681664/HDFS-4882.4.patch against trunk revision 4fb96db. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8745//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8745//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8745//console This message is automatically generated. 
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14211550#comment-14211550 ] Colin Patrick McCabe commented on HDFS-4882: The patch looks good. bq. Here's a patch which uses SortedSet.tailSet. However I still like the earlier patch more (because its a genuine case of two threads accessing the same data-structure). With tailSet we are just trying to build our own synchronization mechanism (which is likely more inefficient than the ConcurrentinternalReleaseLease). I have to admit that I find the locking to be confusing in {{LeaseManager}}. It seems like some methods which access the sets are synchronized, and some are not. And we are exposing the sets to outside code, which accesses them without synchronization. It's very inconsistent, and we should file a follow-up JIRA to look into this. I don't think there is necessarily a bug there, though... I think it's possible that the access is actually gated on the FSN lock. If that's true, we should document it and remove the synchronized blocks in {{LeaseManager}} (since in that case they would not be doing anything useful). Concurrent sets are not a magic bullet, since we still may have to deal with issues like Lease objects being deactivated while another caller holds on to them, etc. If we dig too far into locking, this patch will get a lot bigger and a lot messier. I think that we should keep this patch small and focused on the problem. One issue in your patch is that you are requiring the FSN lock to be held in {{LeaseManager#getNumUnderConstructionBlocks}}, but you don't document that anywhere. Can you please add that to a JavaDoc or other comment for this method?
{code}
+Lease leaseToCheck = null;
+try {
+  leaseToCheck = sortedLeases.first();
+} catch(NoSuchElementException e) {}
{code}
You can replace this with {{leaseToCheck = sortedLeases.pollFirst();}}
{code}
+      try {
+        leaseToCheck = sortedLeases.tailSet(leaseToCheck, false).first();
+      } catch(NoSuchElementException e) {
+        leaseToCheck = null;
+      }
{code}
You don't need this. Just use {{leaseToCheck = sortedLeases.higher(leaseToCheck)}}. bq. I'd also request for this to make it into 2.6.0 because of this issue's severity. I'm fine with the second version going into 2.6. [~acmurthy], others, what do you think?
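Both suggestions rely on navigating the sorted set without mutating it: {{higher(x)}} returns the least element strictly greater than x, and is equivalent to {{tailSet(x, false).first()}} when such an element exists. A small {{TreeSet}} sketch (identifiers illustrative, standing in for {{sortedLeases}}):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class HigherIterationDemo {
  // Walk the sorted set from smallest to largest using higher(), without
  // removing anything -- the shape of the suggested checkLeases loop.
  static List<Integer> walk(TreeSet<Integer> set) {
    List<Integer> visited = new ArrayList<>();
    Integer current = set.isEmpty() ? null : set.first();
    while (current != null) {
      visited.add(current);
      current = set.higher(current); // strictly-greater element, or null
    }
    return visited;
  }

  public static void main(String[] args) {
    TreeSet<Integer> leases = new TreeSet<>(List.of(30, 10, 20));
    System.out.println(walk(leases));  // [10, 20, 30]
    System.out.println(leases.size()); // 3 -- set left intact
    // tailSet(x, false).first() matches higher(x) when a successor exists:
    System.out.println(leases.tailSet(10, false).first().equals(leases.higher(10))); // true
  }
}
```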
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14211932#comment-14211932 ] Hadoop QA commented on HDFS-4882: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681481/HDFS-4882.3.patch against trunk revision d005404. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.TestCheckpoint org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA org.apache.hadoop.hdfs.TestFileCreation {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8735//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8735//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8735//console This message is automatically generated. 
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208478#comment-14208478 ] Ravi Prakash commented on HDFS-4882: s/ConcurrentinternalReleaseLease/ConcurrentSkipList/
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208622#comment-14208622 ] Hadoop QA commented on HDFS-4882: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681116/HDFS-4882.2.patch against trunk revision be7bf95. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerDynamicBehavior org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8723//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8723//console This message is automatically generated. 
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208657#comment-14208657 ] Ravi Prakash commented on HDFS-4882: These unit test failures are spurious and unrelated to the code changes in the patch.
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206930#comment-14206930 ] Ravi Prakash commented on HDFS-4882: I'll try to refactor the code to use SortedSet anyway.
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207202#comment-14207202 ] Jing Zhao commented on HDFS-4882: bq. In this case, for a very strange reason which I have as yet not been able to uncover, the FSNamesystem wasn't able to recover the lease.
According to [~wuzesheng]'s earlier analysis, this may happen when the penultimate block is in COMMITTED state while the last block is in COMPLETE state. The following code in FSNamesystem#internalReleaseLease may have an issue:
{code}
switch(lastBlockState) {
case COMPLETE:
  assert false : "Already checked that the last block is incomplete";
  break;
{code}
Since it is possible for the penultimate block to be in COMMITTED state while the last block is in COMPLETE state, the assertion above can be wrong. Instead, I think we can combine the COMPLETE and COMMITTED cases in the switch expression, so that the current replication factor of both the last and the penultimate blocks is checked.
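Jing's suggested combined-case handling can be sketched as follows. This is an illustrative model only, not the actual FSNamesystem code: BlockState, canCloseFile, and the replica-count parameters are hypothetical stand-ins for the real internalReleaseLease logic.

```java
// Hypothetical model of the combined COMMITTED/COMPLETE check: instead of
// asserting the last block is incomplete (which fails when the penultimate
// block is still COMMITTED), verify minimum replication of both the last
// and penultimate blocks before closing the file.
public class LeaseRecoveryCheck {
    enum BlockState { UNDER_CONSTRUCTION, COMMITTED, COMPLETE }

    /** Returns true when the file can simply be closed instead of looping. */
    static boolean canCloseFile(BlockState last,
                                int penultimateReplicas, int lastReplicas,
                                int minReplication) {
        switch (last) {
            case COMPLETE:
            case COMMITTED:
                // Combined case: check replication of both trailing blocks.
                return penultimateReplicas >= minReplication
                    && lastReplicas >= minReplication;
            default:
                return false; // still under construction: start block recovery
        }
    }

    public static void main(String[] args) {
        // Penultimate COMMITTED but fully replicated, last COMPLETE: closable.
        System.out.println(canCloseFile(BlockState.COMPLETE, 2, 2, 2));
        // Penultimate block under-replicated (the HDFS-4882 scenario,
        // dfs.namenode.replication.min=2): not closable yet.
        System.out.println(canCloseFile(BlockState.COMPLETE, 1, 2, 2));
    }
}
```

The point of the sketch is that neither trailing-block state combination ever reaches an unconditional `assert false`, so lease recovery cannot be wedged by a COMMITTED-then-COMPLETE pair.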
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207393#comment-14207393 ] Ravi Prakash commented on HDFS-4882: Thanks Jing! I was trying to create a unit test to replicate this failure. Please feel free to upload a patch on HDFS-7342 if you can.
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14205138#comment-14205138 ] Colin Patrick McCabe commented on HDFS-4882: If I understand this correctly, the issue here is that there is an expired lease which is not getting removed from {{sortedLeases}}, causing {{LeaseManager#checkLeases}} to loop forever. Maybe this is a dumb question, but shouldn't we fix the code so that this expired lease does get removed? Is this patch really fixing the root issue? It seems like if we let expired leases stick around forever we may have other problems. Also, this patch seems to replace a {{SortedSet}} with a {{ConcurrentSkipListSet}}. We don't need the overhead of a concurrent set here... the set is only modified while holding the lock. If you want to modify while iterating, you can simply use an {{Iterator}} for this purpose. Or, since the set is sorted, you can use {{SortedSet#tailSet}} to find the element after the previous element you were looking at.
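The two alternatives Colin mentions can be sketched as below. The Lease entries are stand-in integers, not the real LeaseManager types; only the collection idioms themselves are the point.

```java
import java.util.Iterator;
import java.util.SortedSet;
import java.util.TreeSet;

// Sketch of mutating a sorted set of leases while scanning it, without a
// concurrent collection.
public class SortedSetScan {
    public static void main(String[] args) {
        TreeSet<Integer> leases = new TreeSet<>();
        for (int i = 0; i < 10; i++) leases.add(i);

        // Option 1: Iterator#remove is the only safe way to delete from a
        // TreeSet mid-iteration (removing via the set itself would throw
        // ConcurrentModificationException on the next hasNext()/next()).
        for (Iterator<Integer> it = leases.iterator(); it.hasNext(); ) {
            if (it.next() < 3) it.remove();   // drop "expired" leases 0..2
        }

        // Option 2: since the set is sorted, tailSet(prev, false) views the
        // elements strictly after the last one examined, so a scan can be
        // resumed after arbitrary modifications made between steps.
        Integer prev = leases.first();
        SortedSet<Integer> rest = leases.tailSet(prev, false);
        System.out.println(leases.first());      // 3
        System.out.println(rest.first());        // 4
    }
}
```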
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14205669#comment-14205669 ] Ravi Prakash commented on HDFS-4882: Thanks for your review Colin! Your understanding is correct. In this case, for a very strange reason which I have as yet not been able to uncover, the FSNamesystem wasn't able to recover the lease. I am investigating this root issue in HDFS-7342. In the meantime, however, I'd argue that the Namenode should never enter an infinite loop for any reason: instead of assuming we have fixed all possible reasons why a lease couldn't be recovered, we should relinquish the lock regularly. We should also display on the webUI how many files are open for writing and allow ops to forcibly close open files (HDFS-7307). The way in which this error manifests (the NN suddenly stops working) is egregious. sortedLeases is also used externally in FSNamesystem.getCompleteBlocksTotal(), and we were actively modifying it in checkLeases. I'm sure we can move things around to keep using SortedSets, but I don't know if this collection will ever really become big enough for the performance difference to matter. What do you think?
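The defensive shape being argued for (guarantee forward progress even when a lease survives a recovery attempt, and warn instead of spinning) might be sketched like this. It is an illustration of the pattern only, not the committed patch; checkLeases, tryRecover, and the String-keyed set are hypothetical simplifications of the real LeaseManager.

```java
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.function.Predicate;

// Sketch of a checkLeases-style loop that cannot spin forever: if the
// oldest lease is the same one we just tried to recover, log a warning
// and bail out rather than re-examining it indefinitely.
public class CheckLeasesSketch {
    /** Returns the number of leases examined before the loop finished. */
    static int checkLeases(SortedSet<String> sortedLeases,
                           Predicate<String> tryRecover) {
        int examined = 0;
        String previous = null;
        while (!sortedLeases.isEmpty()) {
            String oldest = sortedLeases.first();
            if (oldest.equals(previous)) {
                // Recovery did not remove the lease; warn and stop instead
                // of looping on the same entry.
                System.err.println("WARN: unable to release lease " + oldest);
                break;
            }
            previous = oldest;
            examined++;
            if (tryRecover.test(oldest)) {
                sortedLeases.remove(oldest);  // normal path: lease released
            }
        }
        return examined;
    }

    public static void main(String[] args) {
        SortedSet<String> leases =
            new TreeSet<>(java.util.Arrays.asList("a", "b", "c"));
        // "b" is a stuck lease (recovery never succeeds); the loop must
        // still terminate.
        int n = checkLeases(leases, l -> !l.equals("b"));
        System.out.println(n);  // examined "a" and "b", then gave up
    }
}
```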
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195227#comment-14195227 ] Hadoop QA commented on HDFS-4882:
{color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12678992/HDFS-4882.1.patch against trunk revision 67f13b5.
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8628//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8628//console
This message is automatically generated.
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195675#comment-14195675 ] Ravi Prakash commented on HDFS-4882: Can someone please review and commit this JIRA?
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195678#comment-14195678 ] Ravi Prakash commented on HDFS-4882: I've filed HDFS-7342 to investigate why the lease wasn't recovered. [~wuzesheng] and [~umamaheswararao], were you ever able to consistently produce the situation in which leases weren't recovered? Perhaps as a unit test?
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14191183#comment-14191183 ] Hadoop QA commented on HDFS-4882:
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12678329/HDFS-4882.patch against trunk revision 5e3f428.
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.TestEditLog
The following test timeouts occurred in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.TestDFSClientRetries org.apache.hadoop.hdfs.TestLeaseRecovery2
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8605//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8605//console
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189396#comment-14189396 ] Hadoop QA commented on HDFS-4882: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12586700/4882.patch against trunk revision d33e07d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes The following test timeouts occurred in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.TestDeleteRace {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8589//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8589//console This message is automatically generated. 
Namenode LeaseManager checkLeases() runs into infinite loop
---
Key: HDFS-4882
URL: https://issues.apache.org/jira/browse/HDFS-4882
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs-client, namenode
Affects Versions: 2.0.0-alpha
Reporter: Zesheng Wu
Attachments: 4882.1.patch, 4882.patch, 4882.patch

Scenario:
1. Cluster with 4 DNs.
2. The size of the file to be written is a little more than one block.
3. Write the first block to 3 DNs: DN1-DN2-DN3.
4. All the data packets of the first block are successfully acked and the client sets the pipeline stage to PIPELINE_CLOSE, but the last packet isn't sent out.
5. DN2 and DN3 go down.
6. The client recovers the pipeline, but no new DN is added to the pipeline because the current pipeline stage is PIPELINE_CLOSE.
7. The client continues writing the last block, and tries to close the file after writing all the data.
8. The NN finds that the penultimate block doesn't have enough replicas (our dfs.namenode.replication.min=2), so the client's close() runs into an indefinite loop (HDFS-2936); at the same time, the NN moves the last block's state to COMPLETE.
9. Shut down the client.
10. The file's lease exceeds the hard limit.
11. The LeaseManager notices this and begins lease recovery by calling fsnamesystem.internalReleaseLease().
12. But the last block's state is COMPLETE, which triggers the lease manager's infinite loop and prints massive logs like this:
{noformat}
2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease [Lease. Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1] has expired hard limit
2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease. Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1], src=/user/h_wuzesheng/test.dat
2013-06-05,17:42:25,695 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: File = /user/h_wuzesheng/test.dat, block blk_-7028017402720175688_1202597, lastBLockState=COMPLETE
2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: Started block recovery for file /user/h_wuzesheng/test.dat lease [Lease. Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1]
{noformat}
(the 3rd log line is a debug log added by us)
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
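Editor's note: the fix discussed in this thread makes checkLeases() bail out (with a warning) when internalReleaseLease() fails to remove the expired lease, instead of spinning forever. Below is a minimal, hypothetical sketch of that progress guard; all class and method names here are illustrative stand-ins, not the real LeaseManager API.

```java
import java.util.Queue;

/**
 * Simplified model of the HDFS-4882 situation: checkLeases() must make
 * progress on every iteration, or log and stop, instead of looping forever
 * when lease recovery leaves the expired lease in place.
 */
public class LeaseCheck {

  /** Illustrative lease: "removableByRecovery" models whether recovery succeeds. */
  static class Lease {
    final String holder;
    final boolean removableByRecovery;
    Lease(String holder, boolean removable) {
      this.holder = holder;
      this.removableByRecovery = removable;
    }
  }

  /** Returns the number of leases actually released. */
  static int checkLeases(Queue<Lease> expiredLeases) {
    int released = 0;
    while (!expiredLeases.isEmpty()) {
      Lease oldest = expiredLeases.peek();
      if (internalReleaseLease(oldest)) {
        expiredLeases.remove();
        released++;
      } else {
        // Progress guard: without this branch the loop would spin forever on
        // a lease that recovery cannot remove (the bug reported in this issue).
        System.err.println("WARN: could not release lease for " + oldest.holder
            + "; stopping checkLeases to avoid an infinite loop");
        break;
      }
    }
    return released;
  }

  /** Stand-in for FSNamesystem.internalReleaseLease(). */
  static boolean internalReleaseLease(Lease lease) {
    return lease.removableByRecovery;
  }
}
```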
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13719847#comment-13719847 ] Shrijeet Paliwal commented on HDFS-4882:
----------------------------------------
Just wanted to report that we have been hit by this (or an issue that produces exactly the same symptoms) twice in 3 days.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697743#comment-13697743 ] Uma Maheswara Rao G commented on HDFS-4882:
-------------------------------------------
{quote}
When a DN in the pipeline is down and the pipeline stage is PIPELINE_CLOSE, the client triggers the data replication rather than waiting for the NN to do it (the NN needs the file to be finalized to do the replication, but finalizing needs all the blocks to have at least dfs.namenode.replication.min (=2) replicas; these two conditions contradict each other).
{quote}
What do you mean by 'the client triggers the data replication'? The file will not be finalized until all of its blocks reach at least min replication. But block replication can start once the block is committed on the DN (each DN will finalize its block and report it to the NN). On reaching min replication the NN will complete that block. If all blocks in the NN are in COMPLETE state, then the file can be closed normally.
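Editor's note: the min-replication rule described above (a block completes once enough finalized replicas are reported; a file closes only when every block is COMPLETE) can be sketched as follows. This is an illustrative model only; the names and the int-array representation are assumptions, not the real NameNode code.

```java
/**
 * Illustrative model of the rule: a block becomes COMPLETE once the NameNode
 * has received at least dfs.namenode.replication.min finalized replica
 * reports, and a file can only be closed once every block is COMPLETE.
 */
public class MinReplicationRule {

  static final int MIN_REPLICATION = 2;  // dfs.namenode.replication.min

  /** A committed block completes when enough DNs have reported finalized replicas. */
  static boolean canCompleteBlock(int reportedFinalizedReplicas) {
    return reportedFinalizedReplicas >= MIN_REPLICATION;
  }

  /** close() succeeds only when all blocks of the file are COMPLETE. */
  static boolean canCloseFile(int[] replicasPerBlock) {
    for (int replicas : replicasPerBlock) {
      if (!canCompleteBlock(replicas)) {
        return false;  // e.g. the under-replicated penultimate block in step 8
      }
    }
    return true;
  }
}
```

In the reported scenario, the penultimate block has only one live replica, so canCloseFile() stays false and the client's close() cannot succeed.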
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697784#comment-13697784 ] Zesheng Wu commented on HDFS-4882:
----------------------------------
1. {quote}What you mean by 'the client triggers the data replication'?{quote}
I mean letting the client go through the following replica-transfer process:
{code}
// transfer replica
final DatanodeInfo src = d == 0 ? nodes[1] : nodes[d - 1];
final DatanodeInfo[] targets = {nodes[d]};
transfer(src, targets, lb.getBlockToken());
{code}
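Editor's note: the source-selection logic in the snippet above can be restated self-contained. This sketch uses plain strings instead of DatanodeInfo, purely for illustration: when pipeline slot d needs a replica, copy it from a neighbour (the next node if d is first, otherwise the previous node).

```java
/**
 * Illustrative restatement of the replica-transfer snippet: pick which
 * existing pipeline node should serve as the copy source for nodes[d].
 */
public class ReplicaTransfer {

  /** Pick the source node for transferring a replica to nodes[d]. */
  static String pickTransferSource(String[] nodes, int d) {
    // First node has no predecessor, so borrow from its successor.
    return d == 0 ? nodes[1] : nodes[d - 1];
  }
}
```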
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697787#comment-13697787 ] Zesheng Wu commented on HDFS-4882:
----------------------------------
{quote}
File will not be finalized until it reached at least min replication blocks. But block replication can be started once it is committed to DN (here each DN will finalize its block and report NN). On reaching min replication NN will complete that block. If all Blocks in NN are in complete state then file can be closed normally.
{quote}
Yes, I think you are right. But in our case the block was never committed to the DNs, so its replication never started.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13678330#comment-13678330 ] Hadoop QA commented on HDFS-4882:
---------------------------------
{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12586700/4882.patch
against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4497//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4497//console
This message is automatically generated.
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13676941#comment-13676941 ] Hadoop QA commented on HDFS-4882:
---------------------------------
{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12586483/4882.1.patch
against trunk revision .
{color:red}-1 patch{color}. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4485//console
This message is automatically generated.