[ 
https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219746#comment-14219746
 ] 

Yongjun Zhang commented on HDFS-4882:
-------------------------------------

Hi Ravi,

Thanks for responding quickly!

I looked at the change, a minor suggestion:
{code}
470             if(leaseToCheck != sortedLeases.first()) {
471               LOG.warn("Unable to release the oldest lease: " + 
sortedLeases.first());
472             }
{code}
the lease that can not be released may not be the oldest in the loop. Maybe we 
can change to "Unable to release hard-limit expired lease: " to be more 
accurate. On the other hand, all older leases are released already in this 
checkReleases() run, this un-released lease will be the oldest lease 2 seconds 
later:-) 

Another thought is, in each checkLeases run, there might be multiple leases 
that can not be released, we only WARN one here with your latest change. If we 
replace the above code with a loop to check not only the first, but also the 
second, the third, until it hits leaseToCheck, then we are able to report all 
leases that are not released. Just a thought, not necessarily we have to do it 
in this jira, because having one is already a good indicator, and we do have 
the report every 2 second for all leases we attempt to release. 

Hi Colin: further comments?

Thanks.

 

> Namenode LeaseManager checkLeases() runs into infinite loop
> -----------------------------------------------------------
>
>                 Key: HDFS-4882
>                 URL: https://issues.apache.org/jira/browse/HDFS-4882
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client, namenode
>    Affects Versions: 2.0.0-alpha, 2.5.1
>            Reporter: Zesheng Wu
>            Assignee: Ravi Prakash
>            Priority: Critical
>         Attachments: 4882.1.patch, 4882.patch, 4882.patch, HDFS-4882.1.patch, 
> HDFS-4882.2.patch, HDFS-4882.3.patch, HDFS-4882.4.patch, HDFS-4882.5.patch, 
> HDFS-4882.patch
>
>
> Scenario:
> 1. cluster with 4 DNs
> 2. the size of the file to be written is a little more than one block
> 3. write the first block to 3 DNs, DN1->DN2->DN3
> 4. all the data packets of first block is successfully acked and the client 
> sets the pipeline stage to PIPELINE_CLOSE, but the last packet isn't sent out
> 5. DN2 and DN3 are down
> 6. client recovers the pipeline, but no new DN is added to the pipeline 
> because of the current pipeline stage is PIPELINE_CLOSE
> 7. client continuously writes the last block, and try to close the file after 
> written all the data
> 8. NN finds that the penultimate block doesn't has enough replica(our 
> dfs.namenode.replication.min=2), and the client's close runs into indefinite 
> loop(HDFS-2936), and at the same time, NN makes the last block's state to 
> COMPLETE
> 9. shutdown the client
> 10. the file's lease exceeds hard limit
> 11. LeaseManager realizes that and begin to do lease recovery by call 
> fsnamesystem.internalReleaseLease()
> 12. but the last block's state is COMPLETE, and this triggers lease manager's 
> infinite loop and prints massive logs like this:
> {noformat}
> 2013-06-05,17:42:25,695 INFO 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease [Lease.  Holder: 
> DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1] has expired hard
>  limit
> 2013-06-05,17:42:25,695 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease. 
>  Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1], src=
> /user/h_wuzesheng/test.dat
> 2013-06-05,17:42:25,695 WARN org.apache.hadoop.hdfs.StateChange: DIR* 
> NameSystem.internalReleaseLease: File = /user/h_wuzesheng/test.dat, block 
> blk_-7028017402720175688_1202597,
> lastBLockState=COMPLETE
> 2013-06-05,17:42:25,695 INFO 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager: Started block recovery 
> for file /user/h_wuzesheng/test.dat lease [Lease.  Holder: DFSClient_NONM
> APREDUCE_-1252656407_1, pendingcreates: 1]
> {noformat}
> (the 3rd line log is a debug log added by us)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to