[ https://issues.apache.org/jira/browse/HDFS-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218036#comment-15218036 ]

Ravi Prakash commented on HDFS-10220:
-------------------------------------

Oh! I'm sorry in that case. Thank you for re-opening the JIRA. Just out of 
curiosity, how many files did you have open at one time? Do you think we should 
cycle through all the leases rather than the same set every iteration? We may 
be over-engineering, but I'd be interested in your opinion. I'll review your 
patch shortly.
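
For concreteness, here is a minimal standalone sketch of that cycling idea (all names, e.g. CyclingLeaseScanner and checkLease, are illustrative, not Hadoop APIs): keep a cursor that survives across Monitor passes, so successive passes walk the whole lease set instead of re-examining the same leases from the head of the list every time.

{code:java}
import java.util.ArrayList;
import java.util.List;

class CyclingLeaseScanner {
  private final List<String> leaseHolders = new ArrayList<>();
  private int cursor = 0; // persists between passes

  /** Examine up to maxPerPass leases, wrapping around the list. */
  void scanOnePass(int maxPerPass) {
    if (leaseHolders.isEmpty()) {
      return;
    }
    int toScan = Math.min(maxPerPass, leaseHolders.size());
    for (int i = 0; i < toScan; i++) {
      checkLease(leaseHolders.get(cursor));
      cursor = (cursor + 1) % leaseHolders.size();
    }
  }

  private void checkLease(String holder) {
    // Placeholder: in HDFS this would test expiry and possibly release the lease.
  }
}
{code}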

> Namenode failover due to overly long locking in LeaseManager.Monitor
> --------------------------------------------------------------------
>
>                 Key: HDFS-10220
>                 URL: https://issues.apache.org/jira/browse/HDFS-10220
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Nicolas Fraison
>            Priority: Minor
>         Attachments: HADOOP-10220.001.patch, threaddump_zkfc.txt
>
>
> I faced a namenode failover after the zkfc detected the namenode as 
> unresponsive, with lots of WARN messages (5 million) like this one:
> _org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All 
> existing blocks are COMPLETE, lease removed, file closed._
> In the thread dump taken by the zkfc, many threads are blocked waiting on a 
> lock.
> Looking at the code, a lock is taken by LeaseManager.Monitor when leases must 
> be released. Because of the very large number of leases to release, the 
> namenode took too long to release them, blocking all other tasks and making 
> the zkfc think the namenode was unavailable/stuck.
> The idea of this patch is to limit the number of leases released on each 
> check, so the lock is never held for too long a period.
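For illustration, a minimal standalone sketch of the idea in the quoted description (this is not the attached HADOOP-10220.001.patch; the class, the MAX_RELEASES_PER_CHECK cap, and the release() stub are assumed names, not Hadoop internals): cap how many expired leases a single Monitor pass may release while holding the lock, leaving the remainder for the next pass.

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

class ThrottledLeaseMonitor {
  private static final int MAX_RELEASES_PER_CHECK = 1000; // assumed cap, would be tunable

  private final Deque<Long> expiredLeases = new ArrayDeque<>();
  private final Object lock = new Object(); // stands in for the shared namesystem lock

  void checkLeases() {
    synchronized (lock) {
      int released = 0;
      // Stop once the cap is hit; the next pass picks up the rest,
      // so the lock is never held for an unbounded stretch.
      while (!expiredLeases.isEmpty() && released < MAX_RELEASES_PER_CHECK) {
        release(expiredLeases.poll());
        released++;
      }
    }
  }

  private void release(long leaseId) {
    // In HDFS this would be internalReleaseLease(); stubbed out here.
  }
}
{code}

With a cap like this, the worst-case lock hold per pass is bounded by the cap times the per-lease release cost, rather than by the total number of expired leases.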



