[ https://issues.apache.org/jira/browse/HDFS-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218247#comment-15218247 ]
Nicolas Fraison commented on HDFS-10220:
----------------------------------------

[~vinayrpet] On the day we hit this kind of failover, we went through multiple failovers with the same issue on both namenodes. It happened after a bad action on the cluster removed the mapreduce.jobhistory.intermediate-done-dir folder, which then caused lots of mapreduce jobs to fail...

Since we applied this patch, we have once reached 250K leases to release, which took a total of 45 seconds (with 100k leases processed per cycle).

> Namenode failover due to too long locking in LeaseManager.Monitor
> -----------------------------------------------------------------
>
>                 Key: HDFS-10220
>                 URL: https://issues.apache.org/jira/browse/HDFS-10220
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Nicolas Fraison
>            Priority: Minor
>         Attachments: HADOOP-10220.001.patch, threaddump_zkfc.txt
>
>
> I have faced a namenode failover due to an unresponsive namenode detected by
> the zkfc, with lots of WARN messages (5 million) like this one:
> _org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All
> existing blocks are COMPLETE, lease removed, file closed._
> In the thread dump taken by the zkfc, lots of threads are blocked on a lock.
> Looking at the code, a lock is taken by the LeaseManager.Monitor when
> leases must be released. Because of the very large number of leases to
> release, the namenode took too long to release them, blocking all other
> tasks and making the zkfc think that the namenode was not available/stuck.
> The idea of this patch is to limit the number of leases released each time we
> check for leases, so the lock won't be held for too long.
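
For readers who want to see the idea in isolation, here is a minimal Java sketch of capping the work done per lock acquisition. It is not the code from the attached patch and not the real LeaseManager: the class BoundedLeaseReleaser, its methods, and the maxPerCycle parameter are all hypothetical names chosen for illustration, and a plain ReentrantLock stands in for the namenode lock.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Illustrative only: a hypothetical stand-in for a lease monitor that
 * releases at most maxPerCycle leases each time it takes the lock.
 */
class BoundedLeaseReleaser {
    // Stands in for the lock that other namenode operations also need.
    private final ReentrantLock lock = new ReentrantLock();
    // Expired leases waiting to be released (identified here by a plain string).
    private final Queue<String> expiredLeases = new ArrayDeque<>();
    private final int maxPerCycle;

    BoundedLeaseReleaser(int maxPerCycle) {
        if (maxPerCycle <= 0) {
            throw new IllegalArgumentException("maxPerCycle must be positive");
        }
        this.maxPerCycle = maxPerCycle;
    }

    void addExpiredLease(String holder) {
        lock.lock();
        try {
            expiredLeases.add(holder);
        } finally {
            lock.unlock();
        }
    }

    /**
     * One monitor cycle: release up to maxPerCycle leases, then give the
     * lock back so other threads can make progress before the next cycle.
     * Returns the number of leases released in this cycle.
     */
    int releaseBatch() {
        lock.lock();
        try {
            int released = 0;
            while (released < maxPerCycle && !expiredLeases.isEmpty()) {
                expiredLeases.poll(); // placeholder for the real release work
                released++;
            }
            return released;
        } finally {
            lock.unlock();
        }
    }

    int pending() {
        lock.lock();
        try {
            return expiredLeases.size();
        } finally {
            lock.unlock();
        }
    }
}
```

With a cap of 100 and 250 pending leases, three calls to releaseBatch() release 100, 100 and 50 leases, and the lock is free between calls. This has the same shape as the 250K leases handled at 100k per cycle mentioned in the comment above.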