Nicolas Fraison created HDFS-10220:
--------------------------------------
Summary: Namenode failover due to too long locking in
LeaseManager.Monitor
Key: HDFS-10220
URL: https://issues.apache.org/jira/browse/HDFS-10220
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Reporter: Nicolas Fraison
Priority: Minor
I have faced a namenode failover due to an unresponsive namenode detected by the
zkfc, with lots of WARN messages (5 million) like this one:
_org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All existing
blocks are COMPLETE, lease removed, file closed._
In the thread dump taken by the zkfc, many threads are blocked waiting on a
lock.
Looking at the code, a lock is taken by the LeaseManager.Monitor when
some leases must be released. Because of the very large number of leases to be
released, the namenode took too long to release them, blocking all
other tasks and making the zkfc think that the namenode was
unavailable or stuck.
The idea of this patch is to limit the number of leases released each time we
check for expired leases, so the lock is not held for too long a period.
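The bounding idea can be sketched roughly as follows. This is only an illustration and not the actual patch or the real LeaseManager code: the class BoundedLeaseReleaseSketch, its fields, and the maxReleasesPerCheck limit are hypothetical names, and a plain ReentrantLock stands in for the namesystem lock. Each check releases at most a fixed number of leases, then drops the lock so other threads can make progress; the remaining leases are handled by later checks.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: not the real LeaseManager.Monitor.
final class BoundedLeaseReleaseSketch {
    // Stand-in for the lock that other namenode operations also need.
    private final ReentrantLock lock = new ReentrantLock();
    // Expired leases waiting to be released, identified here by holder name.
    private final Deque<String> expiredLeases = new ArrayDeque<>();
    // Assumed tunable: upper bound on leases released per check.
    private final int maxReleasesPerCheck;

    BoundedLeaseReleaseSketch(int maxReleasesPerCheck) {
        if (maxReleasesPerCheck <= 0) {
            throw new IllegalArgumentException("maxReleasesPerCheck must be > 0");
        }
        this.maxReleasesPerCheck = maxReleasesPerCheck;
    }

    void addExpiredLease(String holder) {
        lock.lock();
        try {
            expiredLeases.addLast(holder);
        } finally {
            lock.unlock();
        }
    }

    // One monitor pass: release at most maxReleasesPerCheck leases,
    // then give up the lock even if more expired leases remain.
    int checkLeases() {
        lock.lock();
        try {
            int released = 0;
            while (released < maxReleasesPerCheck && !expiredLeases.isEmpty()) {
                expiredLeases.pollFirst(); // placeholder for the real release work
                released++;
            }
            return released;
        } finally {
            lock.unlock();
        }
    }

    int pendingLeases() {
        lock.lock();
        try {
            return expiredLeases.size();
        } finally {
            lock.unlock();
        }
    }
}
```

With a limit of 3 and 7 expired leases, for example, three successive checks would release 3, 3 and 1 leases, with the lock freed between checks. A bound on elapsed time instead of a count would be another option; the count is used here only because it is what the description above proposes.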