Chris Nauroth created HDFS-9311:
-----------------------------------

             Summary: Support optional offload of NameNode HA service health 
checks to a separate RPC server.
                 Key: HDFS-9311
                 URL: https://issues.apache.org/jira/browse/HDFS-9311
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: ha, namenode
            Reporter: Chris Nauroth
            Assignee: Chris Nauroth


When a NameNode is overwhelmed with load, it can lead to resource exhaustion of 
the RPC handler pools (both client-facing and service-facing).  Eventually, 
this blocks the health check RPC issued from ZKFC, which triggers a failover.  
Depending on fencing configuration, the former active NameNode may be killed.  
Under overload, the new active NameNode is likely to suffer the same fate, 
because client load patterns don't change after the failover.  This can 
degenerate into flapping between the two NameNodes without real recovery.  A 
NameNode killed by fencing must also restart and transition through safe mode, 
further delaying recovery.

This issue proposes a separate, optional RPC server at the NameNode for 
isolating the HA health checks.  These health checks are lightweight operations 
that do not contend for the namesystem lock or other shared resources, so 
isolating their RPC handlers should be sufficient to avoid the spurious 
failovers described above.
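
The effect of handler isolation can be illustrated with a minimal, 
self-contained sketch.  This is not NameNode code: the class name, method 
name, pool sizes, and the 200 ms timeout are all hypothetical, and plain 
java.util.concurrent thread pools stand in for the RPC handler pools.  It 
saturates a small "client" pool and then issues a trivial health check either 
on that same pool or on a dedicated one.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; thread pools stand in for RPC handler pools.
public class HandlerPoolIsolationSketch {

    /**
     * Saturates a small "client" handler pool, then issues a health check
     * either on the same pool (shared) or on a dedicated pool (isolated).
     * Returns true if the health check answers within the timeout.
     */
    public static boolean healthCheckSucceeds(boolean isolated) throws Exception {
        int clientHandlers = 2; // assumed pool size, chosen for the demo
        ExecutorService clientPool = Executors.newFixedThreadPool(clientHandlers);
        ExecutorService healthPool =
            isolated ? Executors.newFixedThreadPool(1) : clientPool;
        CountDownLatch release = new CountDownLatch(1);
        try {
            // Occupy every client handler with a call that blocks until released,
            // simulating handlers stuck behind heavy client load.
            for (int i = 0; i < clientHandlers; i++) {
                clientPool.submit(() -> { release.await(); return null; });
            }
            // The health check itself is trivial; it only needs a free handler.
            Future<Boolean> check = healthPool.submit(() -> true);
            try {
                return check.get(200, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                return false; // no handler became free in time
            }
        } finally {
            release.countDown();      // unblock the simulated client calls
            clientPool.shutdownNow();
            healthPool.shutdownNow(); // harmless if it is the same pool
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("shared pool:   " + healthCheckSucceeds(false));
        System.out.println("isolated pool: " + healthCheckSucceeds(true));
    }
}
```

With the shared pool, the health check waits in the queue behind the blocked 
client calls and times out; with the dedicated pool, it is answered 
immediately even though the client handlers are still exhausted.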



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
