Ming Ma created HDFS-6184:
-----------------------------

             Summary: Better health check from ZKFC
                 Key: HDFS-6184
                 URL: https://issues.apache.org/jira/browse/HDFS-6184
             Project: Hadoop HDFS
          Issue Type: Improvement
            Reporter: Ming Ma


We have seen several false positives where ZKFC considers the NN to be 
unhealthy. Some of these trigger unnecessary failovers. Examples:

1. An SBN checkpoint caused ZKFC's RPC call into the NN to time out. The 
consequence isn't bad; the SBN just quits ZK membership and rejoins later. But 
it is unnecessary. The reason is that the checkpoint acquires the NN global 
write lock, so all RPC requests are blocked. Even though 
HAServiceProtocol.monitorHealth doesn't need to acquire the NN lock, it still 
needs to use the service RPC queue.

2. When the ANN is busy, the global lock can sometimes block other requests, 
and ZKFC's RPC call times out. This triggers a failover. The question is 
whether the failover helps at all, since the new ANN might run into a similar 
issue.

We can increase the ZKFC-to-NN timeout value to mitigate this to some degree. 
It would also be useful if ZKFC could judge more accurately whether the NN is 
healthy and could predict whether a failover will help. For example, we can:

1. Have ZKFC make its decision based on an NN thread dump.
2. Have a dedicated RPC pool for ZKFC -> NN calls. Since the health check 
doesn't need to acquire the NN global lock, it can go through even if the NN 
is checkpointing or very busy.
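
To illustrate why option 2 might help, here is a minimal sketch. It is not 
Hadoop's RPC server; two plain Java executors stand in for the shared service 
RPC queue and a hypothetical dedicated health-check queue, and the names 
HealthCheckQueues, probe and blockHandler are made up for this example. A 
long-running task occupies the only handler of the shared queue, the way a 
checkpoint holding the global lock would stall the handlers, and the same 
probe is then sent through both queues.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class HealthCheckQueues {

    /** Sends a trivial, lock-free health probe through the given queue.
     *  Returns true only if the probe is answered within timeoutMs. */
    static boolean probe(ExecutorService queue, long timeoutMs)
            throws InterruptedException {
        Callable<Boolean> check = () -> Boolean.TRUE;
        Future<Boolean> reply = queue.submit(check);
        try {
            return reply.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            reply.cancel(true); // drop the probe that never got a handler
            return false;
        }
    }

    /** Occupies one handler of the queue until the returned latch is counted
     *  down; stands in for requests stuck behind the global lock. */
    static CountDownLatch blockHandler(ExecutorService queue)
            throws InterruptedException {
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        queue.execute(() -> {
            started.countDown();
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        started.await(); // make sure the handler is really busy
        return release;
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumption: one handler per queue, to keep the sketch small.
        ExecutorService serviceQueue = Executors.newSingleThreadExecutor();
        ExecutorService healthQueue = Executors.newSingleThreadExecutor();

        CountDownLatch release = blockHandler(serviceQueue);
        System.out.println("shared queue healthy:    " + probe(serviceQueue, 200));
        System.out.println("dedicated queue healthy: " + probe(healthQueue, 200));

        release.countDown();
        serviceQueue.shutdown();
        healthQueue.shutdown();
    }
}
```

In this sketch the probe on the shared queue times out although nothing is 
actually wrong with the process, while the probe on the dedicated queue 
returns right away. Whether a real NN would behave the same depends on the 
health check truly not needing the global lock, as described above.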

Any comments?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
