[ 
https://issues.apache.org/jira/browse/HDFS-6184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated HDFS-6184:
--------------------------
    Attachment: HDFS-6184-4.patch

Thanks [~ajisakaa]. Here is the updated patch with your suggestions. I also 
changed the name of the parameter to include the unit of time. We find the 
feature useful, so it is enabled by default. If people have concerns, we can 
set the default timeout value to zero to disable the feature.
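
To illustrate the "zero disables it" convention, here is a minimal sketch. 
It is not the code in the patch: it dumps the threads of the local JVM via 
ThreadMXBean instead of fetching the NN's dump remotely, and the class and 
method names (ThreadDumpSketch, captureWithTimeout) are made up for this 
example. It only shows a capture that is bounded by a timeout in 
milliseconds and skipped entirely when that timeout is zero.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Optional;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; names are not taken from the patch.
final class ThreadDumpSketch {

    /** Renders the name, state and stack of every live thread in this JVM. */
    static String dumpLocalThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // false/false: do not collect lock information, only stacks.
        for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    /**
     * Captures a dump within timeoutMs milliseconds.
     * A timeout of zero (or less) disables the capture; a capture that
     * does not finish in time is abandoned. Both cases return empty.
     */
    static Optional<String> captureWithTimeout(long timeoutMs) {
        if (timeoutMs <= 0) {
            return Optional.empty(); // feature disabled
        }
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "thread-dump-capture");
            t.setDaemon(true); // never keep the JVM alive for a dump
            return t;
        });
        try {
            Future<String> dump = executor.submit(ThreadDumpSketch::dumpLocalThreads);
            return Optional.of(dump.get(timeoutMs, TimeUnit.MILLISECONDS));
        } catch (TimeoutException | ExecutionException e) {
            return Optional.empty(); // best effort: give up quietly
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // 5000 ms is an arbitrary value chosen for the example.
        System.out.println(captureWithTimeout(5000).orElse("(no dump captured)"));
    }
}
```

Returning an empty Optional for both the disabled and the timed-out case 
keeps the caller's failover path from ever blocking on diagnostics.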

[~aw], FYI, I uploaded a new rebased patch earlier; Jenkins hasn't finished 
yet. The Jenkins result you saw was for the earlier patch.

> Capture NN's thread dump when it fails over
> -------------------------------------------
>
>                 Key: HDFS-6184
>                 URL: https://issues.apache.org/jira/browse/HDFS-6184
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>         Attachments: HDFS-6184-2.patch, HDFS-6184-3.patch, HDFS-6184-4.patch, 
> HDFS-6184.patch
>
>
> We have seen several false positives in which ZKFC considers NN to be 
> unhealthy. Some of these trigger unnecessary failovers. Examples:
> 1. An SBN checkpoint caused ZKFC's RPC call into NN to time out. The 
> consequence isn't bad; SBN just quits ZK membership and rejoins it later. 
> But it is unnecessary. The reason is that the checkpoint acquires the NN 
> global write lock and all RPC requests are blocked. Even though 
> HAServiceProtocol.monitorHealth doesn't need to acquire the NN lock, it 
> still needs to use the service RPC queue.
> 2. When ANN is busy, sometimes the global lock can block other requests, 
> and ZKFC's RPC call times out. This triggers a failover. The concern is 
> that even after the failover, the new ANN might run into a similar issue.
> We can increase the ZKFC-to-NN timeout value to mitigate this to some 
> degree. It would be useful if ZKFC could judge more accurately whether NN 
> is healthy and could predict whether a failover will help. For example, 
> we can:
> 1. Have ZKFC make its decision based on the NN thread dump.
> 2. Have a dedicated RPC pool for ZKFC-to-NN calls. The health check doesn't 
> need to acquire the NN global lock, so it can go through even if NN is 
> checkpointing or very busy.
> Any comments?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
