[ https://issues.apache.org/jira/browse/HDFS-6184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541216#comment-14541216 ]
Akira AJISAKA commented on HDFS-6184:
-------------------------------------

+1, thanks [~mingma]. The test failures look unrelated to the patch. The checkstyle issue is due to
{code}
+  public static final String DFS_HA_ZKFC_NN_HTTP_TIMEOUT_KEY = "dfs.ha.zkfc.nn.http.timeout.ms";
{code}
We usually ignore this checkstyle issue for parameter definitions.

> Capture NN's thread dump when it fails over
> -------------------------------------------
>
>                 Key: HDFS-6184
>                 URL: https://issues.apache.org/jira/browse/HDFS-6184
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>         Attachments: HDFS-6184-2.patch, HDFS-6184-3.patch, HDFS-6184-4.patch,
> HDFS-6184-5.patch, HDFS-6184-6.patch, HDFS-6184.patch
>
>
> We have seen several false positives in which ZKFC considers NN to be
> unhealthy. Some of these trigger unnecessary failover. Examples:
> 1. SBN checkpoint caused ZKFC's RPC call into NN to time out. The
> consequence isn't bad; SBN just quits ZK membership and rejoins it later.
> But it is unnecessary. The reason is that checkpoint acquires the NN global
> write lock and all RPC requests are blocked. Even though
> HAServiceProtocol.monitorHealth doesn't need to acquire the NN lock, it
> still needs to use the service RPC queue.
> 2. When ANN is busy, sometimes the global lock can block other requests.
> ZKFC's RPC call times out. This will trigger failover. The question is
> whether, even after the failover, the new ANN might run into a similar
> issue.
> We can increase the ZKFC to NN timeout value to mitigate this to some
> degree. It would be useful if ZKFC could judge more accurately whether NN
> is healthy and could predict whether a failover will help. For example, we
> can:
> 1. Have ZKFC make its decision based on the NN thread dump.
> 2. Have a dedicated RPC pool for ZKFC > NN. The health check doesn't need
> to acquire the NN global lock, so it can go through even if NN is doing
> checkpointing or is very busy.
> Any comments?
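
For readers following the thread, here is a minimal sketch of how a timeout key like the one quoted in the comment might bound an HTTP fetch of the NN thread dump. It is not the code from the attached patches. The class name ThreadDumpFetcher, the methods timeoutMs and fetch, the use of java.util.Properties in place of Hadoop's Configuration, and the choice to skip the fetch for a non-positive timeout are all assumptions made for illustration; only the key string comes from the quoted patch line.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Illustrative sketch only; names other than the key string are hypothetical.
public class ThreadDumpFetcher {
    // Key name taken from the patch line quoted in the comment.
    public static final String DFS_HA_ZKFC_NN_HTTP_TIMEOUT_KEY =
        "dfs.ha.zkfc.nn.http.timeout.ms";

    // Reads the timeout in milliseconds. java.util.Properties stands in for
    // Hadoop's Configuration here; the caller supplies the default.
    public static int timeoutMs(Properties conf, int defaultMs) {
        String raw = conf.getProperty(DFS_HA_ZKFC_NN_HTTP_TIMEOUT_KEY);
        return raw == null ? defaultMs : Integer.parseInt(raw.trim());
    }

    // Fetches the body served at url, using timeoutMs as both the connect
    // and the read timeout so a blocked NN cannot stall the caller forever.
    public static String fetch(URL url, int timeoutMs) throws IOException {
        if (timeoutMs <= 0) {
            // Assumption of this sketch: a non-positive value skips the
            // capture, because a URLConnection timeout of 0 means "wait forever".
            return "";
        }
        URLConnection conn = url.openConnection();
        conn.setConnectTimeout(timeoutMs);
        conn.setReadTimeout(timeoutMs);
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

A caller would resolve the timeout once, for example ThreadDumpFetcher.timeoutMs(conf, 20000) with 20000 as an illustrative default, and pass it to fetch together with the URL of whatever NN HTTP endpoint serves the thread dump. Because the fetch goes over HTTP rather than the service RPC queue, it can in principle still succeed when RPC handlers are blocked behind the global lock, which is the situation described in the issue.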