[ https://issues.apache.org/jira/browse/HDFS-6184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ming Ma updated HDFS-6184:
--------------------------
    Labels: BB2015-05-RFC  (was: BB2015-05-TBR)

> Capture NN's thread dump when it fails over
> -------------------------------------------
>
>                 Key: HDFS-6184
>                 URL: https://issues.apache.org/jira/browse/HDFS-6184
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>              Labels: BB2015-05-RFC
>         Attachments: HDFS-6184-2.patch, HDFS-6184-3.patch, HDFS-6184.patch
>
>
> We have seen several false positives in which ZKFC considers the NN to be
> unhealthy. Some of these trigger unnecessary failovers. Examples:
> 1. An SBN checkpoint caused ZKFC's RPC call into the NN to time out. The
> consequence isn't bad: the SBN just quits ZK membership and rejoins it later.
> But it is unnecessary. The reason is that the checkpoint acquires the NN
> global write lock, so all RPC requests are blocked. Even though
> HAServiceProtocol.monitorHealth doesn't need to acquire the NN lock, it still
> needs to use the service RPC queue.
> 2. When the ANN is busy, the global lock can sometimes block other requests,
> and ZKFC's RPC call times out. This triggers a failover. The catch is that
> even after the failover, the new ANN might run into a similar issue.
> We can increase the ZKFC-to-NN timeout value to mitigate this to some degree.
> It would be useful if ZKFC could judge more accurately whether the NN is
> healthy and predict whether a failover will help. For example, we can:
> 1. Have ZKFC make its decision based on the NN's thread dump.
> 2. Have a dedicated RPC pool for ZKFC-to-NN calls. The health check doesn't
> need to acquire the NN global lock, so it can go through even if the NN is
> checkpointing or very busy.
> Any comments?
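
To illustrate the kind of information a thread dump would give at failover time, here is a minimal sketch that dumps all threads of the current JVM with the standard java.lang.management API. It is only an illustration, not the approach taken in the attached patches: the class name ThreadDumpSketch and the method captureThreadDump are hypothetical, and since ZKFC runs in a separate process from the NN, a real implementation would have to fetch the dump from the NN remotely rather than in-process as shown here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper for illustration; not part of Hadoop.
final class ThreadDumpSketch {

    /** Returns a plain-text dump of all live threads in the current JVM. */
    static String captureThreadDump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Request lock details only when the JVM supports them; otherwise
        // dumpAllThreads would throw UnsupportedOperationException.
        boolean monitors = threads.isObjectMonitorUsageSupported();
        boolean synchronizers = threads.isSynchronizerUsageSupported();

        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : threads.dumpAllThreads(monitors, synchronizers)) {
            if (info == null) {
                continue; // defensive: skip threads that ended mid-dump
            }
            out.append('"').append(info.getThreadName()).append("\" id=")
               .append(info.getThreadId()).append(" state=")
               .append(info.getThreadState()).append('\n');
            // Show what the thread is blocked on and who holds it, which is
            // what reveals contention on a single global lock.
            if (info.getLockName() != null) {
                out.append("  waiting on ").append(info.getLockName());
                if (info.getLockOwnerName() != null) {
                    out.append(" held by \"")
                       .append(info.getLockOwnerName()).append('"');
                }
                out.append('\n');
            }
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(captureThreadDump());
    }
}
```

In a dump like this, many RPC handler threads shown as waiting on the same lock, with one owner thread in the checkpoint path, would suggest the false-positive case described above rather than a genuinely unhealthy NN.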