[ https://issues.apache.org/jira/browse/HDFS-6184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541796#comment-14541796 ]

Hudson commented on HDFS-6184:
------------------------------

SUCCESS: Integrated in Hadoop-Yarn-trunk #926 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/926/])
HDFS-6184. Capture NN's thread dump when it fails over. Contributed by Ming Ma. (aajisaka: rev 2463666ecb553dbde1b8c540a21ad3d599239acf)
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestDFSZKFailoverController.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/tools/TestDFSZKFailoverController.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
* hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/DFSZKFailoverController.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java


> Capture NN's thread dump when it fails over
> -------------------------------------------
>
>                 Key: HDFS-6184
>                 URL: https://issues.apache.org/jira/browse/HDFS-6184
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>             Fix For: 2.8.0
>
>         Attachments: HDFS-6184-2.patch, HDFS-6184-3.patch, HDFS-6184-4.patch,
> HDFS-6184-5.patch, HDFS-6184-6.patch, HDFS-6184.patch
>
>
> We have seen several false positives in which ZKFC considers the NN to be
> unhealthy. Some of these trigger unnecessary failovers. Examples:
> 1. An SBN checkpoint caused ZKFC's RPC call into the NN to time out. The
> consequence isn't bad; the SBN just quits ZK membership and rejoins it later.
> But it is unnecessary. The reason is that the checkpoint acquires the NN
> global write lock, so all RPC requests are blocked. Even though
> HAServiceProtocol.monitorHealth doesn't need to acquire the NN lock, it still
> needs to use the service RPC queue.
> 2. When the ANN is busy, the global lock can sometimes block other requests,
> and ZKFC's RPC call times out. This triggers a failover. The concern is that,
> even after the failover, the new ANN might run into a similar issue.
> We can increase the ZKFC-to-NN timeout value to mitigate this to some degree.
> It would be useful if ZKFC could judge more accurately whether the NN is
> healthy and predict whether a failover will help. For example, we can:
> 1. Have ZKFC make its decision based on the NN thread dump.
> 2. Have a dedicated RPC pool for ZKFC-to-NN calls. Since the health check
> doesn't need to acquire the NN global lock, it can go through even if the NN
> is checkpointing or very busy.
> Any comments?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)