[jira] [Commented] (HDFS-2949) HA: Add check to active state transition to prevent operator-induced split brain
[ https://issues.apache.org/jira/browse/HDFS-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14021139#comment-14021139 ]

Zhijie Shen commented on HDFS-2949:
-----------------------------------

This patch seems to break YARN-2075.

> HA: Add check to active state transition to prevent operator-induced split brain
> --------------------------------------------------------------------------------
>
>                 Key: HDFS-2949
>                 URL: https://issues.apache.org/jira/browse/HDFS-2949
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: ha, namenode
>    Affects Versions: 0.24.0
>            Reporter: Todd Lipcon
>            Assignee: Rushabh S Shah
>             Fix For: 3.0.0, 2.5.0
>
>         Attachments: HDFS-2949-v2.patch, HDFS-2949-v3.patch, HDFS-2949.patch
>
>
> Currently, if the administrator mistakenly calls "-transitionToActive" on one
> NN while the other one is still active, all hell will break loose. We can add
> a simple check by having the NN make a getServiceState() RPC to its peer with
> a short (~1 second?) timeout. If the RPC succeeds and indicates the other
> node is active, it should refuse to enter active mode. If the RPC fails or
> indicates standby, it can proceed.
> This is just meant as a preventative safety check - we still expect users to
> use the "-failover" command which has other checks plus fencing built in.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
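The check proposed in the issue description can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the attached patches: `ActiveTransitionCheck`, its `State` enum, and the `Callable` standing in for the `getServiceState()` RPC to the peer are hypothetical names, and the timeout is implemented with a plain `java.util.concurrent` future rather than Hadoop's RPC layer.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

final class ActiveTransitionCheck {
    /** Hypothetical stand-in for the HA service state; not the Hadoop enum. */
    enum State { ACTIVE, STANDBY }

    /**
     * Decides whether this node may enter active mode.
     * peerStateRpc is a hypothetical stand-in for the getServiceState() RPC to the peer.
     */
    static boolean mayBecomeActive(Callable<State> peerStateRpc, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread thread = new Thread(runnable, "peer-state-check");
            thread.setDaemon(true); // never keep the JVM alive for a hung RPC
            return thread;
        });
        try {
            Future<State> reply = executor.submit(peerStateRpc);
            State peerState = reply.get(timeoutMillis, TimeUnit.MILLISECONDS);
            // RPC succeeded: refuse only if the peer reports that it is active.
            return peerState != State.ACTIVE;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return true; // assumption: treated like a failed RPC
        } catch (Exception e) {
            // RPC failed or timed out: per the description, the transition can proceed.
            return true;
        } finally {
            executor.shutdownNow(); // cancel a still-running RPC attempt
        }
    }

    public static void main(String[] args) {
        // Simulated peer that answers "active" within the ~1 second timeout.
        boolean allowed = mayBecomeActive(() -> State.ACTIVE, 1000);
        System.out.println(allowed ? "proceeding to active" : "refusing: peer is already active");
    }
}
```

Because a failed or timed-out RPC lets the transition proceed, this remains a safety check for the common operator mistake and not a replacement for fencing, as the description itself notes.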
[jira] [Commented] (HDFS-2949) HA: Add check to active state transition to prevent operator-induced split brain
[ https://issues.apache.org/jira/browse/HDFS-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003474#comment-14003474 ]

Hudson commented on HDFS-2949:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1780 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1780/])
HDFS-2949. Add check to active state transition to prevent operator-induced split brain. Contributed by Rushabh S Shah. (kihwal: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1594709)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/HAAdmin.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/DFSHAAdmin.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/tools/TestDFSHAAdmin.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/tools/TestDFSHAAdminMiniCluster.java
[jira] [Commented] (HDFS-2949) HA: Add check to active state transition to prevent operator-induced split brain
[ https://issues.apache.org/jira/browse/HDFS-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003297#comment-14003297 ]

Hudson commented on HDFS-2949:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk #1754 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1754/])
HDFS-2949. Add check to active state transition to prevent operator-induced split brain. Contributed by Rushabh S Shah. (kihwal: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1594709)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/HAAdmin.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/DFSHAAdmin.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/tools/TestDFSHAAdmin.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/tools/TestDFSHAAdminMiniCluster.java
[jira] [Commented] (HDFS-2949) HA: Add check to active state transition to prevent operator-induced split brain
[ https://issues.apache.org/jira/browse/HDFS-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003262#comment-14003262 ]

Hudson commented on HDFS-2949:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk #562 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/562/])
HDFS-2949. Add check to active state transition to prevent operator-induced split brain. Contributed by Rushabh S Shah. (kihwal: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1594709)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/HAAdmin.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/DFSHAAdmin.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/tools/TestDFSHAAdmin.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/tools/TestDFSHAAdminMiniCluster.java
[jira] [Commented] (HDFS-2949) HA: Add check to active state transition to prevent operator-induced split brain
[ https://issues.apache.org/jira/browse/HDFS-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999019#comment-13999019 ]

Hudson commented on HDFS-2949:
------------------------------

SUCCESS: Integrated in Hadoop-trunk-Commit #5605 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5605/])
HDFS-2949. Add check to active state transition to prevent operator-induced split brain. Contributed by Rushabh S Shah. (kihwal: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1594709)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/HAAdmin.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/DFSHAAdmin.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/tools/TestDFSHAAdmin.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/tools/TestDFSHAAdminMiniCluster.java
[jira] [Commented] (HDFS-2949) HA: Add check to active state transition to prevent operator-induced split brain
[ https://issues.apache.org/jira/browse/HDFS-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997723#comment-13997723 ]

Kihwal Lee commented on HDFS-2949:
----------------------------------

I've manually kicked precommit.
[jira] [Commented] (HDFS-2949) HA: Add check to active state transition to prevent operator-induced split brain
[ https://issues.apache.org/jira/browse/HDFS-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998002#comment-13998002 ]

Kihwal Lee commented on HDFS-2949:
----------------------------------

+1 The patch looks good.
[jira] [Commented] (HDFS-2949) HA: Add check to active state transition to prevent operator-induced split brain
[ https://issues.apache.org/jira/browse/HDFS-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997928#comment-13997928 ]

Hadoop QA commented on HDFS-2949:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12644835/HDFS-2949-v3.patch
  against trunk revision .

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

                  org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup

    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/6899//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/6899//console

This message is automatically generated.
[jira] [Commented] (HDFS-2949) HA: Add check to active state transition to prevent operator-induced split brain
[ https://issues.apache.org/jira/browse/HDFS-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998010#comment-13998010 ]

Rushabh S Shah commented on HDFS-2949:
--------------------------------------

Thanks Kihwal for reviewing and committing the patch.
[jira] [Commented] (HDFS-2949) HA: Add check to active state transition to prevent operator-induced split brain
[ https://issues.apache.org/jira/browse/HDFS-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982212#comment-13982212 ]

Hadoop QA commented on HDFS-2949:
---------------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12642104/HDFS-2949-v2.patch
  against trunk revision .

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs.

    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/6749//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/6749//console

This message is automatically generated.
[jira] [Commented] (HDFS-2949) HA: Add check to active state transition to prevent operator-induced split brain
[ https://issues.apache.org/jira/browse/HDFS-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13981847#comment-13981847 ]

Hadoop QA commented on HDFS-2949:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12642029/HDFS-2949.patch
  against trunk revision .

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

                  org.apache.hadoop.hdfs.tools.TestDFSHAAdmin

    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/6744//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/6744//console

This message is automatically generated.
[jira] [Commented] (HDFS-2949) HA: Add check to active state transition to prevent operator-induced split brain
[ https://issues.apache.org/jira/browse/HDFS-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13230550#comment-13230550 ]

Todd Lipcon commented on HDFS-2949:
-----------------------------------

Another safety check here is to make sure that the transaction IDs match between the nodes before going active.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
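The transaction ID check suggested in the comment above might look something like the following minimal sketch. It assumes that "match" means the last transaction ID seen by each node is equal, and that both values have already been obtained somehow; `TxIdCheck` and its parameter names are hypothetical and do not correspond to any Hadoop API.

```java
final class TxIdCheck {
    /**
     * Throws if the two nodes disagree on the last transaction ID.
     * Both arguments are assumed inputs; how they are fetched is out of scope here.
     */
    static void checkTxIdsMatch(long localLastTxId, long peerLastTxId) {
        if (localLastTxId != peerLastTxId) {
            throw new IllegalStateException(
                "Refusing to go active: local txid " + localLastTxId
                    + " does not match peer txid " + peerLastTxId);
        }
    }

    public static void main(String[] args) {
        checkTxIdsMatch(42L, 42L); // IDs agree, so no exception is thrown
        System.out.println("transaction IDs match");
    }
}
```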
[jira] [Commented] (HDFS-2949) HA: Add check to active state transition to prevent operator-induced split brain
[ https://issues.apache.org/jira/browse/HDFS-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208232#comment-13208232 ]

Todd Lipcon commented on HDFS-2949:
-----------------------------------

Yep, this is not supposed to solve issues, just to prevent a mistake in the common case. Fencing is the correct answer to prevent split brain in the general case.

Asking for confirmation might be a nice improvement as well, so long as there's a --force option.

> HA: Add check to active state transition to prevent operator-induced split brain
> --------------------------------------------------------------------------------
>
>                 Key: HDFS-2949
>                 URL: https://issues.apache.org/jira/browse/HDFS-2949
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ha, name-node
>    Affects Versions: HA branch (HDFS-1623)
>            Reporter: Todd Lipcon
>
> Currently, if the administrator mistakenly calls "-transitionToActive" on one
> NN while the other one is still active, all hell will break loose. We can add
> a simple check by having the NN make a getServiceState() RPC to its peer with
> a short (~1 second?) timeout. If the RPC succeeds and indicates the other
> node is active, it should refuse to enter active mode. If the RPC fails or
> indicates standby, it can proceed.
> This is just meant as a preventative safety check - we still expect users to
> use the "-failover" command which has other checks plus fencing built in.
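The confirmation idea from this comment could be sketched as below. Only the `--force` option name comes from the thread; the class `TransitionConfirmation`, the prompt wording, and the choice to treat unreadable input as "not confirmed" are assumptions made for illustration, not behavior of the actual haadmin tool.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintStream;
import java.io.StringReader;

final class TransitionConfirmation {
    /**
     * Returns true if the transition should go ahead: either --force was passed,
     * or the operator answered "y" at the prompt.
     */
    static boolean confirmed(String[] args, BufferedReader in, PrintStream out) {
        for (String arg : args) {
            if ("--force".equals(arg)) {
                return true; // skip the prompt entirely
            }
        }
        out.print("This will transition the node to active. Proceed? (y/N) ");
        out.flush();
        try {
            String answer = in.readLine();
            return answer != null && answer.trim().equalsIgnoreCase("y");
        } catch (IOException e) {
            return false; // assumption: unreadable input counts as "no"
        }
    }

    public static void main(String[] args) {
        // Simulated operator input instead of a real terminal.
        BufferedReader simulatedInput = new BufferedReader(new StringReader("y\n"));
        boolean go = confirmed(args, simulatedInput, System.out);
        System.out.println();
        System.out.println(go ? "transitioning" : "aborted");
    }
}
```

Defaulting to "no" on anything other than an explicit "y" keeps the prompt conservative, while `--force` preserves scripted use.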
[jira] [Commented] (HDFS-2949) HA: Add check to active state transition to prevent operator-induced split brain
[ https://issues.apache.org/jira/browse/HDFS-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208231#comment-13208231 ]

Uma Maheswara Rao G commented on HDFS-2949:
-------------------------------------------

{quote}
That said, having the safety check described in this JIRA is still valuable,
{quote}
Agreed on adding safety checks. But this cannot prevent 100% of split-brain scenarios, right? (For example: a brief network break between the active and standby, during which the admin accidentally executes -transitionToActive on the standby.) I think this will be addressed in the future as part of automatic failover and shared-storage fencing. But when admins work directly with the command line for maintenance purposes, this case may still occur, right?

Also, for the transitionTo* APIs, do we need to ask the user for confirmation before actually transitioning? That may make the admin pay more attention before proceeding.
[jira] [Commented] (HDFS-2949) HA: Add check to active state transition to prevent operator-induced split brain
[ https://issues.apache.org/jira/browse/HDFS-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208168#comment-13208168 ]

Todd Lipcon commented on HDFS-2949:
-----------------------------------

I think we should probably un-document the transitionTo* commands, but leave them as a safety valve. It's nice to have direct access to these RPCs just in case there's some problem with one of the safer methods and you need a workaround without recompiling the client.

That said, having the safety check described in this JIRA is still valuable, even using haadmin -failover, in case the admin has a messed up configuration in some way (eg the fencing script returns true but did not in fact fence the standby correctly)
[jira] [Commented] (HDFS-2949) HA: Add check to active state transition to prevent operator-induced split brain
[ https://issues.apache.org/jira/browse/HDFS-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208161#comment-13208161 ]

Hari Mankude commented on HDFS-2949:
------------------------------------

If the -failover command can handle this situation and other situations correctly, why not deprecate -transitionToActive entirely?