[jira] [Commented] (HBASE-19838) Can not shutdown backup master cleanly when it has already tried to become the active master
[ https://issues.apache.org/jira/browse/HBASE-19838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335498#comment-16335498 ]

Hudson commented on HBASE-19838:

FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #4452 (See [https://builds.apache.org/job/HBase-Trunk_matrix/4452/])
HBASE-19838 Can not shutdown backup master cleanly when it has already (zhangduo: rev 970636c5afbd1a12a998af3e8b0825f806bedeca)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* (add) hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestShutdownBackupMaster.java

> Can not shutdown backup master cleanly when it has already tried to become the active master
>
> Key: HBASE-19838
> URL: https://issues.apache.org/jira/browse/HBASE-19838
> Project: HBase
> Issue Type: Bug
> Components: master
> Reporter: Duo Zhang
> Assignee: stack
> Priority: Critical
> Fix For: 2.0.0-beta-2
>
> Attachments: HBASE-19838-UT.patch, HBASE-19838.master.001.patch
>
> This is the root cause of the TestZooKeeper hang.
> Open a new issue to introduce a UT which can reproduce the problem stably, so
> that we can fix TestZooKeeper, since it is not designed to test this problem.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-19838) Can not shutdown backup master cleanly when it has already tried to become the active master
[ https://issues.apache.org/jira/browse/HBASE-19838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335274#comment-16335274 ]

Duo Zhang commented on HBASE-19838:

+1. Let me commit.
[jira] [Commented] (HBASE-19838) Can not shutdown backup master cleanly when it has already tried to become the active master
[ https://issues.apache.org/jira/browse/HBASE-19838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335261#comment-16335261 ]

Hadoop QA commented on HBASE-19838:

| (/) *+1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 8s | Docker mode activated. |
|| || || || Prechecks ||
| 0 | findbugs | 0m 0s | Findbugs executables are not available. |
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || master Compile Tests ||
| +1 | mvninstall | 4m 42s | master passed |
| +1 | compile | 0m 42s | master passed |
| +1 | checkstyle | 1m 8s | master passed |
| +1 | shadedjars | 5m 50s | branch has no errors when building our shaded downstream artifacts. |
| +1 | javadoc | 0m 31s | master passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 6m 26s | the patch passed |
| +1 | compile | 1m 2s | the patch passed |
| +1 | javac | 1m 2s | the patch passed |
| +1 | checkstyle | 1m 31s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 5m 56s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 23m 4s | Patch does not cause any errors with Hadoop 2.6.5 2.7.4 or 3.0.0. |
| +1 | javadoc | 0m 28s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 111m 32s | hbase-server in the patch passed. |
| +1 | asflicense | 0m 19s | The patch does not generate ASF License warnings. |
| | | 156m 31s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:eee3b01 |
| JIRA Issue | HBASE-19838 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12907194/HBASE-19838.master.001.patch |
| Optional Tests | asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 814124d4ebf8 3.13.0-133-generic #182-Ubuntu SMP Tue Sep 19 15:49:21 UTC 2017 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / d49357f265 |
| maven | version: Apache Maven 3.5.2 (138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T07:58:13Z) |
| Default Java | 1.8.0_151 |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/11158/testReport/ |
| modules | C: hbase-server U: hbase-server |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/11158/console |
| Powered by | Apache Yetus 0.6.0 http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HBASE-19838) Can not shutdown backup master cleanly when it has already tried to become the active master
[ https://issues.apache.org/jira/browse/HBASE-19838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335083#comment-16335083 ]

stack commented on HBASE-19838:

Closing the shared clusterConnection in Master#shutdown seems to do the trick. This is how it looks with the nice test added here:

{code:java}
2018-01-22 14:41:07,171 INFO [M:0;localhost:52959] regionserver.HRegionServer(1152): M:0;localhost:52959 exiting
Exception in thread "M:0;localhost:52959" java.lang.IllegalStateException: Expected the service ClusterSchemaServiceImpl [FAILED] to be TERMINATED, but the service has FAILED
  at org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.checkCurrentState(AbstractService.java:345)
  at org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.awaitTerminated(AbstractService.java:318)
  at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:576)
  at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.DoNotRetryIOException: hconnection-0x7d85155 closed
  at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:722)
  at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.locateRegion(ConnectionUtils.java:131)
  at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:714)
  at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.locateRegion(ConnectionUtils.java:131)
  at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:684)
  at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.locateRegion(ConnectionUtils.java:131)
  at org.apache.hadoop.hbase.client.ConnectionImplementation.getRegionLocation(ConnectionImplementation.java:562)
  at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.getRegionLocation(ConnectionUtils.java:131)
  at org.apache.hadoop.hbase.client.HRegionLocator.getRegionLocation(HRegionLocator.java:73)
  at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:223)
  at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105)
  at org.apache.hadoop.hbase.client.HTable.get(HTable.java:388)
  at org.apache.hadoop.hbase.client.HTable.get(HTable.java:362)
  at org.apache.hadoop.hbase.MetaTableAccessor.getTableState(MetaTableAccessor.java:1117)
  at org.apache.hadoop.hbase.MetaTableAccessor.tableExists(MetaTableAccessor.java:427)
  at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:93)
  at org.apache.hadoop.hbase.master.ClusterSchemaServiceImpl.doStart(ClusterSchemaServiceImpl.java:62)
  at org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.startAsync(AbstractService.java:226)
  at org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1062)
  at org.apache.hadoop.hbase.master.TestShutdownBackupMaster$MockHMaster.initClusterSchemaService(TestShutdownBackupMaster.java:67)
  at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:924)
  at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2026)
  at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:557)
  ... 1 more
{code}

Studying the use of clusterConnection in the regionserver and master, calling close to kill any ongoing RPCs seems to be what we want. Shutdown in the 'normal case' doesn't seem to change and runs 'normally'.
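The effect described in this comment, where closing a shared connection makes a stuck RPC fail so the thread can exit, can be illustrated with a small self-contained sketch. Everything here is an assumption for illustration: ToyConnection, locate and runDemo are hypothetical names and do not mirror the HBase client API or the actual patch.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

class ShutdownByCloseSketch {

  /** Toy stand-in for a shared cluster connection; NOT the HBase API. */
  static final class ToyConnection implements AutoCloseable {
    private final AtomicBoolean closed = new AtomicBoolean(false);

    /** Simulates an RPC to a server that is gone: it retries until the connection is closed. */
    String locate(String row) throws IOException, InterruptedException {
      while (true) {
        if (closed.get()) {
          throw new IOException("connection closed"); // fail fast instead of retrying
        }
        TimeUnit.MILLISECONDS.sleep(10); // pretend the attempt timed out, then retry
      }
    }

    @Override
    public void close() {
      closed.set(true);
    }
  }

  /** Starts a thread stuck in an RPC, closes the connection, and reports how the thread ended. */
  static String runDemo() {
    ToyConnection conn = new ToyConnection();
    AtomicReference<String> outcome = new AtomicReference<>("still running");
    Thread init = new Thread(() -> {
      try {
        conn.locate("some-row");
        outcome.set("located");
      } catch (IOException e) {
        outcome.set("failed: " + e.getMessage());
      } catch (InterruptedException e) {
        outcome.set("interrupted");
      }
    }, "toy-master-init");
    init.start();
    try {
      TimeUnit.MILLISECONDS.sleep(50); // let the thread get stuck in the retry loop
      conn.close();                    // what a shutdown hook would do
      init.join(5000);                 // returns quickly once the RPC has failed
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return init.isAlive() ? "hung" : outcome.get();
  }

  public static void main(String[] args) {
    System.out.println(runDemo());
  }
}
```

Without the close() call the toy thread would retry forever and the join would time out, which is the shape of the hang this issue describes.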
[jira] [Commented] (HBASE-19838) Can not shutdown backup master cleanly when it has already tried to become the active master
[ https://issues.apache.org/jira/browse/HBASE-19838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333976#comment-16333976 ]

stack commented on HBASE-19838:

{quote}And for the rpc, I think we need to find a way to close the ConnectionImplementation from outside so that all rpc will fail soon. {quote}

Agreed. I'd opened HBASE-19834. Will give it a go...
[jira] [Commented] (HBASE-19838) Can not shutdown backup master cleanly when it has already tried to become the active master
[ https://issues.apache.org/jira/browse/HBASE-19838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333957#comment-16333957 ]

Duo Zhang commented on HBASE-19838:

{quote} Nice test. What you thinking? Add this and change TestZooKeeper? {quote}

Yes. Let's fix TestZooKeeper first. It is not designed to test this problem, and it does not fail stably, so it is also not a good UT for debugging.

And for the rpc, I think we need to find a way to close the ConnectionImplementation from outside so that all rpc will fail soon.
[jira] [Commented] (HBASE-19838) Can not shutdown backup master cleanly when it has already tried to become the active master
[ https://issues.apache.org/jira/browse/HBASE-19838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333954#comment-16333954 ]

stack commented on HBASE-19838:

Nice test. What you thinking? Add this and change TestZooKeeper? We need to do the same for all places where master startup does an RPC to a region or server.
[jira] [Commented] (HBASE-19838) Can not shutdown backup master cleanly when it has already tried to become the active master
[ https://issues.apache.org/jira/browse/HBASE-19838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333943#comment-16333943 ]

Duo Zhang commented on HBASE-19838:

[~stack] FYI. This UT fails stably for me. It uses a mocked HMaster implementation to hold in initClusterSchemaService, then we kill all the RSes and let it go on. The UT will hang in HBaseTestingUtility.shutdownMiniCluster.
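The hold-and-release approach described in this comment can be sketched with two latches. This is only an illustration under assumptions: ToyMaster and HeldMaster are hypothetical stand-ins, not HMaster or the MockHMaster from the attached patch, and "region servers stopped" merely marks the point where a real test would kill the RSes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

class HoldInInitSketch {

  /** Toy stand-in for a master; only the overridable init hook matters here. */
  static class ToyMaster {
    final List<String> events = Collections.synchronizedList(new ArrayList<>());

    void initClusterSchemaService() {
      events.add("init schema service");
    }

    void becomeActive() {
      events.add("became active");
      initClusterSchemaService();
    }
  }

  /** Test double that parks inside the hook until the test releases it. */
  static final class HeldMaster extends ToyMaster {
    final CountDownLatch arrived = new CountDownLatch(1);
    final CountDownLatch release = new CountDownLatch(1);

    @Override
    void initClusterSchemaService() {
      arrived.countDown(); // tell the test we reached the hook
      try {
        release.await();   // hold here until the test says go
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      super.initClusterSchemaService();
    }
  }

  /** Runs the hold, change-the-world, release sequence and returns the observed event order. */
  static List<String> runDemo() {
    HeldMaster master = new HeldMaster();
    Thread t = new Thread(master::becomeActive, "toy-master");
    t.start();
    try {
      master.arrived.await();                      // master is now parked in the hook
      master.events.add("region servers stopped"); // the test changes the cluster state here
      master.release.countDown();                  // let initialization continue
      t.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return new ArrayList<>(master.events);
  }

  public static void main(String[] args) {
    System.out.println(runDemo());
  }
}
```

Using latches instead of sleeps makes the ordering deterministic, which is what lets such a test fail the same way on every run.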