[ https://issues.apache.org/jira/browse/HBASE-22601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16912007#comment-16912007 ]
Hadoop QA commented on HBASE-22601:
-----------------------------------

+1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 1m 3s | Docker mode activated. |
|| Prechecks ||
| +1 | dupname | 0m 0s | No case conflicting files found. |
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -0 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch, and list the manual steps performed to verify it. |
|| branch-2 Compile Tests ||
| +1 | mvninstall | 5m 35s | branch-2 passed |
| +1 | compile | 0m 54s | branch-2 passed |
| +1 | checkstyle | 1m 16s | branch-2 passed |
| +1 | shadedjars | 4m 10s | branch has no errors when building our shaded downstream artifacts. |
| +1 | javadoc | 0m 36s | branch-2 passed |
| 0 | spotbugs | 3m 24s | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 | findbugs | 3m 22s | branch-2 passed |
|| Patch Compile Tests ||
| +1 | mvninstall | 5m 4s | the patch passed |
| +1 | compile | 0m 54s | the patch passed |
| +1 | javac | 0m 54s | the patch passed |
| +1 | checkstyle | 1m 15s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 4m 13s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 15m 40s | Patch does not cause any errors with Hadoop 2.8.5, 2.9.2, or 3.1.2. |
| +1 | javadoc | 0m 34s | the patch passed |
| +1 | findbugs | 3m 28s | the patch passed |
|| Other Tests ||
| +1 | unit | 274m 20s | hbase-server in the patch passed. |
| +1 | asflicense | 0m 37s | The patch does not generate ASF License warnings. |
| | | 329m 12s | |

|| Subsystem || Report/Notes ||
| Docker | Client=19.03.1 Server=19.03.1 base: https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-500/16/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hbase/pull/500 |
| JIRA Issue | HBASE-22601 |
| Optional Tests | dupname asflicense javac javadoc unit spotbugs findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 35615e1f549a 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/HBase-PreCommit-GitHub-PR_PR-500/out/precommit/personality/provided.sh |
| git revision | branch-2 / 9d178cf575 |
| Default Java | 1.8.0_181 |
| Test Results | https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-500/16/testReport/ |
| Max. process+thread count | 4522 (vs. ulimit of 10000) |
| modules | C: hbase-server U: hbase-server |
| Console output | https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-500/16/console |
| versions | git=2.11.0 maven=2018-06-17T18:33:14Z findbugs=3.1.11 |
| Powered by | Apache Yetus 0.10.0 http://yetus.apache.org |

This message was automatically generated.

> Misconfigured addition of peers leads to cluster shutdown.
> ----------------------------------------------------------
>
>                 Key: HBASE-22601
>                 URL: https://issues.apache.org/jira/browse/HBASE-22601
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.3.2
>            Reporter: Rushabh S Shah
>            Assignee: Rushabh S Shah
>            Priority: Major
>
> Recently we added a peer, which was in a different Kerberos realm, to a production cluster.
>
> *Steps to reproduce:*
> 1.
> Add a misconfigured peer which is in a different Kerberos realm.
> 2. Remove that peer.
> 3. All region servers will start to crash.
>
> *RCA*
> Enabled trace logging on one region server for a short amount of time. After adding the peer, we saw the following log lines:
> {noformat}
> 2019-06-18 22:19:20,949 INFO [main-EventThread] replication.ReplicationTrackerZKImpl - /hbase/replication/peers znode expired, triggering peerListChanged event
> 2019-06-18 22:19:20,992 INFO [main-EventThread] replication.ReplicationPeersZKImpl - Added new peer cluster=<dev-cluster-zookeeper>:/hbase
> 2019-06-18 22:19:21,113 INFO [main-EventThread] zookeeper.RecoverableZooKeeper - Process identifier=hconnection-0x794a56d6 connecting to ZooKeeper ensemble=<dev-cluster-zookeeper>
> 2019-06-18 22:20:01,280 WARN [main-EventThread] zookeeper.ZKUtil - hconnection-0x794a56d6-0x16b56265fbebb1b, quorum=<dev-cluster-zookeeper>, baseZNode=/hbase Unable to set watcher on znode (/hbase/hbaseid)
> org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed for /hbase/hbaseid
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:123)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1102)
>     at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:220)
>     at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:421)
>     at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
>     at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:105)
>     at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.retrieveClusterId(ConnectionManager.java:922)
>     at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.<init>(ConnectionManager.java:706)
>     at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.<init>(ConnectionManager.java:638)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>     at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:238)
>     at org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:432)
>     at org.apache.hadoop.hbase.client.ConnectionManager.createConnectionInternal(ConnectionManager.java:341)
>     at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:144)
>     at org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.init(HBaseInterClusterReplicationEndpoint.java:135)
>     at com.salesforce.hbase.replication.TenantReplicationEndpoint.init(TenantReplicationEndpoint.java:30)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.getReplicationSource(ReplicationSourceManager.java:517)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.addSource(ReplicationSourceManager.java:273)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.peerListChanged(ReplicationSourceManager.java:635)
>     at org.apache.hadoop.hbase.replication.ReplicationTrackerZKImpl$PeersWatcher.nodeChildrenChanged(ReplicationTrackerZKImpl.java:192)
>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:643)
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:544)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:519)
> 2019-06-18 22:20:42,999 WARN [Source,<dev-cluster>] zookeeper.ZKUtil - connection to cluster: <dev-cluster>-0x26b56265fe7b5cd, quorum=<dev-cluster-zookeeper>, baseZNode=/hbase Unable to set watcher on znode (/hbase/hbaseid)
> org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed for /hbase/hbaseid
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:123)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1102)
>     at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:220)
>     at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:421)
>     at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
>     at org.apache.hadoop.hbase.zookeeper.ZKClusterId.getUUIDForCluster(ZKClusterId.java:96)
>     at org.apache.hadoop.hbase.replication.HBaseReplicationEndpoint.getPeerUUID(HBaseReplicationEndpoint.java:104)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:304)
> 2019-06-18 22:20:43,002 TRACE [Source,<dev-cluster>] regionserver.ReplicationSource - Cannot contact the peer's zk ensemble, sleeping 1000 times 1
> <same stack trace as before>
> 2019-06-18 22:20:44,008 TRACE [Source,<dev-cluster>] regionserver.ReplicationSource - Cannot contact the peer's zk ensemble, sleeping 1000 times 2
> <same stack trace as before>
> {noformat}
> This goes on and on until we removed the peer.
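The "sleeping 1000 times N" lines above come from `ReplicationSource.sleepForRetries`, which sleeps for a base interval multiplied by a growing multiplier that is capped at a maximum. The class below is a minimal standalone model of that pattern, not the actual HBase code; the constructor parameters and small values in `main` are illustrative only:

```java
// Standalone model of the capped-multiplier retry sleep seen in the logs:
// "Cannot contact the peer's zk ensemble, sleeping 1000 times N".
public class RetryBackoff {
    private final long sleepForRetries;      // base sleep in ms (1000 in the logs above)
    private final int maxRetriesMultiplier;  // cap on the multiplier

    public RetryBackoff(long sleepForRetries, int maxRetriesMultiplier) {
        this.sleepForRetries = sleepForRetries;
        this.maxRetriesMultiplier = maxRetriesMultiplier;
    }

    // Sleeps base * multiplier ms and returns true while the multiplier may
    // still grow, so the caller knows whether to increment it.
    public boolean sleepForRetries(String msg, int sleepMultiplier) throws InterruptedException {
        System.out.println(msg + ", sleeping " + sleepForRetries + " times " + sleepMultiplier);
        Thread.sleep(sleepForRetries * sleepMultiplier); // blocks the source thread
        return sleepMultiplier < maxRetriesMultiplier;
    }

    public static void main(String[] args) throws InterruptedException {
        // Tiny values so the demo finishes quickly; the real defaults are much larger.
        RetryBackoff backoff = new RetryBackoff(10, 3);
        int sleepMultiplier = 1;
        for (int attempt = 0; attempt < 5; attempt++) {
            if (backoff.sleepForRetries("Cannot contact the peer's zk ensemble", sleepMultiplier)) {
                sleepMultiplier++;
            }
        }
        System.out.println("final multiplier=" + sleepMultiplier);
    }
}
```

The key point for this bug: the loop never gives up on its own; the cap only bounds the sleep length, so a source stuck on an unreachable peer keeps retrying until something external stops it.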
> After removing the peer:
> {noformat}
> 2019-06-18 22:21:20,731 INFO [main-EventThread] replication.ReplicationTrackerZKImpl - /hbase/replication/peers/<dev-cluster> znode expired, triggering peerRemoved event
> 2019-06-18 22:21:20,731 INFO [main-EventThread] regionserver.ReplicationSourceManager - Closing the following queue <dev-cluster>, currently have 2 and another 0 that were recovered
> 2019-06-18 22:21:20,733 INFO [main-EventThread] regionserver.ReplicationSourceManager - Number of deleted recovered sources for <dev-cluster>: 0
> 2019-06-18 22:21:20,734 INFO [main-EventThread] regionserver.ReplicationSource - Closing source <dev-cluster> because: Replication stream was removed by a user
> 2019-06-18 22:21:20,734 INFO [main-EventThread] replication.TenantReplicationEndpoint - Stopping endpoint
> 2019-06-18 22:21:20,736 INFO [main-EventThread] client.ConnectionManager$HConnectionImplementation - Closing zookeeper sessionid=0x16b56265fbebb1b
> 2019-06-18 22:21:20,736 DEBUG [main-EventThread] ipc.RpcClientImpl - Stopping rpc client
> 2019-06-18 22:21:20,738 INFO [main-EventThread] regionserver.ReplicationSourceManager - Done with the queue <dev-cluster>
> 2019-06-18 22:21:20,744 DEBUG [main-EventThread] replication.ReplicationQueuesZKImpl - Peer /hbase/replication/hfile-refs/<dev-cluster> not found in hfile reference queue.
> 2019-06-18 22:21:28,066 INFO [Source,<dev-cluster>] regionserver.ReplicationSource - Replicating d12a5eb3-16bd-4910-91ed-41e0e7990007 -> null
> 2019-06-18 22:21:28,067 DEBUG [Source,<dev-cluster>] regionserver.ReplicationSource - Starting up worker for wal group <rs-server-name>%2C60020%2C1560896050398
> 2019-06-18 22:21:28,069 ERROR [050398,<dev-cluster>] regionserver.ReplicationSource - Unexpected exception in ReplicationSourceWorkerThread, currentPath=null
> java.lang.IllegalArgumentException: Peer with id= <dev-cluster> is not connected
>     at org.apache.hadoop.hbase.replication.ReplicationPeersZKImpl.getStatusOfPeer(ReplicationPeersZKImpl.java:217)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.isPeerEnabled(ReplicationSource.java:363)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:549)
> 2019-06-18 22:21:28,070 INFO [050398,<dev-cluster>] regionserver.HRegionServer - STOPPED: Unexpected exception in ReplicationSourceWorkerThread
> 2019-06-18 22:21:28,071 INFO [/10.231.90.212:60020] regionserver.SplitLogWorker - Sending interrupt to stop the worker thread
> 2019-06-18 22:21:28,073 INFO [/10.231.90.212:60020] regionserver.HRegionServer - Stopping infoServer
> 2019-06-18 22:21:28,075 INFO [as-dnds4-3-prd:60020] regionserver.SplitLogWorker - SplitLogWorker interrupted. Exiting.
> 2019-06-18 22:21:28,076 INFO [as-dnds4-3-prd:60020] regionserver.SplitLogWorker - SplitLogWorker <rs-server-name>,60020,1560896050398 exiting
> 2019-06-18 22:21:28,181 INFO [/10.231.90.212:60020] regionserver.HeapMemoryManager - Stopping HeapMemoryTuner chore.
> 2019-06-18 22:21:28,181 INFO [/10.231.90.212:60020] flush.RegionServerFlushTableProcedureManager - Stopping region server flush procedure manager gracefully.
> {noformat}
> I also verified from jstacks captured between the addition and the removal of the peer that the replication source thread was sleeping, waiting to connect to the peer:
> {noformat}
> "main-EventThread.replicationSource,<dev-cluster>" #1923 daemon prio=5 os_prio=0 tid=0x00007f1ad18f4800 nid=0x25a47 waiting on condition [0x00007f1220314000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>     at java.lang.Thread.sleep(Native Method)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.sleepForRetries(ReplicationSource.java:349)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:306)
> {noformat}
> As far as I can tell, the bug lies here:
> {code:java|title=ReplicationSource.java|borderStyle=solid}
>   // Some comments here
>   public void run() {
>     ....
>     while (this.isSourceActive() && this.peerClusterId == null) {
>       this.peerClusterId = replicationEndpoint.getPeerUUID();
>       if (this.isSourceActive() && this.peerClusterId == null) {
>         if (sleepForRetries("Cannot contact the peer's zk ensemble", sleepMultiplier)) {
>           sleepMultiplier++;
>         }
>       }
>     }
>
>     // BUG: here we should check that peerClusterId is not null. If it is
>     // still null (e.g. the peer was removed while we were retrying), we
>     // should terminate the thread instead of falling through.
>
>     // In rare case, zookeeper setting may be messed up. That leads to the incorrect
>     // peerClusterId value, which is the same as the source clusterId
>     if (clusterId.equals(peerClusterId) && !replicationEndpoint.canReplicateToSameCluster()) {
>       this.terminate("ClusterId " + clusterId + " is replicating to itself: peerClusterId "
>           + peerClusterId + " which is not allowed by ReplicationEndpoint:"
>           + replicationEndpoint.getClass().getName(), null, false);
>       this.manager.closeQueue(this);
>       return;
>     }
>     ....
>   }
> {code}

-- 
This message was sent by Atlassian Jira
(v8.3.2#803003)
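The missing null check described above can be sketched standalone. The class below models the shape of `ReplicationSource.run` only; the class name, the `Supplier` standing in for `replicationEndpoint.getPeerUUID()`, and the termination message are all hypothetical, not the actual HBase code:

```java
import java.util.UUID;
import java.util.function.Supplier;

// Models the bug: ReplicationSource.run retries until it learns the peer's
// cluster id, but if the source is deactivated (peer removed) while the id is
// still null, it must terminate instead of falling through to worker threads
// that will throw "Peer with id=... is not connected" and abort the RS.
public class PeerIdLoop {
    private volatile boolean sourceActive = true;
    private UUID peerClusterId;
    private String terminationReason;

    public void deactivate() { sourceActive = false; }       // simulates peer removal
    public boolean isSourceActive() { return sourceActive; }
    public String getTerminationReason() { return terminationReason; }

    // peerUuidFetcher stands in for replicationEndpoint.getPeerUUID(), which
    // returns null while the peer's ZK ensemble is unreachable.
    public void run(Supplier<UUID> peerUuidFetcher) {
        while (isSourceActive() && peerClusterId == null) {
            peerClusterId = peerUuidFetcher.get();
            // (the real code sleeps here with a growing, capped multiplier)
        }
        // The proposed fix: peerClusterId can still be null if the source was
        // stopped while retrying; terminate cleanly instead of falling through.
        if (peerClusterId == null) {
            terminationReason = "Could not determine the peer's cluster id; terminating source";
            return;
        }
        // ... fall through to the self-replication check and worker threads ...
    }

    public static void main(String[] args) {
        PeerIdLoop source = new PeerIdLoop();
        source.deactivate();       // the misconfigured peer was removed
        source.run(() -> null);    // the peer's ZK ensemble never answered
        System.out.println(source.getTerminationReason());
    }
}
```

Without the null check, a null `peerClusterId` slips past `clusterId.equals(peerClusterId)` (which is simply false for null) and the worker threads start anyway, reproducing the `IllegalArgumentException` and region server abort shown in the logs.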