[ 
https://issues.apache.org/jira/browse/HBASE-26963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah reassigned HBASE-26963:
------------------------------------

    Assignee: Rushabh Shah

> ReplicationSource#removePeer hangs if we try to remove bad peer.
> ----------------------------------------------------------------
>
>                 Key: HBASE-26963
>                 URL: https://issues.apache.org/jira/browse/HBASE-26963
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver, Replication
>    Affects Versions: 2.4.11
>            Reporter: Rushabh Shah
>            Assignee: Rushabh Shah
>            Priority: Major
>         Attachments: HBASE-26963.patch
>
>
> ReplicationSource#removePeer hangs if we try to remove bad peer.
> Steps to reproduce:
> 1. Set config replication.source.regionserver.abort to false so that it 
> doesn't abort regionserver.
> 2. Add a dummy peer.
> 3. Remove that peer.
> The removePeer call hangs until the test times out.
> I have attached a patch that reproduces this behavior.
> I can see the following threads in the stack trace:
> {noformat}
> "RS_REFRESH_PEER-regionserver/rushabh-ltmflld:0-0.replicationSource,dummypeer_1"
>  #339 daemon prio=5 os_prio=31 tid=0x00007f8caa44a800 nid=0x22107 waiting on 
> condition [0x00007000107e5000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.sleepForRetries(ReplicationSource.java:511)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.initialize(ReplicationSource.java:577)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.lambda$startup$4(ReplicationSource.java:633)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$$Lambda$350/89698794.uncaughtException(Unknown
>  Source)
>         at java.lang.Thread.dispatchUncaughtException(Thread.java:1959)
> {noformat}
> {noformat}
> "RS_REFRESH_PEER-regionserver/rushabh-ltmflld:0-0" #338 daemon prio=5 
> os_prio=31 tid=0x00007f8ca82fa800 nid=0x22307 in Object.wait() 
> [0x00007000106e2000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Thread.join(Thread.java:1260)
>         - locked <0x0000000799975ea0> (a java.lang.Thread)
>         at org.apache.hadoop.hbase.util.Threads.shutdown(Threads.java:106)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.terminate(ReplicationSource.java:674)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.terminate(ReplicationSource.java:657)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.terminate(ReplicationSource.java:652)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.terminate(ReplicationSource.java:647)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.removePeer(ReplicationSourceManager.java:330)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.PeerProcedureHandlerImpl.removePeer(PeerProcedureHandlerImpl.java:56)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:61)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:35)
>         at 
> org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler.process(RSProcedureHandler.java:49)
>         at 
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}
> {noformat}
> "Listener at localhost/55013" #20 daemon prio=5 os_prio=31 
> tid=0x00007f8caf95a000 nid=0x6703 waiting on condition 
> [0x0000700002544000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at 
> org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.waitProcedureResult(HBaseAdmin.java:3442)
>         at 
> org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.get(HBaseAdmin.java:3372)
>         at org.apache.hadoop.hbase.util.FutureUtils.get(FutureUtils.java:182)
>         at 
> org.apache.hadoop.hbase.client.Admin.removeReplicationPeer(Admin.java:2861)
>         at 
> org.apache.hadoop.hbase.client.replication.TestBadReplicationPeer.cleanPeer(TestBadReplicationPeer.java:74)
>         at 
> org.apache.hadoop.hbase.client.replication.TestBadReplicationPeer.testWrongReplicationEndpoint(TestBadReplicationPeer.java:66)
> {noformat}
> The main thread "TestBadReplicationPeer.testWrongReplicationEndpoint" is 
> waiting for Admin#removeReplicationPeer.
> The refreshPeer thread (PeerProcedureHandlerImpl#removePeer, #338), which is 
> responsible for terminating the peer, is waiting for the ReplicationSource 
> thread to terminate.
> The ReplicationSource thread (#339) is sleeping. Notice that this thread's 
> stack trace is inside the ReplicationSource#uncaughtException method.
> When we call ReplicationSourceManager#removePeer, we set the sourceRunning 
> flag to false and send an interrupt to the ReplicationSource thread 
> [here|https://github.com/apache/hbase/blob/branch-2.4/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java#L668-L674].
>  In this case, ReplicationSource was waiting to read the cluster id of the 
> peer when it received an InterruptedException.
> {noformat}
> 2022-04-20 08:46:49,679 WARN  
> [RS_REFRESH_PEER-regionserver/rushabh-ltmflld:0-0.replicationSource,dummypeer_1]
>  zookeeper.ZKUtil(228): connection to cluster: dummypeer_1-0x100229efa200009, 
> quorum=127.0.0.1:55599, baseZNode=/1 Unable to set watcher on znode 
> (/1/hbaseid)
> java.lang.InterruptedException
>       at java.lang.Object.wait(Native Method)
>       at java.lang.Object.wait(Object.java:502)
>       at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1529)
>       at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1512)
>       at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2016)
>       at 
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:212)
>       at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:221)
>       at 
> org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
>       at 
> org.apache.hadoop.hbase.zookeeper.ZKClusterId.getUUIDForCluster(ZKClusterId.java:96)
>       at 
> org.apache.hadoop.hbase.replication.HBaseReplicationEndpoint.getPeerUUID(HBaseReplicationEndpoint.java:112)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.initialize(ReplicationSource.java:571)
>       at java.lang.Thread.run(Thread.java:748)
> {noformat}
> [ZKClusterId.readClusterIdZNode|https://github.com/apache/hbase/blob/branch-2.4/hbase-zookeeper/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKClusterId.java#L69-L72]
>   catches InterruptedException and returns null.
> ReplicationSource then sees that the sourceRunning flag is set to false and 
> throws IllegalStateException 
> [here|https://github.com/apache/hbase/blob/branch-2.4/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java#L561-L565].
> Control then passes to the 
> [UncaughtExceptionHandler|https://github.com/apache/hbase/blob/branch-2.4/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java#L620-L640]
>  and, since abortOnError is set to false, the handler goes into an infinite 
> sleep, causing the test to hang.
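
The sequence described above can be sketched in plain Java without any HBase classes. This is a minimal illustration, not the ReplicationSource code: the class name HandlerHangSketch, the method workerAliveAfterTerminate, and the local sourceRunning flag are all hypothetical stand-ins, and a Thread.sleep stands in for the blocking ZooKeeper read. It relies only on the fact that an uncaught exception handler runs on the dying thread itself, so a handler that never returns keeps the thread alive and a join on it never completes.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; none of these names come from HBase.
final class HandlerHangSketch {

    /**
     * Starts a worker, then "terminates" it the way the description outlines:
     * clear the running flag, interrupt, join. Returns true if the worker is
     * still alive after waiting up to joinMillis.
     */
    static boolean workerAliveAfterTerminate(boolean retryInHandler, long joinMillis) {
        AtomicBoolean sourceRunning = new AtomicBoolean(true);
        CountDownLatch started = new CountDownLatch(1);

        Thread worker = new Thread(() -> {
            started.countDown();
            try {
                Thread.sleep(60_000L); // stand-in for the blocking cluster id read
            } catch (InterruptedException e) {
                // Interrupt is swallowed here, so the caller only sees "no result".
            }
            if (!sourceRunning.get()) {
                throw new IllegalStateException("source is not running");
            }
        }, "replicationSource-sketch");
        worker.setDaemon(true); // lets the JVM exit even if the worker never dies

        // The handler is invoked on the worker thread itself.
        worker.setUncaughtExceptionHandler((t, e) -> {
            if (!retryInHandler) {
                return; // assumed "well-behaved" variant: handler returns, thread dies
            }
            // Assumed "retry" variant: loops without checking the running flag.
            while (true) {
                try {
                    Thread.sleep(50L);
                } catch (InterruptedException ignored) {
                    // keep retrying
                }
            }
        });

        worker.start();
        try {
            started.await();
            sourceRunning.set(false); // flag cleared first
            worker.interrupt();       // then the interrupt is sent
            worker.join(joinMillis);  // bounded wait so the sketch itself cannot hang
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("handler retries: alive=" + workerAliveAfterTerminate(true, 1000L));
        System.out.println("handler returns: alive=" + workerAliveAfterTerminate(false, 5000L));
    }
}
```

With the retrying handler the worker stays alive past the join timeout, which corresponds to thread #338 waiting in Thread.join. If the handler checked the running flag (or simply returned when the source is no longer running), the join would complete; whether that is the right fix for ReplicationSource is not something this sketch establishes.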



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
