[ 
https://issues.apache.org/jira/browse/HBASE-24120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17076637#comment-17076637
 ] 

Huaxiang Sun commented on HBASE-24120:
--------------------------------------

Root cause analysis:

When a peer is removed, the ReplicationSourceShipper thread may be interrupted.

In that case interruptOrAbortWhenFail throws a RuntimeException that nothing 
catches, so the Region Server aborts. Once the Region Server has aborted, the 
test times out.
{code:java}
2020-04-03 04:20:47,336 ERROR 
[RS_REFRESH_PEER-regionserver/asf905:0-0.replicationSource,2.replicationSource.shipperasf905.gq1.ygridcore.net%2C38191%2C1585887556682,2]
 regionserver.ReplicationSource(397): Unexpected exception in 
RS_REFRESH_PEER-regionserver/asf905:0-0.replicationSource,2.replicationSource.shipperasf905.gq1.ygridcore.net%2C38191%2C1585887556682,2
 
currentPath=hdfs://localhost:39273/user/jenkins/test-data/2c4f98d9-b93f-6b0f-5e2d-7587de42e316/WALs/asf905.gq1.ygridcore.net,38191,1585887556682/asf905.gq1.ygridcore.net%2C38191%2C1585887556682.1585887560249
java.lang.RuntimeException: Thread is interrupted, the replication source may 
be terminated
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.interruptOrAbortWhenFail(ReplicationSourceManager.java:477)
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:519)
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.updateLogPosition(ReplicationSourceShipper.java:264)
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.shipEdits(ReplicationSourceShipper.java:160)
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.run(ReplicationSourceShipper.java:118)
2020-04-03 04:20:47,340 ERROR 
[RS_REFRESH_PEER-regionserver/asf905:0-0.replicationSource,2.replicationSource.shipperasf905.gq1.ygridcore.net%2C38191%2C1585887556682,2]
 helpers.MarkerIgnoringBase(159): ***** ABORTING region server 
asf905.gq1.ygridcore.net,38191,1585887556682: Unexpected exception in 
RS_REFRESH_PEER-regionserver/asf905:0-0.replicationSource,2.replicationSource.shipperasf905.gq1.ygridcore.net%2C38191%2C1585887556682,2
 *****
java.lang.RuntimeException: Thread is interrupted, the replication source may 
be terminated
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.interruptOrAbortWhenFail(ReplicationSourceManager.java:477)
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:519)
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.updateLogPosition(ReplicationSourceShipper.java:264)
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.shipEdits(ReplicationSourceShipper.java:160)
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.run(ReplicationSourceShipper.java:118)
2020-04-03 04:20:47,341 ERROR 
[RS_REFRESH_PEER-regionserver/asf905:0-0.replicationSource,2.replicationSource.shipperasf905.gq1.ygridcore.net%2C38191%2C1585887556682,2]
 helpers.MarkerIgnoringBase(143): RegionServer abort: loaded coprocessors are: 
[org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
2020-04-03 04:20:47,349 INFO  
[RS_REFRESH_PEER-regionserver/asf905:0-0.replicationSource,2.replicationSource.shipperasf905.gq1.ygridcore.net%2C38191%2C1585887556682,2]
 regionserver.HRegionServer(2472): Dump of metrics as JSON on abort: { {code}
 

If I read HBASE-20561 correctly, its intent is to avoid the abort. The fix is 
not to throw a RuntimeException; instead, the code only needs to log the 
interruption and let the main run loop handle it cleanly (stop the thread).
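
To make the difference concrete, here is a minimal, self-contained sketch of the two behaviours. It is not the HBase code: InterruptHandlingSketch, FakeServer, handleFailure and runShipper are hypothetical names made up for illustration, and the abort path is modelled with a plain Thread uncaught-exception handler on the assumption that this is roughly how the "Unexpected exception in ..." abort in the log above is triggered.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class InterruptHandlingSketch {

  /** Hypothetical stand-in for the region server; only records whether abort was called. */
  static final class FakeServer {
    final AtomicBoolean aborted = new AtomicBoolean(false);

    void abort(String why, Throwable cause) {
      aborted.set(true);
    }
  }

  /**
   * Models the failure handling after the shipper is interrupted.
   * throwOnInterrupt=true mimics the behaviour in the stack trace (RuntimeException);
   * throwOnInterrupt=false mimics the proposed fix (log and let the run loop stop).
   */
  static void handleFailure(boolean throwOnInterrupt, InterruptedException cause) {
    if (throwOnInterrupt) {
      throw new RuntimeException(
          "Thread is interrupted, the replication source may be terminated", cause);
    }
    System.out.println("Interrupted while updating log position; stopping shipper");
    // Restore the interrupt flag so callers can still observe the interruption.
    Thread.currentThread().interrupt();
  }

  /** Runs one shipper-like thread, interrupts it, and reports whether the server aborted. */
  static boolean runShipper(boolean throwOnInterrupt) {
    FakeServer server = new FakeServer();
    Thread shipper = new Thread(() -> {
      while (true) {
        try {
          Thread.sleep(10_000); // stands in for blocking replication work
        } catch (InterruptedException e) {
          handleFailure(throwOnInterrupt, e);
          return; // the run loop stops the thread once the failure was only logged
        }
      }
    }, "shipper-sketch");
    // Assumption: any exception escaping run() is treated as fatal and aborts the server.
    shipper.setUncaughtExceptionHandler(
        (t, e) -> server.abort("Unexpected exception in " + t.getName(), e));
    shipper.start();
    shipper.interrupt(); // models the interrupt caused by removing the peer
    try {
      shipper.join();
    } catch (InterruptedException e) {
      throw new IllegalStateException("Interrupted while waiting for the sketch thread", e);
    }
    return server.aborted.get();
  }

  public static void main(String[] args) {
    System.out.println("throwing variant aborted: " + runShipper(true));
    System.out.println("logging variant aborted: " + runShipper(false));
  }
}
```

In this sketch the throwing variant reaches the uncaught-exception handler and marks the fake server as aborted, while the logging variant just ends the thread. The real patch may be structured differently.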

> Flakey Test: TestReplicationAdminWithClusters timeout 
> ------------------------------------------------------
>
>                 Key: HBASE-24120
>                 URL: https://issues.apache.org/jira/browse/HBASE-24120
>             Project: HBase
>          Issue Type: Test
>          Components: Replication
>    Affects Versions: 2.3.0, master, 2.4.0
>            Reporter: Huaxiang Sun
>            Assignee: Hua Xiang
>            Priority: Major
>
> {code:java}
> 2020-04-05 23:36:53,092 ERROR 
> [RS_REFRESH_PEER-regionserver/asf905:0-0.replicationSource,2.replicationSource.shipperasf905.gq1.ygridcore.net%2C42849%2C1586129728118,2]
>  regionserver.ReplicationSource(397): Unexpected exception in 
> RS_REFRESH_PEER-regionserver/asf905:0-0.replicationSource,2.replicationSource.shipperasf905.gq1.ygridcore.net%2C42849%2C1586129728118,2
>  
> currentPath=hdfs://localhost:34203/user/jenkins/test-data/03854f9d-2780-eeaa-9645-c341240b62bf/WALs/asf905.gq1.ygridcore.net,42849,1586129728118/asf905.gq1.ygridcore.net%2C42849%2C1586129728118.1586129730509
> java.lang.RuntimeException: Thread is interrupted, the replication source may 
> be terminated
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.interruptOrAbortWhenFail(ReplicationSourceManager.java:477)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:519)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.updateLogPosition(ReplicationSourceShipper.java:264)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.shipEdits(ReplicationSourceShipper.java:160)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.run(ReplicationSourceShipper.java:118)
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)