[ 
https://issues.apache.org/jira/browse/HBASE-21518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703332#comment-16703332
 ] 

Peter Somogyi commented on HBASE-21518:
---------------------------------------

ServerManager#expireServer checks whether the cluster is shutting down and, if 
so, does not create a ServerCrashProcedure for dead servers. The AtomicBoolean 
clusterShutdown flag is set to true when ServerManager#shutdownCluster is 
called. However, there are two ServerManager instances, and expireServer checks 
the flag on a different instance from the one on which it was set.
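
As a minimal sketch of this failure mode (not the actual HBase code): the class 
below is a hypothetical stand-in for ServerManager that models only the 
shutdown flag. All names other than clusterShutdown, shutdownCluster, 
isClusterShutdown and expireServer are assumptions, and expireServer here just 
returns whether a crash procedure would be scheduled.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class TwoManagersSketch {
    /** Hypothetical stand-in for ServerManager; only the shutdown flag is modeled. */
    static class SketchServerManager {
        // Per-instance flag: setting it on one instance does not affect another.
        private final AtomicBoolean clusterShutdown = new AtomicBoolean(false);

        void shutdownCluster() {
            clusterShutdown.set(true);
        }

        boolean isClusterShutdown() {
            return clusterShutdown.get();
        }

        /** Returns true if a crash procedure would be scheduled for the dead server. */
        boolean expireServer(String serverName) {
            if (isClusterShutdown()) {
                return false; // cluster is going down: skip the crash procedure
            }
            return true; // flag not set on THIS instance: procedure gets scheduled
        }
    }

    public static void main(String[] args) {
        SketchServerManager first = new SketchServerManager();
        SketchServerManager second = new SketchServerManager();

        first.shutdownCluster(); // shutdown is recorded on the first instance only

        System.out.println("first.isClusterShutdown = " + first.isClusterShutdown());
        System.out.println("second schedules crash procedure = "
            + second.expireServer("rs1"));
    }
}
```

Because the flag is an instance field, the second instance never observes the 
shutdown and still schedules the procedure.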

I added some debug logs with hash codes. They show that clusterShutdown was set 
to true on one instance (hash=1980707837), but later during shutdown the flag 
is read from the other instance (hash=416244779), where it is still false. 
That is why a ServerCrashProcedure is created, which then hangs because there 
is no Master.
{noformat}
2018-11-29 15:43:21,948 INFO [Thread-81] master.ServerManager(160): 
ServerManager initialized. clusterShutdown false, thread 210, hash 1980707837
2018-11-29 15:43:23,929 INFO [Thread-81] master.ServerManager(913): Called 
isClusterShutdown. clusterShutdown=false, thread 210, hash=1980707837
2018-11-29 15:43:23,929 INFO [Thread-81] master.ServerManager(913): Called 
isClusterShutdown. clusterShutdown=false, thread 210, hash=1980707837
2018-11-29 15:43:29,732 INFO [Thread-80] master.ServerManager(160): 
ServerManager initialized. clusterShutdown false, thread 209, hash 416244779
2018-11-29 15:43:29,820 INFO [Thread-80] master.ServerManager(913): Called 
isClusterShutdown. clusterShutdown=false, thread 209, hash=416244779
2018-11-29 15:43:29,820 INFO [Thread-80] master.ServerManager(913): Called 
isClusterShutdown. clusterShutdown=false, thread 209, hash=416244779
2018-11-29 15:43:29,937 INFO [Time-limited test] master.ServerManager(904): Set 
clusterShutdown to true, thread 14, hash 1980707837
2018-11-29 15:43:30,985 INFO [RegionServerTracker-0] master.ServerManager(913): 
Called isClusterShutdown. clusterShutdown=false, thread 461, hash=416244779
2018-11-29 15:43:30,986 INFO [RegionServerTracker-0] master.ServerManager(913): 
Called isClusterShutdown. clusterShutdown=false, thread 461, hash=416244779
2018-11-29 15:48:29,851 INFO [master/172.30.65.195:0.Chore.1] 
master.ServerManager(913): Called isClusterShutdown. clusterShutdown=false, 
thread 417, hash=416244779
2018-11-29 15:48:29,852 INFO [master/172.30.65.195:0.Chore.1] 
master.ServerManager(913): Called isClusterShutdown. clusterShutdown=false, 
thread 417, hash=416244779
2018-11-29 15:48:29,852 INFO [master/172.30.65.195:0.Chore.1] 
master.ServerManager(913): Called isClusterShutdown. clusterShutdown=false, 
thread 417, hash=416244779
2018-11-29 15:53:32,277 INFO [master/172.30.65.195:0.Chore.1] 
master.ServerManager(913): Called isClusterShutdown. clusterShutdown=false, 
thread 417, hash=416244779
2018-11-29 15:53:32,277 INFO [master/172.30.65.195:0.Chore.1] 
master.ServerManager(913): Called isClusterShutdown. clusterShutdown=false, 
thread 417, hash=416244779
2018-11-29 15:53:32,277 INFO [master/172.30.65.195:0.Chore.1] 
master.ServerManager(913): Called isClusterShutdown. clusterShutdown=false, 
thread 417, hash=416244779
{noformat}
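
For reference, one way to produce this kind of instance-identifying log line is 
sketched below. This is an assumption about the approach, not the debug patch 
itself: the Tracked class and describe method are hypothetical, and 
System.identityHashCode is used here although the original logs may have 
printed hashCode() instead.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class IdentityLogSketch {
    /** Hypothetical object whose instances we want to tell apart in logs. */
    static class Tracked {
        private final AtomicBoolean clusterShutdown = new AtomicBoolean(false);

        void shutdownCluster() {
            clusterShutdown.set(true);
        }

        /** Builds a message carrying the flag value, thread id and instance identity. */
        String describe(String event) {
            return event + " clusterShutdown=" + clusterShutdown.get()
                + ", thread " + Thread.currentThread().getId()
                + ", hash=" + System.identityHashCode(this);
        }
    }

    public static void main(String[] args) {
        Tracked first = new Tracked();
        Tracked second = new Tracked();

        first.shutdownCluster();

        // Different hash values in the output indicate different instances.
        System.out.println(first.describe("Called isClusterShutdown."));
        System.out.println(second.describe("Called isClusterShutdown."));
    }
}
```

Comparing the hash printed where the flag is set with the hash printed where it 
is read makes it visible when two different instances are involved.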

> TestMasterFailoverWithProcedures is flaky
> -----------------------------------------
>
>                 Key: HBASE-21518
>                 URL: https://issues.apache.org/jira/browse/HBASE-21518
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.2.0, 2.0.3, 2.1.2
>            Reporter: Peter Somogyi
>            Assignee: Peter Somogyi
>            Priority: Major
>         Attachments: output.txt
>
>
> TestMasterFailoverWithProcedures fails frequently with a timeout. I hit this 
> failure during the 2.0.3RC0 vote, and it also appears on multiple flaky 
> dashboards.
> branch-2: 
> [https://builds.apache.org/view/H-L/view/HBase/job/HBase-Flaky-Tests/job/branch-2/2007/]
> branch-2.1: 
> [https://builds.apache.org/view/H-L/view/HBase/job/HBase-Flaky-Tests/job/branch-2.1/2002/]
> branch-2.0: 
> [https://builds.apache.org/view/H-L/view/HBase/job/HBase-Flaky-Tests/job/branch-2.0/1988/]
>   
> {noformat}
> [INFO] Running 
> org.apache.hadoop.hbase.master.procedure.TestMasterFailoverWithProcedures
> [ERROR] Tests run: 4, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 
> 780.648 s <<< FAILURE! - in 
> org.apache.hadoop.hbase.master.procedure.TestMasterFailoverWithProcedures
> [ERROR] 
> org.apache.hadoop.hbase.master.procedure.TestMasterFailoverWithProcedures  
> Time elapsed: 749.024 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 780 
> seconds
>       at 
> org.apache.hadoop.hbase.master.procedure.TestMasterFailoverWithProcedures.tearDown(TestMasterFailoverWithProcedures.java:86)
> [ERROR] 
> org.apache.hadoop.hbase.master.procedure.TestMasterFailoverWithProcedures  
> Time elapsed: 749.051 s  <<< ERROR!
> java.lang.Exception: Appears to be stuck in thread RS-EventLoopGroup-3-2
> {noformat}



