[ https://issues.apache.org/jira/browse/HBASE-21518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703332#comment-16703332 ]
Peter Somogyi commented on HBASE-21518:
---------------------------------------

ServerManager#expireServer checks whether the cluster is shutting down and, if so, does not create a ServerCrashProcedure for dead servers. The AtomicBoolean it checks is set to true when ServerManager#shutdownCluster is called. However, there are 2 ServerManager instances, and expireServer checks a different instance from the one on which shutdownCluster was called.

I added some debug logs with hash codes: clusterShutdown was set to true on one instance (hash=1980707837), but later during shutdown the instance being checked (hash=416244779) still holds false. That is why a ServerCrashProcedure is created, and it hangs because there is no Master.

{noformat}
2018-11-29 15:43:21,948 INFO [Thread-81] master.ServerManager(160): ServerManager initialized. clusterShutdown false, thread 210, hash 1980707837
2018-11-29 15:43:23,929 INFO [Thread-81] master.ServerManager(913): Called isClusterShutdown. clusterShutdown=false, thread 210, hash=1980707837
2018-11-29 15:43:23,929 INFO [Thread-81] master.ServerManager(913): Called isClusterShutdown. clusterShutdown=false, thread 210, hash=1980707837
2018-11-29 15:43:29,732 INFO [Thread-80] master.ServerManager(160): ServerManager initialized. clusterShutdown false, thread 209, hash 416244779
2018-11-29 15:43:29,820 INFO [Thread-80] master.ServerManager(913): Called isClusterShutdown. clusterShutdown=false, thread 209, hash=416244779
2018-11-29 15:43:29,820 INFO [Thread-80] master.ServerManager(913): Called isClusterShutdown. clusterShutdown=false, thread 209, hash=416244779
2018-11-29 15:43:29,937 INFO [Time-limited test] master.ServerManager(904): Set clusterShutdown to true, thread 14, hash 1980707837
2018-11-29 15:43:30,985 INFO [RegionServerTracker-0] master.ServerManager(913): Called isClusterShutdown. clusterShutdown=false, thread 461, hash=416244779
2018-11-29 15:43:30,986 INFO [RegionServerTracker-0] master.ServerManager(913): Called isClusterShutdown. clusterShutdown=false, thread 461, hash=416244779
2018-11-29 15:48:29,851 INFO [master/172.30.65.195:0.Chore.1] master.ServerManager(913): Called isClusterShutdown. clusterShutdown=false, thread 417, hash=416244779
2018-11-29 15:48:29,852 INFO [master/172.30.65.195:0.Chore.1] master.ServerManager(913): Called isClusterShutdown. clusterShutdown=false, thread 417, hash=416244779
2018-11-29 15:48:29,852 INFO [master/172.30.65.195:0.Chore.1] master.ServerManager(913): Called isClusterShutdown. clusterShutdown=false, thread 417, hash=416244779
2018-11-29 15:53:32,277 INFO [master/172.30.65.195:0.Chore.1] master.ServerManager(913): Called isClusterShutdown. clusterShutdown=false, thread 417, hash=416244779
2018-11-29 15:53:32,277 INFO [master/172.30.65.195:0.Chore.1] master.ServerManager(913): Called isClusterShutdown. clusterShutdown=false, thread 417, hash=416244779
2018-11-29 15:53:32,277 INFO [master/172.30.65.195:0.Chore.1] master.ServerManager(913): Called isClusterShutdown. clusterShutdown=false, thread 417, hash=416244779
{noformat}

> TestMasterFailoverWithProcedures is flaky
> -----------------------------------------
>
>                 Key: HBASE-21518
>                 URL: https://issues.apache.org/jira/browse/HBASE-21518
>             Project: HBase
>          Issue Type: Bug
>   Affects Versions: 2.2.0, 2.0.3, 2.1.2
>            Reporter: Peter Somogyi
>            Assignee: Peter Somogyi
>            Priority: Major
>         Attachments: output.txt
>
>
> The TestMasterFailoverWithProcedures test fails frequently by timing out. I
> hit this failure during the 2.0.3RC0 vote, and it also appears on multiple
> flaky dashboards.
> branch-2:
> [https://builds.apache.org/view/H-L/view/HBase/job/HBase-Flaky-Tests/job/branch-2/2007/]
> branch-2.1:
> [https://builds.apache.org/view/H-L/view/HBase/job/HBase-Flaky-Tests/job/branch-2.1/2002/]
> branch-2.0:
> [https://builds.apache.org/view/H-L/view/HBase/job/HBase-Flaky-Tests/job/branch-2.0/1988/]
>
> {noformat}
> [INFO] Running org.apache.hadoop.hbase.master.procedure.TestMasterFailoverWithProcedures
> [ERROR] Tests run: 4, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 780.648 s <<< FAILURE! - in org.apache.hadoop.hbase.master.procedure.TestMasterFailoverWithProcedures
> [ERROR] org.apache.hadoop.hbase.master.procedure.TestMasterFailoverWithProcedures  Time elapsed: 749.024 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 780 seconds
>     at org.apache.hadoop.hbase.master.procedure.TestMasterFailoverWithProcedures.tearDown(TestMasterFailoverWithProcedures.java:86)
> [ERROR] org.apache.hadoop.hbase.master.procedure.TestMasterFailoverWithProcedures  Time elapsed: 749.051 s  <<< ERROR!
> java.lang.Exception: Appears to be stuck in thread RS-EventLoopGroup-3-2
> {noformat}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
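
For illustration only, the instance mismatch described in the comment can be sketched in a few lines of Java. This is not HBase code: FakeServerManager and its methods are hypothetical stand-ins that model only the per-instance AtomicBoolean flag, and expireServer here just returns whether a crash procedure would be scheduled.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class ShutdownFlagSketch {

    /** Hypothetical stand-in for ServerManager; only the shutdown flag is modelled. */
    static class FakeServerManager {
        // Each instance owns its own flag, so setting it on one instance
        // is invisible to every other instance.
        private final AtomicBoolean clusterShutdown = new AtomicBoolean(false);

        void shutdownCluster() {
            clusterShutdown.set(true);
        }

        boolean isClusterShutdown() {
            return clusterShutdown.get();
        }

        /** Returns true if a crash procedure would be scheduled for the dead server. */
        boolean expireServer(String serverName) {
            // During shutdown nothing is scheduled for dead servers.
            return !isClusterShutdown();
        }
    }

    public static void main(String[] args) {
        FakeServerManager first = new FakeServerManager();
        FakeServerManager second = new FakeServerManager();

        // Shutdown is requested on one instance...
        first.shutdownCluster();

        // ...but the expiry path consults the other one.
        System.out.println("first  hash=" + System.identityHashCode(first)
                + " clusterShutdown=" + first.isClusterShutdown());
        System.out.println("second hash=" + System.identityHashCode(second)
                + " clusterShutdown=" + second.isClusterShutdown());
        System.out.println("second schedules crash procedure: "
                + second.expireServer("rs-1"));
    }
}
```

In this sketch the second instance still reports clusterShutdown=false and would schedule a crash procedure, which mirrors the pattern in the debug log: the flag is set on hash 1980707837 while every later check runs against hash 416244779.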