[ https://issues.apache.org/jira/browse/HBASE-25389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Stack resolved HBASE-25389.
-----------------------------------
    Fix Version/s: 2.4.1
                   2.5.0
                   3.0.0-alpha-1
     Hadoop Flags: Reviewed
         Assignee: Michael Stack
       Resolution: Fixed

Merged to branch-2.4+. Thanks for the review [~bharathv]

> [Flakey Tests] branch-2 TestMetaShutdownHandler
> -----------------------------------------------
>
>                 Key: HBASE-25389
>                 URL: https://issues.apache.org/jira/browse/HBASE-25389
>             Project: HBase
>          Issue Type: Task
>          Components: flakies
>            Reporter: Michael Stack
>            Assignee: Michael Stack
>            Priority: Major
>             Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.1
>
>
> I see this test fail regularly in local runs. We kill the server hosting meta
> and then check that meta came up in a new location after waiting on recovery.
> When the test fails, the assert on the new location fails because we have not
> waited for the CRASH to happen. Here is an excerpt from the log:
> {code}
> 2020-12-11 13:20:27,298 INFO [Listener at localhost/62149] master.TestMetaShutdownHandler(111): Deleted the znode for the RegionServer hosting hbase:meta; waiting on SSH
> ...
> 2020-12-11 13:20:27,310 INFO [Listener at localhost/62149] master.TestMetaShutdownHandler(122): Past wait on RIT
> ...
> 2020-12-11 13:20:27,351 DEBUG [RegionServerTracker-0] procedure2.ProcedureExecutor(1048): Stored pid=9, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure stack.XXX.example.com,62201,1607721618377, splitWal=true, meta=true
> {code}
> The first line is where we remove the ephemeral node for the regionserver
> carrying hbase:meta. The second line is supposed to log AFTER SCP is done (this
> old test calls it SSH). Notice how the 3rd line, logged after the 2nd, is the
> first mention of SCP being queued.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
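
The race described above (the test moves past its wait before the crash handling has even been queued) is usually avoided by polling for the triggering event before asserting on its result. The following is only a minimal sketch of that pattern, not the patch that was merged for this issue. `WaitSketch`, `waitFor`, and `demo` are hypothetical names made up for illustration, and the "crash queued" flag stands in for whatever the real test would check.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/** Hypothetical polling helper for illustration; not an HBase class. */
final class WaitSketch {
  private WaitSketch() {}

  /**
   * Polls the condition every intervalMs until it holds or timeoutMs elapses.
   * Returns true if the condition held before the deadline, false on timeout
   * or if the calling thread was interrupted while sleeping.
   */
  static boolean waitFor(long timeoutMs, long intervalMs, BooleanSupplier condition) {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for callers
        return false;
      }
    }
  }

  /**
   * Simulates the scenario: the "crash" is only noticed on another thread
   * about 40 ms after the server is killed, so the caller has to wait for it
   * rather than assert immediately.
   */
  static boolean demo() {
    AtomicBoolean crashQueued = new AtomicBoolean(false);
    Thread tracker = new Thread(() -> {
      try {
        Thread.sleep(40); // stand-in for the delay before the crash is noticed
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      crashQueued.set(true);
    });
    tracker.start();

    // Wait for the crash to be queued BEFORE checking anything that depends on it.
    boolean seen = waitFor(5_000, 10, crashQueued::get);

    try {
      tracker.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return seen;
  }
}
```

In a real test the condition passed to `waitFor` would be whatever signals that the crash procedure has been scheduled; the later wait for recovery and the assert on the new meta location would follow only once it returns true.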