Appy created HBASE-19335:
----------------------------

             Summary: Fix waitUntilAllRegionsAssigned
                 Key: HBASE-19335
                 URL: https://issues.apache.org/jira/browse/HBASE-19335
             Project: HBase
          Issue Type: Bug
            Reporter: Appy
            Assignee: Appy

Found when debugging flaky test TestRegionObserverInterface#testRecovery.
In the end, the test does the following:
- Kills the RS
- Waits for all regions to be assigned
- Some validation (unrelated)
- Cleanup: delete table.

{noformat}
      cluster.killRegionServer(rs1.getRegionServer().getServerName());
      Threads.sleep(1000); // Let the kill soak in.
      util.waitUntilAllRegionsAssigned(tableName);
      LOG.info("All regions assigned");

      verifyMethodResult(SimpleRegionObserver.class,
          new String[] { "getCtPreReplayWALs", "getCtPostReplayWALs", "getCtPreWALRestore",
              "getCtPostWALRestore", "getCtPrePut", "getCtPostPut" },
          tableName, new Integer[] { 1, 1, 2, 2, 0, 0 });
    } finally {
      util.deleteTable(tableName);
      table.close();
    }
  }
{noformat}

However, the test logs show Assigns overlapping with Unassigns. As a result, regions ended up 'stuck in RIT' and the test timed out.
The Assigns came from the ServerCrashRecovery and the Unassigns came from the deleteTable cleanup.

Which raises the question: why did HBTU.waitUntilAllRegionsAssigned(tableName) not wait until recovery was complete?

Answer: It looks like that function is only meant for sunny-day scenarios, not for crashes. It iterates over meta and just [checks for *some value* in the server column|https://github.com/apache/hbase/blob/cdc2bb17ff38dcbd273cf501aea565006e995a06/hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java#L3421]. After a crash that value is obviously still present: it is the name of the server that was just killed, so the wait returns before the regions have been reassigned.

This bug is probably affecting other fault tolerance tests too, so fixing it will hopefully fix more than just one test.
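
To make the failure mode concrete, here is a minimal, self-contained sketch. It is a toy model, not HBase code: the class AssignmentCheck, both method names, and the map standing in for the meta server column are all hypothetical. It only contrasts a check that accepts any value in the server column with one that also requires the recorded server to be live, which is one possible direction for a fix and not necessarily the one that will be adopted.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the two checks; none of these names are HBase APIs. */
class AssignmentCheck {

    /** Mirrors the behaviour described above: any non-empty server value counts as assigned. */
    static boolean allAssignedNaive(Map<String, String> regionToServer) {
        for (String server : regionToServer.values()) {
            if (server == null || server.isEmpty()) {
                return false;
            }
        }
        return true;
    }

    /** Stricter variant (an assumption): the recorded server must also be currently live. */
    static boolean allAssignedOnLiveServers(Map<String, String> regionToServer,
                                            Set<String> liveServers) {
        for (String server : regionToServer.values()) {
            if (server == null || !liveServers.contains(server)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Stand-in for the server column in meta: region name -> server name.
        Map<String, String> meta = new HashMap<>();
        meta.put("region-a", "rs1");
        meta.put("region-b", "rs2");

        // rs1 was just killed, so only rs2 is live; meta still points region-a at rs1.
        Set<String> live = new HashSet<>();
        live.add("rs2");

        System.out.println("naive: " + allAssignedNaive(meta));                    // prints "naive: true"
        System.out.println("live-aware: " + allAssignedOnLiveServers(meta, live)); // prints "live-aware: false"
    }
}
```

In this model the naive check returns true immediately after the kill, which matches the early return seen in the test, while the live-aware check keeps reporting false until the stale entry is replaced by a live server.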