[ https://issues.apache.org/jira/browse/HBASE-19335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16264372#comment-16264372 ]
Chia-Ping Tsai edited comment on HBASE-19335 at 11/23/17 2:03 PM: ------------------------------------------------------------------ Ran {{TestRegionObserverInterface}} with patch 50 times. All pass. was (Author: chia7712): Ran {{TestRegionObserverInterface}} with patch 50 times. All passes. > Fix waitUntilAllRegionsAssigned > ------------------------------- > > Key: HBASE-19335 > URL: https://issues.apache.org/jira/browse/HBASE-19335 > Project: HBase > Issue Type: Bug > Reporter: Appy > Assignee: Appy > Attachments: HBASE-19335.master.001.patch > > > Found when debugging flaky test TestRegionObserverInterface#testRecovery. > In the end, the test does the following: > - Kills the RS > - Waits for all regions to be assigned > - Some validation (unrelated) > - Cleanup: delete table. > {noformat} > cluster.killRegionServer(rs1.getRegionServer().getServerName()); > Threads.sleep(1000); // Let the kill soak in. > util.waitUntilAllRegionsAssigned(tableName); > LOG.info("All regions assigned"); > verifyMethodResult(SimpleRegionObserver.class, > new String[] { "getCtPreReplayWALs", "getCtPostReplayWALs", > "getCtPreWALRestore", > "getCtPostWALRestore", "getCtPrePut", "getCtPostPut" }, > tableName, new Integer[] { 1, 1, 2, 2, 0, 0 }); > } finally { > util.deleteTable(tableName); > table.close(); > } > } > {noformat} > However, looking at test logs, found that we had overlapping Assigns with > Unassigns. As a result, regions ended up 'stuck in RIT' and the test timeout. > Assigns were from the ServerCrashRecovery and Unassigns were from the > deleteTable cleanup. > Which begs the question, why did HBTU.waitUntilAllRegionsAssigned(tableName) > not wait until recovery was complete. > Answer: Looks like that function is only meant for sunny scenarios but not > for crashes. It iterates over meta and just [checks for *some value* in the > server > column|https://github.com/apache/hbase/blob/cdc2bb17ff38dcbd273cf501aea565006e995a06/hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java#L3421] > which is obviously present and equal to the server that was just killed. > This bug must be affecting other fault tolerance tests too and fixing it may > fix more than just one test, hopefully. -- This message was sent by Atlassian JIRA (v6.4.14#64029)