Appy created HBASE-19335:
----------------------------

             Summary: Fix waitUntilAllRegionsAssigned
                 Key: HBASE-19335
                 URL: https://issues.apache.org/jira/browse/HBASE-19335
             Project: HBase
          Issue Type: Bug
            Reporter: Appy
            Assignee: Appy


Found while debugging the flaky test TestRegionObserverInterface#testRecovery.
Near its end, the test does the following:
- Kills the RS
- Waits for all regions to be assigned
- Some validation (unrelated)
- Cleanup: delete table.
{noformat}
      cluster.killRegionServer(rs1.getRegionServer().getServerName());
      Threads.sleep(1000); // Let the kill soak in.
      util.waitUntilAllRegionsAssigned(tableName);
      LOG.info("All regions assigned");

      verifyMethodResult(SimpleRegionObserver.class,
        new String[] { "getCtPreReplayWALs", "getCtPostReplayWALs", "getCtPreWALRestore",
            "getCtPostWALRestore", "getCtPrePut", "getCtPostPut" },
        tableName, new Integer[] { 1, 1, 2, 2, 0, 0 });
    } finally {
      util.deleteTable(tableName);
      table.close();
    }
  }
{noformat}

However, the test logs showed Assigns overlapping with Unassigns. As a result,
regions ended up 'stuck in RIT' and the test timed out.
The Assigns came from the ServerCrashRecovery and the Unassigns came from the
deleteTable cleanup.
This raises the question: why did HBTU.waitUntilAllRegionsAssigned(tableName)
not wait until recovery was complete?

Answer: It looks like that function is only meant for sunny-day scenarios, not
for crashes. It iterates over meta and just
[checks for *some value* in the server column|https://github.com/apache/hbase/blob/cdc2bb17ff38dcbd273cf501aea565006e995a06/hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java#L3421].
After a crash that value is still present: it names the server that was just
killed, so the check passes before recovery has reassigned the regions.
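
The difference between "some value in the server column" and "a value naming a
live server" can be sketched in plain Java. This is only a simplified model and
not the HBaseTestingUtility code or a proposed patch: the class
AssignmentCheckSketch, both method names, the map standing in for meta, and the
server name strings are all hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified model (assumption): "meta" is a map from region name to the
// contents of its server column; "liveServers" is the set of running servers.
class AssignmentCheckSketch {

  // Weak check, modelled on the behaviour described above:
  // any non-null server value counts as "assigned".
  public static boolean allRegionsHaveServerValue(Map<String, String> meta) {
    for (String server : meta.values()) {
      if (server == null) {
        return false;
      }
    }
    return true;
  }

  // Stricter check (hypothetical): the server value must also name a live server.
  public static boolean allRegionsOnLiveServers(Map<String, String> meta,
      Set<String> liveServers) {
    for (String server : meta.values()) {
      if (server == null || !liveServers.contains(server)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Both regions still point at rs1, which has just been killed.
    Map<String, String> meta = new HashMap<>();
    meta.put("region-a", "rs1,16020,1000");
    meta.put("region-b", "rs1,16020,1000");

    Set<String> liveServers = new HashSet<>();
    liveServers.add("rs2,16020,1000");

    System.out.println("weak check:   " + allRegionsHaveServerValue(meta));
    System.out.println("strict check: " + allRegionsOnLiveServers(meta, liveServers));
  }
}
```

In this model the weak check passes immediately after the kill because meta
still holds the dead server's name, while the strict check keeps failing until
the regions are recorded against a live server.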

This bug is probably affecting other fault-tolerance tests too, so fixing it
will hopefully fix more than just this one test.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
