Jerry He created HBASE-13526:
--------------------------------

             Summary: TestRegionServerReportForDuty can be flaky: hang or 
timeout
                 Key: HBASE-13526
                 URL: https://issues.apache.org/jira/browse/HBASE-13526
             Project: HBase
          Issue Type: Bug
          Components: test
    Affects Versions: 0.98.12, 2.0.0, 1.1.0
            Reporter: Jerry He
            Assignee: Jerry He
            Priority: Minor


This test case is from HBASE-13317.
The test uses a custom region server to simulate reportForDuty in a master 
failover case.  This custom RS would start, then the primary master would fail, 
then the custom RS would  reportForDuty to the second master after master 
failover.

The test occasionally will hang or timeout.
The root cause is that during first master initialization, the master would 
assign meta (and create and assign namespace table). It is possible that the 
meta is assigned to the custom RS, which has started (place a rs node on the 
ZK), but will not really check-in and be online. Then the master will go thru 
multiple re-assignment, which can be lengthy and cause trouble.

There are a couple of issues I see in the master assignment code:
1.  Master puts all the region servers obtained from ZK rs node into the online 
server list, including those that have not checked-in via RPC.  And we will 
assign meta or other regions based on whole list.
2. When one assign plan fails, we don't exclude the failed server when picking 
the next destination, which may prolong the assignment process.

I will provide a patch to fix the test case.  The other issues mentioned are up 
to discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to