Aleksandr Shulman created HBASE-10924:
-----------------------------------------

             Summary: [region_mover]: Adjust region_mover script to retry 
unloading a server a configurable number of times in case of region 
splits/merges
                 Key: HBASE-10924
                 URL: https://issues.apache.org/jira/browse/HBASE-10924
             Project: HBase
          Issue Type: Bug
          Components: Region Assignment
    Affects Versions: 0.94.15
            Reporter: Aleksandr Shulman
            Assignee: Aleksandr Shulman
             Fix For: 0.94.19


Observed behavior:
In about 5% of cases, my rolling upgrade tests fail because of stuck regions 
during a region server unload. My theory is that this occurs when region 
assignment information changes between the time the region list is generated 
and the time a given region is actually moved.

An example of such a region information change is a split or merge.

Example:
Regionserver A has 100 regions (#0-#99). The balancer is turned off and the 
region_mover script is called to unload this regionserver. The script 
generates the list of 100 regions to be moved and then proceeds down that 
list, moving the regions off in series. However, suppose region #84 splits 
into two daughter regions while regions #0-#83 are being moved. The script 
will get stuck trying to move #84 (which no longer exists as listed), time 
out, and then the failure will bubble up (attempt 1 failed).

Proposed solution:
This specific failure mode should be caught, and the region_mover script 
should then regenerate the region list and try again to move off all regions 
still on the server. In the example above, it will now have 16+1 (due to the 
split) regions to move. There is a good chance that it will be able to move 
all 17 off without issues. However, should it encounter this same issue again 
(attempt 2 failed), it will retry once more. This process will continue until 
the maximum number of unload retry attempts has been reached.
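
The retry loop described above could be sketched roughly as follows. This is 
illustrative Python, not the actual region_mover code: list_regions, 
move_region, StaleRegionError, and max_attempts are hypothetical names chosen 
for this sketch, and the real script would need its own way of detecting that 
a move failed because of a split or merge.

```python
class StaleRegionError(Exception):
    """Hypothetical error: a listed region no longer exists (e.g. it split)."""


def unload_with_retries(list_regions, move_region, max_attempts=3):
    """Unload a server, re-listing its regions after each stale-region failure.

    list_regions: hypothetical callable returning the regions currently hosted.
    move_region:  hypothetical callable that moves one region off the server
                  and raises StaleRegionError if the region has disappeared
                  since the list was generated.
    Returns the number of attempts used; raises RuntimeError if all fail.
    """
    for attempt in range(1, max_attempts + 1):
        # Regenerate the list on every attempt so that daughter regions
        # created by a split are picked up.
        snapshot = list(list_regions())
        try:
            for region in snapshot:
                move_region(region)
            return attempt
        except StaleRegionError:
            # The region set changed underneath us; re-list and try again.
            continue
    raise RuntimeError("unload failed after %d attempts" % max_attempts)
```

The key design point is that the region list is rebuilt inside the loop 
rather than once up front, so each attempt works from current assignment 
information.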

This is not foolproof, but suppose for the sake of argument that 5% of unload 
attempts hit this issue. Then, with a retry count of 3 (three attempts in 
total) and assuming attempts fail independently of one another, the unload 
failure probability drops from 0.05 to 0.000125 (0.05^3).
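
As a quick check of that arithmetic, here is a small Python sketch. It 
assumes that every attempt fails independently with the same probability, 
which is a simplification, and the function name is made up for this example.

```python
def unload_failure_probability(p_attempt_failure, attempts):
    """Probability that every one of `attempts` independent tries fails."""
    if not 0.0 <= p_attempt_failure <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if attempts < 1:
        raise ValueError("need at least one attempt")
    return p_attempt_failure ** attempts


# Overall failure probability for 1 to 3 attempts at a 5% per-attempt rate.
for n in range(1, 4):
    print(n, unload_failure_probability(0.05, n))
```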

Next steps:
I am looking for feedback on this approach. If it seems like a sensible 
approach, I will create a strawman patch and test it.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
