[ 
https://issues.apache.org/jira/browse/HBASE-10924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksandr Shulman updated HBASE-10924:
--------------------------------------

    Status: Patch Available  (was: Open)

> [region_mover]: Adjust region_mover script to retry unloading a server a 
> configurable number of times in case of region splits/merges
> -------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-10924
>                 URL: https://issues.apache.org/jira/browse/HBASE-10924
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 0.94.15
>            Reporter: Aleksandr Shulman
>            Assignee: Aleksandr Shulman
>              Labels: region_mover, rolling_upgrade
>             Fix For: 0.94.20
>
>         Attachments: HBASE-10924-0.94-v2.patch, HBASE-10924-0.94-v3.patch
>
>
> Observed behavior:
> In about 5% of cases, my rolling upgrade tests fail because of stuck regions 
> during a region server unload. My theory is that this occurs when region 
> assignment information changes between the time the region list is generated 
> and the time a given region on that list is actually moved.
> An example of such a region information change is a split or merge.
> Example:
> Regionserver A has 100 regions (#0-#99). The balancer is turned off and the 
> region_mover script is called to unload this regionserver. The script 
> generates the list of 100 regions to be moved and then proceeds down that 
> list, moving the regions off in series. However, region #84 has split into 
> two daughter regions while regions #0-#83 were being moved. The script gets 
> stuck trying to move #84, times out, and the failure then bubbles up 
> (attempt 1 failed).
> Proposed solution:
> This specific failure mode should be caught, and the region_mover script 
> should then regenerate the region list and try again to move off all the 
> remaining regions. This time it will have 16+1 (due to the split) regions 
> to move. There is a good chance that it will be able to move all 17 off 
> without issues. However, should it encounter the same issue again 
> (attempt 2 failed), it will retry once more. This process will continue 
> until the maximum number of unload retry attempts has been reached.
> This is not foolproof, but suppose for the sake of argument that 5% of 
> unload attempts hit this issue and that attempts fail independently. Then 
> allowing up to 3 attempts reduces the unload failure probability from 0.05 
> to 0.000125 (0.05^3).
> Next steps:
> I am looking for feedback on this approach. If it seems like a sensible 
> approach, I will create a strawman patch and test it.
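
For illustration only, the retry idea in the quoted description could be
sketched roughly as follows. This is not the attached patch and not the
region_mover script itself: StuckRegionError, list_regions, move_region and
max_attempts are hypothetical names introduced here, and the probability
helper assumes that unload attempts fail independently.

```python
class StuckRegionError(Exception):
    """Hypothetical error: a listed region could not be moved before the timeout."""


def unload_with_retries(list_regions, move_region, max_attempts=3):
    """Unload a server, regenerating the region list on every attempt.

    list_regions() returns the regions currently hosted on the server and
    move_region(region) moves one region or raises StuckRegionError. Both
    callables are hypothetical stand-ins, not HBase APIs.
    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            # Re-read the list so that splits or merges which happened since
            # the previous attempt are reflected in what we try to move.
            for region in list_regions():
                move_region(region)
            return attempt
        except StuckRegionError:
            if attempt == max_attempts:
                raise  # out of attempts, let the failure bubble up
    raise ValueError("max_attempts must be at least 1")


def failure_probability(p_single, attempts):
    """Chance that every attempt fails, assuming independent attempts."""
    return p_single ** attempts
```

With p_single = 0.05 and attempts = 3, failure_probability gives roughly
0.000125, which matches the figure in the description. The key design choice
is that the region list is rebuilt inside the loop, so daughter regions
created by a split show up on the next attempt instead of the stale parent.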



