[ 
https://issues.apache.org/jira/browse/HBASE-10924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040236#comment-14040236
 ] 

Aleksandr Shulman commented on HBASE-10924:
-------------------------------------------

Hmm - I think it's still the intention of the patch to have region_mover do a 
best-effort move of all the regions, as the script did before. The main 
addition is that it will retry that process a configurable number of times, in 
case of the strange transient conditions we've seen, such as the master being 
down when the move request is sent.
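
To make the retry idea concrete, here is a minimal sketch in Python. It is not
the region_mover code itself. The names unload_with_retries, list_regions,
move_region and RegionMoveError are hypothetical, and the sketch assumes that a
failed move aborts the current pass and that the region list is regenerated at
the start of every attempt.

```python
import time
from typing import Callable, Iterable, List, Optional


class RegionMoveError(Exception):
    """Hypothetical error: a single region could not be moved."""


def unload_with_retries(
    list_regions: Callable[[], Iterable[str]],
    move_region: Callable[[str], None],
    max_attempts: int = 3,
    pause_seconds: float = 0.0,
) -> int:
    """Move every region off a server, retrying the whole pass on failure.

    list_regions and move_region are caller-supplied stand-ins (assumptions)
    for "list the regions still on the server" and "move one region".
    Returns the number of the attempt that succeeded.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    last_error: Optional[RegionMoveError] = None
    for attempt in range(1, max_attempts + 1):
        # Regenerate the list on every attempt so that daughters of a split
        # (or the result of a merge) replace any stale entries.
        regions: List[str] = list(list_regions())
        try:
            for region in regions:
                move_region(region)
            return attempt
        except RegionMoveError as err:
            last_error = err
            if attempt < max_attempts:
                time.sleep(pause_seconds)

    raise RegionMoveError(
        f"unload failed after {max_attempts} attempts"
    ) from last_error
```

Regenerating the list inside the loop is the important design choice here: a
retry that reused the original list would hit the same stale region again.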

Overall, I've seen the region_mover work pretty well and I see this patch as 
just being a minor stability improvement. If you believe there is a better way 
to do this region movement, such as failing fast on a split region, I'd be 
happy to test such a patch in our frameworks.

If we're happy with the logic of this patch, then I can post a version for 
0.96, 0.98, trunk, etc.

> [region_mover]: Adjust region_mover script to retry unloading a server a 
> configurable number of times in case of region splits/merges
> -------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-10924
>                 URL: https://issues.apache.org/jira/browse/HBASE-10924
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 0.94.15
>            Reporter: Aleksandr Shulman
>            Assignee: Aleksandr Shulman
>              Labels: region_mover, rolling_upgrade
>             Fix For: 0.94.22
>
>         Attachments: HBASE-10924-0.94-v2.patch, HBASE-10924-0.94-v3.patch
>
>
> Observed behavior:
> In about 5% of cases, my rolling upgrade tests fail because of stuck regions 
> during a region server unload. My theory is that this occurs when region 
> assignment information changes between the time the region list is generated 
> and the time the region is to be moved.
> An example of such a region information change is a split or merge.
> Example:
> Regionserver A has 100 regions (#0-#99). The balancer is turned off and the 
> region_mover script is called to unload this regionserver. The region_mover 
> script will generate the list of 100 regions to be moved and then proceed 
> down that list, moving the regions off in series. However, region #84 has 
> split into two daughter regions while regions #0-#83 were being moved. 
> The script will get stuck trying to move #84, time out, and then the failure 
> will bubble up (attempt 1 failed).
> Proposed solution:
> This specific failure mode should be caught, and the region_mover script 
> should then make another attempt to move off all the remaining regions. In 
> the example above, it will have 16+1 (due to the split) regions to move. 
> There is a good chance that it will be able to move all 17 off without 
> issues. However, should it encounter this same issue (attempt 2 failed), it 
> will retry. This process will continue until the maximum number of unload 
> retry attempts has been reached.
> This is not foolproof, but suppose for the sake of argument that 5% of 
> unload attempts hit this issue. With a retry count of 3 (three attempts in 
> total), and assuming the attempts fail independently, the unload failure 
> probability drops from 0.05 to 0.000125 (0.05^3).
> Next steps:
> I am looking for feedback on this approach. If it seems like a sensible 
> approach, I will create a strawman patch and test it.
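
The failure estimate in the description (0.05 down to 0.000125) can be checked
with a few lines of Python. The function name unload_failure_probability is
hypothetical, and the calculation assumes that every attempt fails
independently with the same probability, which may not hold if the underlying
condition persists across attempts.

```python
def unload_failure_probability(per_attempt_failure: float, attempts: int) -> float:
    """Probability that every one of `attempts` unload attempts fails.

    Assumes attempts are independent and equally likely to fail.
    """
    if not 0.0 <= per_attempt_failure <= 1.0:
        raise ValueError("per_attempt_failure must be between 0 and 1")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    # All attempts must fail for the unload as a whole to fail.
    return per_attempt_failure ** attempts


if __name__ == "__main__":
    # Figures from the issue description: 5% per attempt, three attempts.
    for n in (1, 2, 3):
        print(n, unload_failure_probability(0.05, n))
```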



--
This message was sent by Atlassian JIRA
(v6.2#6252)
