[ https://issues.apache.org/jira/browse/HBASE-10924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aleksandr Shulman updated HBASE-10924:
--------------------------------------
    Attachment: HBASE-10924-0.94-v2.patch

Attaching a better version of the patch here. It's relatively straightforward, but if there is interest in a formal review, I can put it up on RB.

Testing:
I ran this patch through an in-house rolling upgrade test framework. It performs MR jobs, splits, compactions, and DML while regions are moving.
I also did some explicit testing by installing this on a cluster and moving regions back and forth while doing splits.
All of these tests passed.

> [region_mover]: Adjust region_mover script to retry unloading a server a
> configurable number of times in case of region splits/merges
> -------------------------------------------------------------------------
>
>                 Key: HBASE-10924
>                 URL: https://issues.apache.org/jira/browse/HBASE-10924
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 0.94.15
>            Reporter: Aleksandr Shulman
>            Assignee: Aleksandr Shulman
>              Labels: region_mover, rolling_upgrade
>             Fix For: 0.94.20
>
>         Attachments: HBASE-10924-0.94-v2.patch
>
>
> Observed behavior:
> In about 5% of cases, my rolling upgrade tests fail because of stuck regions
> during a region server unload. My theory is that this occurs when region
> assignment information changes between the time the region list is generated
> and the time the region is to be moved.
> An example of such a region information change is a split or merge.
> Example:
> Regionserver A has 100 regions (#0-#99). The balancer is turned off and the
> regionmover script is called to unload this regionserver. The regionmover
> script will generate the list of 100 regions to be moved and then proceed
> down that list, moving the regions off in series. However, one region, #84,
> split into two daughter regions while regions #0-#83 were being moved.
> The script will get stuck trying to move #84, time out, and then the failure
> will bubble up (attempt 1 failed).
> Proposed solution:
> This specific failure mode should be caught, and the region_mover script
> should then regenerate the region list and try again to move all remaining
> regions off the server. Now it will have 16+1 (due to the split) regions to
> move. There is a good chance that it will be able to move all 17 off without
> issues. However, should it encounter the same issue again (attempt 2 failed),
> it will retry once more. This process will continue until the maximum number
> of unload retry attempts has been reached.
> This is not foolproof, but let's say for the sake of argument that 5% of
> unload attempts hit this issue and that attempts fail independently. Then a
> retry count of 3 reduces the unload failure probability from 0.05 to
> 0.000125 (0.05^3).
> Next steps:
> I am looking for feedback on this approach. If it seems like a sensible
> approach, I will create a strawman patch and test it.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
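
For readers who want to see the retry idea from the issue description in concrete form, here is a minimal Python sketch. It is an illustration only, not the attached patch: `StuckRegionError`, `unload_with_retries`, `failure_probability`, and `FakeServer` are all hypothetical names, and the fake server just simulates the #84 split from the example. The probability helper assumes each unload attempt fails independently with the same probability.

```python
class StuckRegionError(Exception):
    """Hypothetical error: a region in the list could not be moved."""


def unload_with_retries(list_regions, move_region, max_attempts=3):
    """Unload a server, rebuilding the region list on every attempt.

    list_regions and move_region are hypothetical callables standing in
    for whatever the real script does. Returns the number of attempts
    used; re-raises StuckRegionError once max_attempts is exhausted.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            # Take a fresh snapshot so splits/merges since the last
            # attempt are reflected in the list of regions to move.
            for region in list_regions():
                move_region(region)
            return attempt
        except StuckRegionError:
            if attempt == max_attempts:
                raise


def failure_probability(p_single, max_attempts):
    """Chance that every attempt fails, assuming independent attempts."""
    return p_single ** max_attempts


class FakeServer:
    """Simulates the example: region 84 splits during the first unload."""

    def __init__(self):
        self.regions = [str(i) for i in range(100)]
        self.split_done = False

    def list_regions(self):
        return list(self.regions)  # snapshot, like the generated list

    def move_region(self, name):
        if name == "84" and not self.split_done:
            # The parent region is replaced by two daughters, so the
            # stale list entry can no longer be moved.
            self.regions.remove("84")
            self.regions += ["84a", "84b"]
            self.split_done = True
            raise StuckRegionError(name)
        self.regions.remove(name)


if __name__ == "__main__":
    server = FakeServer()
    used = unload_with_retries(server.list_regions, server.move_region, 3)
    print("attempts used:", used)
    print("regions left:", len(server.regions))
    print("failure probability:", failure_probability(0.05, 3))
```

In this simulation the first attempt moves regions 0-83 and then fails on 84, leaving 17 regions (15 untouched plus two daughters) for the second attempt, which matches the 16+1 count in the description.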