[ 
https://issues.apache.org/jira/browse/HBASE-28533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Roudnitsky updated HBASE-28533:
--------------------------------------
    Description: 
Depending on where the split procedure fails, SplitTableRegionProcedure 
rollback can leave the parent region's in-memory RegionStateNode in SPLITTING 
after rollback is complete, even though the parent region is still online on 
its assigned region server and is recorded as online in meta. The active 
HMaster then believes, according to its RegionStates, that the parent region 
is offline, so subsequent procedures that require the region to be online, 
such as merge/split/move, fail to start. One workaround is to restart the 
active HMaster to reset the in-memory record of region states. 

Two example scenarios where this can happen:
 * If we get to SPLIT_TABLE_REGION_CLOSE_PARENT_REGION and the parent region 
has a replica which is in transition, the unassign procedure in that step is 
never created/rolled back and we are left with the parent region state in 
SPLITTING.
 * If region quotas are enabled and a split is run for a region whose namespace 
is at its maximum region quota limit, we fail in 
SPLIT_TABLE_REGION_PRE_OPERATION with QuotaExceededException and we are left 
with the parent region state in SPLITTING.
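
The failure pattern in both scenarios can be sketched with a toy model. This is not HBase code: SplitRollbackSketch, RegionNode, and failAndRollBack are hypothetical names made up for illustration, and the two steps only loosely stand in for the real procedure states. The sketch assumes that an early step marks the parent as SPLITTING in memory and that rollback only undoes actions it has an explicit undo for.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model only; none of these names are HBase classes. */
public class SplitRollbackSketch {
    enum State { OPEN, SPLITTING }

    /** Hypothetical stand-in for the master's in-memory view of one region. */
    static final class RegionNode {
        State state = State.OPEN;
    }

    /**
     * Runs a two-step "split" that always fails in its second step, then rolls back.
     * When restoreOnRollback is false, the state change made in step 1 has no undo
     * action, so the in-memory state stays SPLITTING after rollback.
     */
    static State failAndRollBack(boolean restoreOnRollback) {
        RegionNode parent = new RegionNode();
        Deque<Runnable> undo = new ArrayDeque<>();
        try {
            // Step 1: mark the parent as SPLITTING in memory.
            State previous = parent.state;
            parent.state = State.SPLITTING;
            if (restoreOnRollback) {
                undo.push(() -> parent.state = previous);
            }
            // Step 2: simulated failure (for example, a quota check).
            throw new IllegalStateException("simulated quota failure");
        } catch (IllegalStateException e) {
            // Rollback: run the recorded undo actions, most recent first.
            while (!undo.isEmpty()) {
                undo.pop().run();
            }
        }
        return parent.state;
    }

    public static void main(String[] args) {
        System.out.println("without restore: " + failAndRollBack(false));
        System.out.println("with restore:    " + failAndRollBack(true));
    }
}
```

In this model the run without a restore action ends in SPLITTING and the run with one ends in OPEN, which is the behaviour a rollback is expected to have.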

The region replica case is demonstrated in 
TestSplitTableRegionProcedure.testRollbackForSplitTableRegionWithReplica.

To reproduce the region quota case in HBase shell:
{code:java}
> create_namespace 'test_ns', {'hbase.namespace.quota.maxregions'=> 2}
> create 'test_ns:test_table', 'f1', {NUMREGIONS => 2, SPLITALGO => 'UniformSplit'}
> region_a = <first region from list_regions 'test_ns:test_table'>
> region_b = <second region from list_regions 'test_ns:test_table'>

> split region_a, 'x'
# HMaster will report: 
pid=405, state=ROLLEDBACK, 
exception=org.apache.hadoop.hbase.quotas.QuotaExceededException via 
master-split-regions:org.apache.hadoop.hbase.quotas.QuotaExceededException: 
Region split not possible for :<region_a> as quota limits are exceeded ; 
SplitTableRegionProcedure table=test_ns:test_table, parent=...

> merge_region region_a, region_b
ERROR: org.apache.hadoop.hbase.exceptions.MergeRegionException: 
org.apache.hadoop.hbase.client.DoNotRetryRegionException: <region_a> is not 
OPEN; state=SPLITTING

> stop_master # trigger hmaster failover 
> merge_region region_a, region_b # merge now succeeds {code}

  was:
Depending on where the split procedure fails, SplitTableRegionProcedure 
rollback can leave the parent region's RegionStateNode in SPLITTING after 
rollback is complete, when the parent region is still online on the assigned 
region server and according to meta. This leaves active HMaster believing that 
the parent region is offline according to its RegionStates, and causes 
subsequent procedures that require the region to be online like 
merge/split/move to fail to start. One workaround is to restart active HMaster 
to reset the in memory record of region states. 

Two example scenarios where this can happen:
 * If we get to SPLIT_TABLE_REGION_CLOSE_PARENT_REGION and the parent region 
has a replica which is in transition, the unassign procedure in that step is 
never created/rolled back and we are left with the parent region state in 
splitting.
 * If region quotas are enabled and a split is run for a region whose namespace 
is at its maximum region quota limit we will fail in 
SPLIT_TABLE_REGION_PRE_OPERATION with QuotaExceededException and we are left 
with the parent region state in splitting

The region replica case is demonstrated in 
TestSplitTableRegionProcedure.testRollbackForSplitTableRegionWithReplica.

To reproduce the region quota case in HBase shell:
{code:java}
> create_namespace 'test_ns', {'hbase.namespace.quota.maxregions'=> 2}
> create 'test_ns:test_table', 'f1', {NUMREGIONS => 2, SPLITALGO => 'UniformSplit'}
> region_a = <first region from list_regions 'test_ns:test_table'>
> region_b = <second region from list_regions 'test_ns:test_table'>

> split region_a, 'x'
# HMaster will report: 
pid=405, state=ROLLEDBACK, 
exception=org.apache.hadoop.hbase.quotas.QuotaExceededException via 
master-split-regions:org.apache.hadoop.hbase.quotas.QuotaExceededException: 
Region split not possible for :<region_a> as quota limits are exceeded ; 
SplitTableRegionProcedure table=test_ns:test_table, parent=...

> merge_region region_a, region_b
ERROR: org.apache.hadoop.hbase.exceptions.MergeRegionException: 
org.apache.hadoop.hbase.client.DoNotRetryRegionException: <region_a> is not 
OPEN; state=SPLITTING

> stop_master # trigger hmaster failover 
> merge_region region_a, region_b # merge now succeeds {code}


> Split procedure rollback can leave parent region state in SPLITTING after 
> completion
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-28533
>                 URL: https://issues.apache.org/jira/browse/HBASE-28533
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 2.5.8, 3.0.0-beta-2
>            Reporter: Daniel Roudnitsky
>            Assignee: Daniel Roudnitsky
>            Priority: Major
>              Labels: pull-request-available
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
