[
https://issues.apache.org/jira/browse/HBASE-30201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18085942#comment-18085942
]
rstest commented on HBASE-30201:
--------------------------------
Potential fix direction:
- In `checkIfShouldMoveSystemRegionAsync()`, do not call the normal
`moveAsync(...)` path for `hbase:meta` when the region is already in transition
or when the source/target server is dead, queued-dead, or currently being
handled by `ServerCrashProcedure`.
- If `hbase:meta` already has an active procedure, defer the compatibility move
and retry after the procedure completes instead of treating the normal move
failure as the recovery path.
- Add a regression test where an upgraded HMaster sees a newly registered
RegionServer while the old RegionServer hosting `hbase:meta` has failed during
`OPENING`; the test should assert that `hbase:meta` is eventually assigned and
that the compatibility move path does not interfere with crash recovery.
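
A minimal sketch of the first two points above, written against a simplified stand-in model rather than the real `AssignmentManager`. Every name in it (`MetaMoveGuard`, `RegionView`, `Decision`, `decide`, `attempt`, `pendingRetries`) is hypothetical, and the set of unsafe servers is assumed to be supplied by the caller; how the master would compute that set is not shown.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.Set;

// Hypothetical, simplified model. None of these types are HBase classes.
class MetaMoveGuard {
    enum Decision { MOVE, DEFER }

    /** Stand-in for what the master knows about one region at check time. */
    static final class RegionView {
        final String name;
        final boolean inTransition;       // e.g. state OPENING
        final boolean hasActiveProcedure; // e.g. the pid=49 case
        final String source;
        final String target;

        RegionView(String name, boolean inTransition, boolean hasActiveProcedure,
                   String source, String target) {
            this.name = name;
            this.inTransition = inTransition;
            this.hasActiveProcedure = hasActiveProcedure;
            this.source = source;
            this.target = target;
        }
    }

    // Regions whose compatibility move was deferred and should be re-checked later.
    private final Queue<String> deferred = new ArrayDeque<>();

    /**
     * Decides whether a compatibility move may be submitted now.
     * unsafeServers is assumed to hold servers that are dead, queued-dead,
     * or currently being handled by crash recovery.
     */
    static Decision decide(RegionView r, Set<String> unsafeServers) {
        if (r.inTransition || r.hasActiveProcedure) {
            return Decision.DEFER; // another procedure owns the region
        }
        if (unsafeServers.contains(r.source) || unsafeServers.contains(r.target)) {
            return Decision.DEFER; // leave recovery to the crash-handling path
        }
        return Decision.MOVE;
    }

    /** Returns true if the move may be submitted now; otherwise queues a retry. */
    boolean attempt(RegionView r, Set<String> unsafeServers) {
        if (decide(r, unsafeServers) == Decision.MOVE) {
            deferred.remove(r.name);
            return true;
        }
        if (!deferred.contains(r.name)) {
            deferred.add(r.name);
        }
        return false;
    }

    /** Regions to re-evaluate once their active procedure has completed. */
    Queue<String> pendingRetries() {
        return deferred;
    }
}
```

In this model a caller would invoke `attempt(...)` from the compatibility check and call it again for each entry in `pendingRetries()` once the blocking procedure finishes, so a deferred move is never treated as a failed recovery.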
> HBase rolling upgrade buggy on `hbase:meta` crash recovery
> ----------------------------------------------------------
>
> Key: HBASE-30201
> URL: https://issues.apache.org/jira/browse/HBASE-30201
> Project: HBase
> Issue Type: Bug
> Components: master, regionserver
> Affects Versions: 4.0.0-alpha-1, 2.6.4
> Reporter: rstest
> Priority: Major
>
> h1. Summary
> HBase rolling upgrade can race system-region compatibility movement with meta
> crash recovery, leaving `hbase:meta` stuck in `OPENING`.
> h1. Bug Symptom
> During a rolling upgrade from HBase 2.6.4 to HBase 4.0.0-alpha-1-SNAPSHOT,
> the upgraded HMaster can try to move `hbase:meta` through the normal
> system-region compatibility path while the RegionServer hosting `hbase:meta`
> has just failed and meta is already being opened/recovered by another
> assignment procedure.
> The observed sequence is:
> - A 3-node HBase 2.6.4 cluster starts normally.
> - Node0, the HMaster, is upgraded to HBase 4.0.0-alpha-1-SNAPSHOT.
> - Node2, `hregion2`, is killed shortly after the Node0 upgrade.
> - The raw error identifies `hregion2` as the location of `hbase:meta`.
> - The upgraded HMaster runs
> `AssignmentManager.checkIfShouldMoveSystemRegionAsync()`, which moves system
> regions toward higher-version RegionServers during mixed-version operation.
> - That path calls the normal `moveAsync(...)` path for `hbase:meta`.
> - `hbase:meta` is already `OPENING` and already has an active assignment
> procedure.
> - `AssignmentManager.preTransitCheck(...)` rejects the normal move attempt
> because the region has an active procedure.
>
> Expected behavior:
> - If the server hosting `hbase:meta` dies during rolling upgrade, meta
> recovery should be handled by `ServerCrashProcedure` or the active meta
> assignment procedure.
> - The system-region compatibility move path should defer, skip, or retry
> later when `hbase:meta` is already in transition.
> - The master should eventually assign/recover `hbase:meta` without getting
> stuck behind a rejected normal move.
> Actual behavior:
> - The upgraded HMaster logs an `HBaseIOException` reporting `hbase:meta` as
> in transition in state `OPENING`.
> - The normal move path is rejected while `hbase:meta` still has an active
> procedure.
> - The failure is reported only in the mixed-version (old-new) rolling lane of
> this test plan, not in the old-old or new-new baseline lanes.
> Representative exception:
> {code:java}
> 2026-03-07T05:07:45,259 ERROR [Thread-38] assignment.AssignmentManager:
> org.apache.hadoop.hbase.HBaseIOException: state=OPENING, location=hregion2,16020,1772859875787, table=hbase:meta, region=1588230740 is currently in transition; pid=49
>     at org.apache.hadoop.hbase.master.assignment.AssignmentManager.preTransitCheck(AssignmentManager.java:766)
>     at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createMoveRegionProcedure(AssignmentManager.java:879)
>     at org.apache.hadoop.hbase.master.assignment.AssignmentManager.moveAsync(AssignmentManager.java:896)
> {code}
> Relevant code path in the upgraded version:
> - `AssignmentManager.checkIfShouldMoveSystemRegionAsync()` starts a
> background compatibility check that moves system table regions to newer
> RegionServers.
> - The method comments already call out the killed-server race: if a server
> is killed and a new one starts, this thread can think it should move system
> tables while `ServerCrashProcedure` is responsible for assignment recovery.
> - For `hbase:meta`, the method calls `moveAsync(plan)` immediately.
> - `moveAsync(...)` reaches `createMoveRegionProcedure(...)`.
> - `createMoveRegionProcedure(...)` calls `preTransitCheck(...)`.
> - `preTransitCheck(...)` throws if `regionNode.getProcedure() != null`,
> which is exactly the observed `pid=49` state.
> This is not just a harmless log-message mismatch. `hbase:meta` is the
> metadata table used to locate user regions, so leaving it in an unresolved
> `OPENING`/in-transition state can make client and admin operations unable to
> locate regions even while cluster processes are still running.
> h2. How To Reproduce
> One way to reproduce is to trigger the compatibility-move and crash-recovery
> race for `hbase:meta` during a mixed-version rolling upgrade.
> 1. Start a 3-node HBase 2.6.4 cluster with one HMaster and two RegionServers.
> 2. Run a workload that creates normal table/namespace state so the cluster has
> active assignment and meta activity.
> 3. Start a rolling upgrade from HBase 2.6.4 to HBase 4.0.0-alpha-1-SNAPSHOT.
> 4. Upgrade the HMaster node first.
> 5. Shortly after the upgraded HMaster is running, kill the RegionServer
> currently hosting or opening `hbase:meta`.
> In the observed run this was Node2, `hregion2`.
> 6. Continue the rolling upgrade so other RegionServers register at the new
> version.
> 7. Observe that the upgraded HMaster's system-region compatibility check
> tries to move `hbase:meta` through the normal `moveAsync(...)` path while
> `hbase:meta` is already `OPENING` with an active procedure.
> 8. Check the HMaster logs for:
> {code:java}
> state=OPENING ... table=hbase:meta ... is currently in transition;
> pid=...{code}
>
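
For step 8 of the reproduction, a saved HMaster log can be checked for the symptom with a small scanner along these lines. This is only a sketch: `MetaTransitionLogScan` and `stuckMetaPid` are hypothetical names, and it assumes the exception message sits on a single line in the log file (the message quoted above is wrapped by email). The lookaheads make the match independent of the order of the `state=` and `table=` fields.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for log inspection; not part of HBase.
class MetaTransitionLogScan {
    // Both fields must appear somewhere on the line, in any order,
    // followed by the "in transition" text and a numeric procedure id.
    private static final Pattern STUCK_META = Pattern.compile(
        "^(?=.*\\bstate=OPENING\\b)(?=.*\\btable=hbase:meta\\b)"
            + ".*is currently in transition; pid=(\\d+)");

    /** Returns the procedure id if the line reports hbase:meta in transition in OPENING. */
    static OptionalLong stuckMetaPid(String logLine) {
        Matcher m = STUCK_META.matcher(logLine);
        return m.find()
            ? OptionalLong.of(Long.parseLong(m.group(1)))
            : OptionalLong.empty();
    }
}
```

Applying `stuckMetaPid` to each log line and collecting the non-empty results shows which procedure ids were holding `hbase:meta` when the move was rejected.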