SwaraliJoshi opened a new pull request, #8356:
URL: https://github.com/apache/hbase/pull/8356
## Summary
`TestRollbackSCP.testFailAndRollback` is flaky: it intermittently fails with
`java.lang.IllegalArgumentException: scheduler queue not empty` from
`ProcedureExecutor.load()` while restarting the master procedure executor.
The test restarts the `ProcedureExecutor` **in place**, reusing the same
executor and `MasterProcedureScheduler` instances to simulate a failover.
While the executor is being reloaded, other still-running threads of the live
mini-cluster master can push a procedure back into the shared scheduler in
the small window between `scheduler.clear()` and `ProcedureExecutor.load()`'s
`Preconditions.checkArgument(scheduler.size() == 0, ...)`.
Two producers were identified:
- the `asyncTaskExecutor` callback that wakes a procedure after an async meta
update (e.g. `AssignmentManager.persistToMeta`), and
- an incoming `reportRegionStateTransition` RPC from a live region server,
handled on an `RpcServer` handler thread, which wakes a procedure through
`ProcedureEvent.wake` -> `scheduler.addFront`.
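
The window described above can be illustrated with a minimal, single-threaded sketch. It is not HBase code: `RestartRaceSketch`, its `scheduler` deque, `load()` and `restartFails()` are hypothetical stand-ins, and the late wake-up is injected deterministically instead of coming from a real callback or RPC handler thread.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class RestartRaceSketch {
    // Hypothetical stand-in for the shared scheduler reused across the restart.
    static final Deque<String> scheduler = new ArrayDeque<>();

    // Mimics the load() precondition quoted above, without depending on Guava.
    static void load() {
        if (!scheduler.isEmpty()) {
            throw new IllegalArgumentException("scheduler queue not empty");
        }
    }

    // Simulates one in-place restart. If lateWake is true, a producer pushes a
    // procedure back to the front of the queue between clear() and load().
    static boolean restartFails(boolean lateWake) {
        scheduler.clear();
        if (lateWake) {
            scheduler.addFirst("proc-1"); // the wake-up lands inside the window
        }
        try {
            load();
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        } finally {
            scheduler.clear(); // leave the sketch in a clean state
        }
    }

    public static void main(String[] args) {
        System.out.println("no late wake, fails: " + restartFails(false));
        System.out.println("late wake, fails:    " + restartFails(true));
    }
}
```

In the real test the late wake-up comes from another thread, so the failure only shows up when the timing happens to line up, which is consistent with the intermittent failures reported here.
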
This is a test-infrastructure issue: a real master failover starts a fresh
process with a fresh executor/scheduler, so the production `load()`
precondition is not affected. The fix is therefore confined to test code.
## Changes (`ProcedureTestingUtility.restart()`, test-only)
- Wait for the already shut-down `asyncTaskExecutor` to fully terminate
before clearing the scheduler, so any pending async wake-up callback has
finished (this deterministically closes the dominant async producer).
- Reload (`clear` -> `procStore.start` -> `init`) in a bounded retry loop,
retrying only when `load()` fails specifically with
`scheduler queue not empty`. `ProcedureExecutor.stop()` is explicitly safe
to call after a failed `init()`, so this is a clean redo and is robust to
any external producer.
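
The two changes above could look roughly like the following sketch. It is an illustration under stated assumptions, not the code in this PR: `RestartRetrySketch`, `awaitTerminated`, `reloadWithRetry`, the `cleanup` hook and the attempt limit are hypothetical, and the reload step is passed in as a plain `Runnable` instead of calling the real `clear` -> `procStore.start` -> `init` sequence.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class RestartRetrySketch {
    static final String QUEUE_NOT_EMPTY = "scheduler queue not empty";

    // Step 1: block until an already shut-down pool has drained its callbacks.
    static void awaitTerminated(ExecutorService pool, long timeoutMs) {
        try {
            if (!pool.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS)) {
                throw new IllegalStateException("executor did not terminate in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }

    // Step 2: run the reload, retrying only on the specific precondition
    // failure and at most maxAttempts times. Returns the attempts used.
    static int reloadWithRetry(Runnable reload, Runnable cleanup, int maxAttempts) {
        for (int attempt = 1; ; attempt++) {
            try {
                reload.run();
                return attempt;
            } catch (IllegalArgumentException e) {
                String msg = e.getMessage();
                boolean retryable = msg != null && msg.contains(QUEUE_NOT_EMPTY);
                if (!retryable || attempt >= maxAttempts) {
                    throw e; // unrelated failure, or retries exhausted
                }
                cleanup.run(); // stands in for stopping the half-initialized executor
            }
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.submit(() -> { });
        pool.shutdown();
        awaitTerminated(pool, 5_000);

        int[] calls = {0};
        // Hypothetical reload that hits the precondition twice before succeeding.
        int attempts = reloadWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalArgumentException(QUEUE_NOT_EMPTY);
            }
        }, () -> { }, 5);
        System.out.println("attempts=" + attempts); // prints attempts=3
    }
}
```

Matching on the exception message keeps the retry narrow: any other `IllegalArgumentException` still fails the restart immediately.
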
## Test plan
- [x] Ran `TestRollbackSCP` 100x consecutively with the fix: 100/100 passed
(twice).
- [x] Control: reverted the fix and reran: reproduced the failure within a
few iterations.
- [x] Regression: ran affected `hbase-procedure` tests and a representative
`hbase-server` subset (TestSCP, TestProcedureAdmin,
TestTransitRegionStateProcedure, TestCreateTableProcedure): all pass.