SwaraliJoshi opened a new pull request, #8356:
URL: https://github.com/apache/hbase/pull/8356

   ## Summary
   
   `TestRollbackSCP.testFailAndRollback` is flaky: it intermittently fails with
   `java.lang.IllegalArgumentException: scheduler queue not empty` from
   `ProcedureExecutor.load()` while restarting the master procedure executor.
   
   The test restarts the `ProcedureExecutor` **in place**, reusing the same
   executor and `MasterProcedureScheduler` instances to simulate a failover. While
   the executor is being reloaded, other still-running threads of the live
   mini-cluster master can push a procedure back into the shared scheduler in the
   small window between `scheduler.clear()` and `ProcedureExecutor.load()`'s
   `Preconditions.checkArgument(scheduler.size() == 0, ...)`.
   
   Two producers were identified:
   - the `asyncTaskExecutor` callback that wakes a procedure after an async meta
     update (e.g. `AssignmentManager.persistToMeta`), and
   - an incoming `reportRegionStateTransition` RPC from a live region server,
     handled on an `RpcServer` handler thread, which wakes a procedure through
     `ProcedureEvent.wake` -> `scheduler.addFront`.
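
The window described above can be modeled deterministically. The following is a minimal sketch, not HBase code: `SchedulerRaceSketch`, `restartInPlace`, and the `lateProducer` hook are hypothetical names, a plain `Deque` stands in for `MasterProcedureScheduler`, and a hand-written check stands in for the `Preconditions.checkArgument` call in `ProcedureExecutor.load()`.

```java
import java.util.Deque;
import java.util.concurrent.ConcurrentLinkedDeque;

public class SchedulerRaceSketch {
  /** Stand-in (assumption) for the emptiness precondition in ProcedureExecutor.load(). */
  static void load(Deque<String> scheduler) {
    if (!scheduler.isEmpty()) {
      throw new IllegalArgumentException("scheduler queue not empty");
    }
  }

  /**
   * In-place restart: clear the shared queue, then load. The hypothetical
   * lateProducer hook runs between the two steps to model an async callback
   * or RPC handler thread that wakes a procedure inside the window.
   */
  public static void restartInPlace(Deque<String> scheduler, Runnable lateProducer) {
    scheduler.clear();
    lateProducer.run();
    load(scheduler);
  }

  public static void main(String[] args) {
    Deque<String> scheduler = new ConcurrentLinkedDeque<>();
    scheduler.addFirst("pid=1");

    // No producer fires in the window: the precondition holds.
    restartInPlace(scheduler, () -> { });
    System.out.println("quiet restart ok");

    // A producer fires in the window: the precondition fails.
    try {
      restartInPlace(scheduler, () -> scheduler.addFirst("pid=2"));
    } catch (IllegalArgumentException e) {
      System.out.println("racy restart failed: " + e.getMessage());
    }
  }
}
```

In the real test the producer runs on another thread, so the failure is timing-dependent rather than guaranteed; the sketch only pins down the ordering that triggers it.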
   
   This is a test-infrastructure issue: a real master failover starts a fresh
   process with a fresh executor/scheduler, so the production `load()` precondition
   is not affected. The fix is therefore confined to test code.
   
   ## Changes (`ProcedureTestingUtility.restart()`, test-only)
   
   - Wait for the already shut-down `asyncTaskExecutor` to fully terminate before
     clearing the scheduler, so any pending async wake-up callback has finished
     (closes the dominant, async producer deterministically).
   - Reload (`clear` -> `procStore.start` -> `init`) in a bounded retry loop,
     retrying only when `load()` fails specifically with `scheduler queue not
     empty`. `ProcedureExecutor.stop()` is explicitly safe to call after a failed
     `init()`, so this is a clean redo and is robust to any external producer.
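
The two changes can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the PR: `RestartRetrySketch`, `awaitTerminated`, `reloadWithRetry`, and the `reload`/`cleanup` callbacks are hypothetical names, with `reload` standing in for the `clear` -> `procStore.start` -> `init` sequence and `cleanup` for the stop step that precedes a redo.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class RestartRetrySketch {
  static final String QUEUE_NOT_EMPTY = "scheduler queue not empty";

  /** Waits for an executor that was already shut down; returns true once it has terminated. */
  public static boolean awaitTerminated(ExecutorService pool, long timeoutMs) {
    try {
      return pool.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
      return false;
    }
  }

  /**
   * Runs reload up to maxAttempts times. Only an IllegalArgumentException whose
   * message contains "scheduler queue not empty" is retried; anything else, or
   * the last failed attempt, is rethrown. Returns the attempt that succeeded.
   */
  public static int reloadWithRetry(Runnable reload, Runnable cleanup, int maxAttempts) {
    for (int attempt = 1; ; attempt++) {
      try {
        reload.run();
        return attempt;
      } catch (IllegalArgumentException e) {
        String msg = e.getMessage();
        boolean raced = msg != null && msg.contains(QUEUE_NOT_EMPTY);
        if (!raced || attempt >= maxAttempts) {
          throw e;
        }
        cleanup.run(); // hypothetical: undo the failed init before redoing it
      }
    }
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    pool.execute(() -> { });
    pool.shutdown();
    System.out.println("terminated: " + awaitTerminated(pool, 5000));

    int[] calls = {0};
    int attempts = reloadWithRetry(() -> {
      // Simulate two racy reloads followed by a clean one.
      if (calls[0]++ < 2) {
        throw new IllegalArgumentException(QUEUE_NOT_EMPTY);
      }
    }, () -> { }, 5);
    System.out.println("attempts: " + attempts);
  }
}
```

Matching on the exception message keeps the retry narrow: any other `load()` failure still surfaces immediately instead of being masked by the loop.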
   
   ## Test plan
   
   - [x] Ran `TestRollbackSCP` 100x consecutively with the fix: 100/100 passed (twice).
   - [x] Control: reverted the fix and reran: reproduced the failure within a few iterations.
   - [x] Regression: ran affected `hbase-procedure` tests and a representative
         `hbase-server` subset (TestSCP, TestProcedureAdmin,
         TestTransitRegionStateProcedure, TestCreateTableProcedure): all pass.
   
   Made with [Cursor](https://cursor.com)

