Duo Zhang created HBASE-30094:
---------------------------------
Summary: Flaky tests rerun may enter incorrect state
Key: HBASE-30094
URL: https://issues.apache.org/jira/browse/HBASE-30094
Project: HBase
Issue Type: Sub-task
Reporter: Duo Zhang
Sonnet 4.5(4.6?) summary for TestRollbackSCP
Root cause
Surefire reruns failed tests in the same JVM without running @BeforeClass
again. The test used a static INJECTED flag with compareAndSet(false, true) so
fault injection (and setKillAndToggleBeforeStoreUpdateInRollback) only ran once
per JVM.
After the first failure (e.g. IllegalArgumentException: scheduler queue not
empty at restartMasterProcedureExecutor), INJECTED stayed true. On reruns,
compareAndSet did nothing, so the executor was never killed in rollback and
procExec.isRunning() stayed true, which made waitFor(30000, () ->
!procExec.isRunning()) always time out.
So the “three different failures” were really: first attempt hit the real error
at line 180; attempts 2–3 were misleading timeouts at line 177 caused by broken
rerun state, not the original bug.
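The once-per-JVM behaviour described above can be reproduced in isolation. The following is only a minimal sketch, not the actual TestRollbackSCP code: the class name OneShotInjectionDemo and the runAttempt() helper are hypothetical stand-ins, and only the static flag and the compareAndSet(false, true) call mirror the test.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class OneShotInjectionDemo {
    // Stand-in for the test's static INJECTED flag; it is shared by every
    // attempt that runs in the same JVM.
    static final AtomicBoolean INJECTED = new AtomicBoolean(false);

    // Hypothetical stand-in for one test attempt. Returns true only if this
    // attempt managed to arm the fault injection.
    static boolean runAttempt() {
        return INJECTED.compareAndSet(false, true);
    }

    public static void main(String[] args) {
        // Three attempts in one JVM, with nothing resetting the flag in between.
        System.out.println("attempt 1 armed: " + runAttempt()); // true
        System.out.println("attempt 2 armed: " + runAttempt()); // false
        System.out.println("attempt 3 armed: " + runAttempt()); // false
    }
}
```

Only the first attempt arms the injection. Later attempts skip it silently, which matches the reruns waiting on a kill that never happens.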
Fix
In @Before (which does run before each attempt, including reruns), reset
test-only state:
- INJECTED.set(false), so injection can arm again on each run.
- ProcedureTestingUtility.setKillAndToggleBeforeStoreUpdateInRollback(procExec,
  false), so procedure-executor kill flags from the previous attempt are cleared.
That makes each Surefire attempt behave like a fresh test with respect to
injection, so a rerun can pass when the underlying failure is intermittent or
has already been fixed.
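The effect of the per-attempt reset can be sketched without JUnit or HBase on the classpath. Everything here is an assumption made for illustration: before() plays the role of the @Before method, the boolean killBeforeStoreUpdateInRollback stands in for the executor state that the ProcedureTestingUtility call clears, and the loop in main() imitates Surefire rerunning the test in the same JVM.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class ResetBeforeEachAttemptDemo {
    // Stand-in for the test's static INJECTED flag.
    static final AtomicBoolean INJECTED = new AtomicBoolean(false);

    // Hypothetical stand-in for the kill flag held by the procedure executor.
    static boolean killBeforeStoreUpdateInRollback = false;

    // Plays the role of the @Before method: clears test-only state so that
    // the next attempt starts from a clean slate.
    static void before() {
        INJECTED.set(false);
        killBeforeStoreUpdateInRollback = false;
    }

    // Hypothetical stand-in for one test attempt. Returns true only if this
    // attempt armed the injection (and therefore set the kill flag).
    static boolean runAttempt() {
        boolean armed = INJECTED.compareAndSet(false, true);
        if (armed) {
            killBeforeStoreUpdateInRollback = true;
        }
        return armed;
    }

    public static void main(String[] args) {
        // Imitates three attempts in one JVM, with the reset before each one.
        for (int attempt = 1; attempt <= 3; attempt++) {
            before();
            System.out.println("attempt " + attempt + " armed: " + runAttempt());
        }
    }
}
```

With the reset in place, every attempt arms the injection, so a rerun exercises the same code path as the first run instead of timing out on stale state.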
--
This message was sent by Atlassian Jira
(v8.20.10#820010)