Duo Zhang created HBASE-30094:
---------------------------------

             Summary: Flaky tests rerun may enter incorrect state
                 Key: HBASE-30094
                 URL: https://issues.apache.org/jira/browse/HBASE-30094
             Project: HBase
          Issue Type: Sub-task
            Reporter: Duo Zhang


Sonnet 4.5 (4.6?) summary for TestRollbackSCP

Root cause
Surefire reruns failed tests in the same JVM without running @BeforeClass
again. The test used a static INJECTED flag with compareAndSet(false, true), so
fault injection (and setKillAndToggleBeforeStoreUpdateInRollback) ran only once
per JVM.
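To make the once-per-JVM behavior concrete, here is a minimal Java sketch of a static one-shot guard like INJECTED. It is not the actual TestRollbackSCP code: the class OneShotInjection and its methods are hypothetical, and the loop only simulates repeated attempts of one test inside a single JVM with no per-attempt reset.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration of a static one-shot guard; not HBase code.
class OneShotInjection {
  // Static state lives as long as the class stays loaded, i.e. for the
  // whole JVM, so it survives reruns executed in the same fork.
  static final AtomicBoolean INJECTED = new AtomicBoolean(false);

  // Succeeds only for the first caller that flips false -> true.
  static boolean tryInject() {
    return INJECTED.compareAndSet(false, true);
  }

  // Simulates `attempts` runs of the same test in one JVM without any
  // reset between them; returns how many runs actually armed injection.
  static int attemptsThatInjected(int attempts) {
    int injected = 0;
    for (int i = 0; i < attempts; i++) {
      if (tryInject()) {
        injected++;
      }
    }
    return injected;
  }

  public static void main(String[] args) {
    // In a fresh JVM only the first of three attempts arms injection.
    System.out.println(attemptsThatInjected(3)); // prints 1
  }
}
```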

After the first failure (e.g. IllegalArgumentException: scheduler queue not
empty at restartMasterProcedureExecutor), INJECTED stayed true. On reruns,
compareAndSet(false, true) always failed, so the executor was never killed during
rollback and procExec.isRunning() stayed true, which made
waitFor(30000, () -> !procExec.isRunning()) time out every time.
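The rerun symptom can be reproduced in miniature. In this sketch, waitFor is a hypothetical polling helper that only loosely mirrors the waitFor(30000, ...) call above (it simply returns false on timeout so the outcome is easy to inspect), and a plain AtomicBoolean stands in for procExec.isRunning(). When the kill is not armed, nothing ever clears the flag, so the wait can only end by timing out.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical simulation; WaitForTimeoutDemo is not an HBase class.
class WaitForTimeoutDemo {
  // Polls `condition` every intervalMs until it holds (returns true) or
  // timeoutMs elapses (returns the final check, false if still unmet).
  static boolean waitFor(long timeoutMs, long intervalMs, BooleanSupplier condition) {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (System.nanoTime() < deadline) {
      if (condition.getAsBoolean()) {
        return true;
      }
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve interrupt status
        return false;
      }
    }
    return condition.getAsBoolean();
  }

  // One test attempt: `running` stands in for procExec.isRunning(). The
  // simulated rollback stops the executor only if the kill was armed.
  static boolean attempt(boolean killArmed, long timeoutMs) {
    AtomicBoolean running = new AtomicBoolean(true);
    if (killArmed) {
      running.set(false); // simulated kill during rollback
    }
    return waitFor(timeoutMs, 10, () -> !running.get());
  }

  public static void main(String[] args) {
    System.out.println("first attempt (armed): " + attempt(true, 200));  // true
    System.out.println("rerun (not armed):     " + attempt(false, 200)); // false
  }
}
```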

So the “three different failures” were really one failure plus two artifacts:
the first attempt hit the real error at line 180, while attempts 2 and 3 were
misleading timeouts at line 177, caused by stale rerun state rather than by the
original bug.

Fix
In @Before (which does run before each attempt, including reruns), reset the
test-only state:

- INJECTED.set(false), so injection can be armed again on each attempt.
- ProcedureTestingUtility.setKillAndToggleBeforeStoreUpdateInRollback(procExec, false),
  so the procedure-executor kill flags from the previous attempt are cleared.

That makes each Surefire attempt behave like a fresh test with respect to
injection, so reruns can pass when the underlying issue has been fixed or was
only flaky.
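A sketch of the per-attempt reset, assuming the lifecycle described in this issue (@BeforeClass once per JVM, @Before before every attempt). The JUnit lifecycle is simulated by hand so the example runs without HBase or JUnit on the classpath; PerAttemptResetDemo, killInRollback and run() are hypothetical, with killInRollback standing in for the executor kill flags that setKillAndToggleBeforeStoreUpdateInRollback toggles.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical simulation of the fix; not the actual TestRollbackSCP code.
class PerAttemptResetDemo {
  // Static, so it survives across attempts in the same JVM (like INJECTED).
  static final AtomicBoolean INJECTED = new AtomicBoolean(false);
  // Stand-in for the executor kill flags set via
  // setKillAndToggleBeforeStoreUpdateInRollback.
  static boolean killInRollback = false;

  // @BeforeClass equivalent: runs once per simulated JVM.
  static void setUpBeforeClass() {
    // one-time cluster setup would go here
  }

  // @Before equivalent with the fix applied: clear test-only state.
  static void setUp() {
    INJECTED.set(false);
    killInRollback = false;
  }

  // Test body: arms fault injection only while INJECTED is still false.
  static boolean testAttempt() {
    if (INJECTED.compareAndSet(false, true)) {
      killInRollback = true;
      return true;
    }
    return false;
  }

  // Simulates one JVM: class-level setup once, then `attempts` attempts,
  // calling setUp() before each only if resetEachAttempt is true.
  // Returns how many attempts armed the injection.
  static int run(int attempts, boolean resetEachAttempt) {
    // Start each simulation from a pristine state, as a fresh JVM would.
    INJECTED.set(false);
    killInRollback = false;
    setUpBeforeClass();
    int armed = 0;
    for (int i = 0; i < attempts; i++) {
      if (resetEachAttempt) {
        setUp();
      }
      if (testAttempt()) {
        armed++;
      }
    }
    return armed;
  }

  public static void main(String[] args) {
    System.out.println("without reset: " + run(3, false)); // 1
    System.out.println("with reset:    " + run(3, true));  // 3
  }
}
```

Without the reset only the first attempt arms injection; with it, every attempt does, which is what lets a rerun exercise the same code path as the original run.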

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
