haridsv opened a new pull request, #8057:
URL: https://github.com/apache/hbase/pull/8057

   Jira: [HBASE-30073](https://issues.apache.org/jira/browse/HBASE-30073)
   
   - Fixed `TestZstdDictionarySplitMerge` by adding a missing test dependency
   - Fixes for `TestBlockBytesScannedQuota`, `TestHRegionWithInMemoryFlush`, 
`TestZooKeeper`, `TestStochasticLoadBalancerRegionReplicaSameHosts` and 
`TestRollbackSCP` was provided by AI and verified by running 20 times each in a 
loop.
   
   Here are the AI summaries:
   
   ---
   
   # TestStochasticLoadBalancerRegionReplicaSameHosts
   
   ## **Root cause**
   
   `StochasticLoadBalancer.balanceTable()` set 
`cluster.setStopRequestedAt(startTime + maxRunningTime)` **at the very 
beginning** of the method, **before** build ing costs (`initCosts`), the first 
`computeCost`, `needsBalance`, and other setup.
   
   So the **entire** `maxRunningTime` budget (e.g. **250 ms** in 
`StochasticBalancerTestBase`) was counting **initialization + search**, not 
just the stochastic walk.
   
   On a **slow or busy CI host**, setup alone could take **~250 ms or more**. 
Then when the walk started, `isStopRequested()` was already true (or became 
true af ter a single rejected move). The loop could exit with **`step` still 
0**, **no accepted moves**, and `balanceCluster` returning **`null`** even 
though region repl icas were still colocated on the same host—so 
**`TestStochasticLoadBalancerRegionReplicaSameHosts`** failed. Locally, setup 
is usually only tens of ms, so the bug rarely shows up.
   
   ## **Fix**
   
   **Set `setStopRequestedAt` only after initialization is done**, immediately 
**before** the stochastic `for` loop, using a new timestamp (`searchStartTime + 
maxRunningTime`). That way **`maxRunningTime` applies to the search phase 
only**, which matches the intended meaning of the limit and avoids “no search 
time left” on loaded agents.
   
   # TestRollbackSCP
   
   ## **Root cause**
   
   **Surefire reruns failed tests in the same JVM** without running 
`@BeforeClass` again. The test used a **static** `INJECTED` flag with 
`compareAndSet(false, t rue)` so fault injection (and 
`setKillAndToggleBeforeStoreUpdateInRollback`) only ran **once per JVM**.
   
   After the **first** failure (e.g. `IllegalArgumentException: scheduler queue 
not empty` at `restartMasterProcedureExecutor`), **`INJECTED` stayed `true`**. 
On **reruns**, `compareAndSet` did nothing, so the executor was never killed in 
rollback and **`procExec.isRunning()` stayed true**, which made `waitFor(30000, 
() -> !procExec.isRunning())` **always time out**.
   
   So the “three different failures” were really: **first attempt** hit the 
real error at line 180; **attempts 2–3** were **misleading timeouts** at line 
177 caused by broken rerun state, not the original bug.
   
   ## *Fix*
   
   In **`@Before`** (which **does** run before each attempt, including reruns), 
reset test-only state:
   
   1. **`INJECTED.set(false)`** so injection can arm again on each run.
   2. 
**`ProcedureTestingUtility.setKillAndToggleBeforeStoreUpdateInRollback(procExec,
   false)`** so procedure-executor kill flags from the previous attempt are 
cleared.
   
   That makes each Surefire attempt behave like a fresh test with respect to
   injection, so reruns can pass when the underlying issue is fixed or flaky.
   
   # TestZooKeeper
   
   **Problem Identified:**
   - Line 78 had a duplicate call to `TEST_UTIL.startMiniDFSCluster(2)`
   - This violated the "can only create one" constraint and caused HDFS 
namespace conflicts when tests ran in parallel
   
   **Why it was flaky:**
   - ✅ **Local/isolated runs**: Worked because no leftover HDFS state
   - ❌ **CI/parallel runs**: Failed when other ZooKeeper tests ran first, 
leaving behind HDFS data with incompatible `namespaceID`
   
   **Fix Applied:**
   Removed the duplicate call on line 78. The test now properly:
   1. Starts Mini DFS Cluster (line 71)
   2. Starts Mini ZK Cluster (line 72)
   3. Configures settings (lines 73-77)
   4. Done! (no duplicate cluster start)
   
   This matches the pattern used in other similar tests like 
`TestClusterBootOrder` and aligns with the comment's intent: "Test we can first 
start the ZK cluster by itself."
   
   The test should now be stable across both local and CI environments. You can 
verify by running the test suite again to ensure `TestZooKeeper` no longer 
fails with `InconsistentFSStateException`.
   
   
   # TestHRegionWithInMemoryFlush
   
   **Root Cause Analysis**
   
   The TestHRegionWithInMemoryFlush test is flapping due to zombie background 
threads that continue running after tests timeout, causing directory race 
conditions with subsequent tests.
   
   **What's Happening**
   
   1. testParallelIncrementWithMemStoreFlush and 
testParallelAppendWithMemStoreFlush create background flush threads
   2. These threads run in a busy loop: while (!incrementDone.get())
   3. When the test times out (after ~4 minutes in CI), an InterruptedException 
is thrown during thread.join()
   4. The code never reaches the lines that set the done flag and stop the 
flush thread
   5. The zombie flush thread keeps running, busy-looping trying to flush
   6. Subsequent tests' tearDown() methods clean up directories
   7. The zombie thread tries to access deleted directories → race condition 
errors
   
   **Evidence**
   
   - Bad run shows only 5 tests completed (vs 136 in good run)
   - Thread-4940 logs "Skipping flush because closing" repeatedly for over a 
minute
   - 6 flaky errors about missing directories in unrelated tests like 
testGet_Basic and testScanner_JoinedScanners
   - Error messages: "Parent directory doesn't exist", "chmod: cannot access"
   
   **The Fix**
   
   I've fixed both problematic tests in 
/Users/hdara/src/apache/hbase/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java:
   
   Changes made:
   1. Added interrupt check to flush thread loop: while (!done.get() && 
!Thread.currentThread().isInterrupted())
   2. Wrapped test logic in try-finally to ensure cleanup even on 
timeout/failure
   3. In finally block:
   - Set done flag to true
   - Interrupt the flush thread
   - Wait up to 5 seconds for thread to stop
   - Log warning if thread doesn't stop
   
   Files modified:
   - TestHRegion.java:5322-5368 - testParallelIncrementWithMemStoreFlush
   - TestHRegion.java:5407-5459 - testParallelAppendWithMemStoreFlush
   
   **Why This Only Happened in CI**
   
   CI environments are typically slower and more heavily loaded, causing tests 
to hit the 13-minute timeout threshold. Locally, the tests complete faster, so 
the threads finish normally before cleanup.
   
   The fix ensures that background threads are always properly stopped, 
preventing zombie threads from interfering with subsequent tests.
   
   # TestBlockBytesScannedQuota
   
   The test is flapping due to a **timing/race condition** in the quota system:
   
   1. **5-second timeout too short**: The `testTraffic` method only waited 5 
seconds for quotas to take effect
   2. **Quota cache not fully propagated**: On slower systems (like CI), the 
quota cache refresh can be asynchronous and may not fully propagate in time
   3. **Quotas bypassed**: When cache isn't refreshed, the logs show `"bypass 
expected false, actual true"`, meaning all requests succeed instead of being 
throttled
   4. **Insufficient retries**: Each iteration takes ~1.3 seconds, so only 3-4 
retries fit in 5 seconds, not enough for the quota system to stabilize
   
   **Bad run pattern:**
   - Test expects 1 successful request but gets 5 (all succeed because quotas 
not enforced)
   - Retries every ~1.3 seconds for 4 attempts
   - Times out after 5 seconds with "Waiting timed out after [5,000] msec"
   
   **Good run pattern:**
   - Quotas enforced immediately
   - Tests pass quickly (36.97s total vs 63.14s for failed run)
   
   Increased the timeout in `testTraffic()` from **5,000ms to 30,000ms** (line 
263). This gives the quota system sufficient time to:
   - Complete cache refresh
   - Propagate quota settings across all components
   - Handle slower CI environments
   
   This is a conservative fix that maintains the retry logic while allowing 
adequate time for the distributed quota system to stabilize. The 30-second 
timeout is still reasonable for a test and should handle the asynchronous 
nature of quota enforcement.
   
   # TestBlockBytesScannedQuota
   
   The test is flapping due to a **timing/race condition** in the quota system:
   
   1. **5-second timeout too short**: The `testTraffic` method only waited 5 
seconds for quotas to take effect
   2. **Quota cache not fully propagated**: On slower systems (like CI), the 
quota cache refresh can be asynchronous and may not fully propagate in time
   3. **Quotas bypassed**: When cache isn't refreshed, the logs show `"bypass 
expected false, actual true"`, meaning all requests succeed instead of being 
throttled
   4. **Insufficient retries**: Each iteration takes ~1.3 seconds, so only 3-4 
retries fit in 5 seconds, not enough for the quota system to stabilize
   
   **Bad run pattern:**
   - Test expects 1 successful request but gets 5 (all succeed because quotas 
not enforced)
   - Retries every ~1.3 seconds for 4 attempts
   - Times out after 5 seconds with "Waiting timed out after [5,000] msec"
   
   **Good run pattern:**
   - Quotas enforced immediately
   - Tests pass quickly (36.97s total vs 63.14s for failed run)
   
   Increased the timeout in `testTraffic()` from **5,000ms to 30,000ms** (line 
263). This gives the quota system sufficient time to:
   - Complete cache refresh
   - Propagate quota settings across all components
   - Handle slower CI environments
   
   This is a conservative fix that maintains the retry logic while allowing 
adequate time for the distributed quota system to stabilize. The 30-second 
timeout is still reasonable for a test and should handle the asynchronous 
nature of quota enforcement.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to