This is an automated email from the ASF dual-hosted git repository.

Apache9 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hbase.git


The following commit(s) were added to refs/heads/master by this push:
     new de30a2bdb08 HBASE-30111 Fix for flapper TestBlockBytesScannedQuota by AI (#8117)
de30a2bdb08 is described below

commit de30a2bdb08326ec69195bf0da52e4fa2d18311a
Author: Hari Krishna Dara <[email protected]>
AuthorDate: Fri Apr 24 20:39:57 2026 +0530

    HBASE-30111 Fix for flapper TestBlockBytesScannedQuota by AI (#8117)
    
    AI tool: Claude Code
    AI Model: Sonnet 4.5 (reverified using Opus 4.6 (1M context))
    Prompt: You need to diagnose the HBase test TestBlockBytesScannedQuota that is flapping. It had multiple failures and errors in the PR validation build, but none when I ran locally. The source code is at @../../src/apache/hbase/ and the logs for good run is at @Good-Run and bad run is at @bad-run. Can you compare the logs and try to understand why the run failed and look at the code for a potential fix?
    
    Summary by AI:
    
    The test is flapping due to a **timing/race condition** in the quota system:
    
    1. **5-second timeout too short**: The `testTraffic` method only waited 5 seconds for quotas to take effect
    2. **Quota cache not fully propagated**: On slower systems (like CI), the quota cache refresh can be asynchronous and may not fully propagate in time
    3. **Quotas bypassed**: When cache isn't refreshed, the logs show `"bypass expected false, actual true"`, meaning all requests succeed instead of being throttled
    4. **Insufficient retries**: Each iteration takes ~1.3 seconds, so only 3-4 retries fit in 5 seconds, not enough for the quota system to stabilize
    
    **Bad run pattern:**
    - Test expects 1 successful request but gets 5 (all succeed because quotas not enforced)
    - Retries every ~1.3 seconds for 4 attempts
    - Times out after 5 seconds with "Waiting timed out after [5,000] msec"
    
    **Good run pattern:**
    - Quotas enforced immediately
    - Tests pass quickly (36.97s total vs 63.14s for failed run)
    
    Increased the timeout in `testTraffic()` from **5,000ms to 30,000ms** (line 263). This gives the quota system sufficient time to:
    - Complete cache refresh
    - Propagate quota settings across all components
    - Handle slower CI environments
    
    This is a conservative fix that maintains the retry logic while allowing adequate time for the distributed quota system to stabilize. The 30-second timeout is still reasonable for a test and should handle the asynchronous nature of quota enforcement.
    
    Signed-off-by: Duo Zhang <[email protected]>
---
 .../java/org/apache/hadoop/hbase/quotas/TestBlockBytesScannedQuota.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hbase-server/src/test/java/org/apache/hadoop/hbase/quotas/TestBlockBytesScannedQuota.java b/hbase-server/src/test/java/org/apache/hadoop/hbase/quotas/TestBlockBytesScannedQuota.java
index 79e84349292..67cafa53c5f 100644
--- a/hbase-server/src/test/java/org/apache/hadoop/hbase/quotas/TestBlockBytesScannedQuota.java
+++ b/hbase-server/src/test/java/org/apache/hadoop/hbase/quotas/TestBlockBytesScannedQuota.java
@@ -260,7 +260,7 @@ public class TestBlockBytesScannedQuota {
 
   private void testTraffic(Callable<Long> trafficCallable, long expectedSuccess, long marginOfError)
     throws Exception {
-    TEST_UTIL.waitFor(5_000, () -> {
+    TEST_UTIL.waitFor(30_000, () -> {
       long actualSuccess;
       try {
         actualSuccess = trafficCallable.call();

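The retry arithmetic in the commit message (about 1.3 seconds per iteration, so only about 4 attempts fit in 5 seconds) can be checked with a small simulation. This is a rough sketch and not HBase's `waitFor` implementation: the class and method names are hypothetical, it assumes attempts run back to back and that a new attempt starts whenever the deadline has not yet passed, and the 8,000 ms propagation delay is an invented value used only for illustration.

```java
/** Hypothetical sketch of the timing argument; not HBase's Waiter implementation. */
public class PollingWaitSketch {

  /**
   * Number of attempts that start before the deadline, assuming attempts run
   * back to back at times 0, attemptMs, 2 * attemptMs, ...
   */
  public static long attemptsBeforeTimeout(long timeoutMs, long attemptMs) {
    // Ceiling division: counts every start time strictly below timeoutMs.
    return (timeoutMs + attemptMs - 1) / attemptMs;
  }

  /**
   * Simulated polling on a virtual clock. Returns true if some attempt starts
   * at or after the (assumed) moment the quota becomes active.
   */
  public static boolean simulateWait(long timeoutMs, long attemptMs, long quotaActiveAtMs) {
    for (long now = 0; now < timeoutMs; now += attemptMs) {
      if (now >= quotaActiveAtMs) {
        return true; // quota is enforced, so the predicate would pass
      }
    }
    return false; // deadline reached before the quota took effect
  }

  public static void main(String[] args) {
    long attemptMs = 1_300;       // per-iteration cost taken from the commit message
    long quotaActiveAtMs = 8_000; // assumed propagation delay, illustration only

    System.out.println(attemptsBeforeTimeout(5_000, attemptMs));
    System.out.println(attemptsBeforeTimeout(30_000, attemptMs));
    System.out.println(simulateWait(5_000, attemptMs, quotaActiveAtMs));
    System.out.println(simulateWait(30_000, attemptMs, quotaActiveAtMs));
  }
}
```

Under these assumptions the 5-second window allows 4 attempts and the 30-second window allows 24, so a quota that takes several seconds to propagate is missed by the old timeout and caught by the new one.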