haridsv opened a new pull request, #8117:
URL: https://github.com/apache/hbase/pull/8117

   AI tool: Claude Code
   AI Model: Sonnet 4.5 (reverified using Opus 4.6 (1M context))
   Prompt: You need to diagnose the HBase test TestBlockBytesScannedQuota that 
is flapping. It had multiple failures and errors in the PR validation build, 
but none when I ran locally. The source code is at @../../src/apache/hbase/ and 
the logs for good run is at @Good-Run and bad run is at @bad-run. Can you 
compare the logs and try to understand why the run failed and look at the code 
for a potential fix?
   
   Summary by AI:
   
   The test is flapping due to a **timing/race condition** in the quota system:
   
   1. **5-second timeout too short**: The `testTraffic` method only waited 5 
seconds for quotas to take effect
   2. **Quota cache not fully propagated**: On slower systems (like CI), the 
quota cache refresh can be asynchronous and may not fully propagate in time
   3. **Quotas bypassed**: When the cache has not been refreshed, the logs show 
`"bypass expected false, actual true"`, meaning all requests succeed instead of 
being throttled
   4. **Insufficient retries**: Each iteration takes ~1.3 seconds, so only 3-4 
retries fit in 5 seconds, not enough for the quota system to stabilize
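
The retry arithmetic in point 4 can be checked with a small sketch. `RetryBudget` and `attemptsStarted` are hypothetical names used only for illustration and are not part of HBase. The sketch assumes the first attempt starts at time 0, every attempt takes the same ~1.3 seconds reported above, and an attempt counts if it starts before the deadline.

```java
// Hypothetical helper for illustration only; not part of HBase.
final class RetryBudget {

    /**
     * Number of attempts that can start before the deadline, assuming the
     * first attempt starts at t = 0 and each attempt takes iterationMs.
     * This is ceil(timeoutMs / iterationMs) in integer arithmetic.
     */
    static long attemptsStarted(long timeoutMs, long iterationMs) {
        if (timeoutMs <= 0 || iterationMs <= 0) {
            throw new IllegalArgumentException("timeout and iteration must be positive");
        }
        return (timeoutMs + iterationMs - 1) / iterationMs;
    }

    public static void main(String[] args) {
        long iterationMs = 1_300; // ~1.3 s per iteration, as observed in the bad run
        System.out.println("5,000 ms budget:  " + attemptsStarted(5_000, iterationMs) + " attempts");
        System.out.println("30,000 ms budget: " + attemptsStarted(30_000, iterationMs) + " attempts");
    }
}
```

Under these assumptions a 5,000 ms budget allows 4 attempts and a 30,000 ms budget allows 24, which is consistent with the 4 attempts seen in the bad run.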
   
   **Bad run pattern:**
   - Test expects 1 successful request but gets 5 (all succeed because quotas 
are not enforced)
   - Retries every ~1.3 seconds for 4 attempts
   - Times out after 5 seconds with "Waiting timed out after [5,000] msec"
   
   **Good run pattern:**
   - Quotas enforced immediately
   - Tests pass quickly (36.97s total vs 63.14s for failed run)
   
   **Fix:** The timeout in `testTraffic()` is increased from **5,000ms to 
30,000ms** (line 263). This gives the quota system sufficient time to:
   - Complete cache refresh
   - Propagate quota settings across all components
   - Handle slower CI environments
   
   This is a conservative fix that maintains the retry logic while allowing 
adequate time for the distributed quota system to stabilize. The 30-second 
timeout is still reasonable for a test and should handle the asynchronous 
nature of quota enforcement.
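
The retry-until-deadline pattern described here can be sketched generically as follows. This is a minimal illustration, not the code changed in this PR and not HBase's own wait utility: `PollingWait`, `waitFor`, and the counter standing in for "quota is enforced" are all hypothetical.

```java
import java.util.function.BooleanSupplier;

// Hypothetical, generic polling helper for illustration; not HBase code.
final class PollingWait {

    /**
     * Re-evaluates the condition every intervalMs until it holds or
     * timeoutMs has elapsed. Returns true if the condition was met,
     * false on timeout or if the waiting thread was interrupted.
     */
    static boolean waitFor(long timeoutMs, long intervalMs, BooleanSupplier condition) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            // Overflow-safe comparison, as recommended for System.nanoTime().
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    public static void main(String[] args) {
        int[] checks = {0};
        // Stand-in for "the quota is now enforced": becomes true on the third check.
        boolean met = waitFor(30_000, 10, () -> ++checks[0] >= 3);
        System.out.println("condition met: " + met + " after " + checks[0] + " checks");
    }
}
```

A longer timeout in this pattern only raises the upper bound on waiting: the loop returns as soon as the condition holds, so runs where quotas are enforced immediately should not become slower.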

