haridsv commented on PR #8117: URL: https://github.com/apache/hbase/pull/8117#issuecomment-4301939754
Here is a more detailed reasoning by AI. Attaching the log files from bad run. [bad-run.tar.gz](https://github.com/user-attachments/files/26998411/bad-run.tar.gz) Here are the key pieces of evidence: ## 1. The timeout error (bad run) **File:** `bad-run/org.apache.hadoop.hbase.quotas.TestBlockBytesScannedQuota.txt`, line 6: ``` java.lang.AssertionError: Waiting timed out after [5,000] msec ``` All 3 failures share this same 5-second timeout at `TestBlockBytesScannedQuota.java:263`. ## 2. Quota cache stuck in bypass mode (bad run) **File:** `bad-run/org.apache.hadoop.hbase.quotas.TestBlockBytesScannedQuota-output.txt`, repeated pattern starting around timestamp `18:29:34`: ``` Traffic test yielded 5 successful requests. Expected 1 +/- 0 User limiter for user=runner ... not refreshed, bypass expected false, actual true ``` The test expects only 1 request to succeed (because quotas should throttle), but **5 succeed every time** because the limiter is still in `bypass=true` (no quota enforced). The retry loop hits this same state 4 times in ~4 seconds, then times out at 5 seconds. ## 3. Compare with good run **File:** `good-run/org.apache.hadoop.hbase.quotas.TestBlockBytesScannedQuota-output.txt`: ``` Traffic test yielded 1 successful requests. Expected 1 +/- 1 ``` Quota enforcement kicks in immediately — no "not refreshed" messages at all for these assertions. ## 4. The 5-second timeout in source code **File:** `TestBlockBytesScannedQuota.java`, line 263: ```java TEST_UTIL.waitFor(5_000, () -> { ``` ## 5. The retry penalty eats the budget **File:** `TestBlockBytesScannedQuota.java`, lines 274-277: ```java if (!success) { triggerUserCacheRefresh(TEST_UTIL, false, TABLE_NAME); waitMinuteQuota(); } ``` Each failed attempt calls `triggerUserCacheRefresh`, which in **`ThrottleQuotaTestUtil.java` line 224** does `Thread.sleep(250)` plus a `waitFor(60000, 250, ...)` poll loop, and `waitMinuteQuota()` (line 316) calls `envEdge.incValue(70000)`. The real wall-clock cost of each retry is ~1.3 seconds (visible in the bad-run log timestamps: `18:29:34` → `18:29:35` → `18:29:36` → `18:29:38`), so only 3-4 attempts fit within the 5-second window. On CI (slower), the quota cache takes longer to propagate from bypass to enforcing. The good run on your local machine was fast enough that it resolved on the first attempt. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
