heguanhui opened a new issue, #64189:
URL: https://github.com/apache/doris/issues/64189

   ### Search before asking
   
   - [X] I had searched in the 
[issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no 
similar issues.
   
   ### Version
   
   master
   
   ### What's Wrong?
   
   `BlockFileCacheTest` has three types of flaky failures:
   
   ---
   
   **Type 1: Background thread interference — `ttl_modify` failure**
   
   The `test_file_cache()` helper creates a `BlockFileCache` that starts 
background threads. These threads asynchronously modify cache state between 
test assertions, causing `file_block->state()` to return EMPTY instead of 
SKIP_CACHE:
   
   ```
   [ RUN      ] BlockFileCacheTest.ttl_modify
   be/test/io/cache/block_file_cache_test.cpp:447: Failure
   Expected: file_block->state() == io::FileBlock::State::SKIP_CACHE
   Actual:    EMPTY == SKIP_CACHE
   ```
   
   Root cause: the background `evict_in_advance` thread evicts releasable 
DOWNLOADED blocks, freeing space so that `try_reserve()` unexpectedly succeeds, 
keeping the state as EMPTY instead of transitioning to SKIP_CACHE.
   
   ---
   
   **Type 2: Background thread interference — `io_error` failure**
   
   This has the same root cause as Type 1, but it manifests as an incorrect 
block count: EMPTY blocks are not removed because references held by background 
threads push `use_count()` above 2:
   
   ```
   [ RUN      ] BlockFileCacheTest.io_error
   be/test/io/cache/block_file_cache_test.cpp:530: Failure
   Expected: mgr.get_file_blocks_num(key) == 9
   Actual:    10 == 9
   ```
   
   Root cause: the `FileBlocksHolder` destructor removes an EMPTY block only 
when `use_count()==2`. If a background thread or another holder still holds a 
reference (`use_count()>2`), the block is not removed and stays in the 
queue.
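
The reference-count condition can be illustrated with plain `std::shared_ptr`. This is only a minimal sketch of the mechanism, not the Doris implementation: `Block`, `ToyCache`, `remove_if_unshared` and `blocks_left_after_release` are hypothetical names, and the real removal path may check more than the count.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>

// Hypothetical stand-in for a cache block; NOT the Doris FileBlock class.
struct Block {
    bool empty = true;
};
using BlockPtr = std::shared_ptr<Block>;

// Toy cache that owns exactly one reference per block.
struct ToyCache {
    std::map<std::size_t, BlockPtr> blocks;

    // Remove the block only if the cache and the calling holder are the
    // sole owners, i.e. use_count() == 2.
    bool remove_if_unshared(std::size_t offset, const BlockPtr& holder_ref) {
        auto it = blocks.find(offset);
        if (it == blocks.end() || it->second != holder_ref) return false;
        if (!holder_ref->empty || holder_ref.use_count() != 2) return false;
        blocks.erase(it);
        return true;
    }
};

// Returns how many blocks remain after a holder tries to drop an EMPTY block,
// optionally while a third party (standing in for a background thread or a
// second holder) keeps its own reference.
std::size_t blocks_left_after_release(bool extra_reference) {
    ToyCache cache;
    cache.blocks[0] = std::make_shared<Block>();
    BlockPtr holder = cache.blocks[0];
    assert(holder.use_count() == 2);           // cache + holder
    BlockPtr background;
    if (extra_reference) background = holder;  // count becomes 3
    cache.remove_if_unshared(0, holder);
    return cache.blocks.size();
}
```

With no extra reference the block is erased. With one, the count is 3, the removal is skipped, and the block count is one higher than the test expects, which matches the `10 == 9` shape of the failure.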
   
   ---
   
   **Type 3: Insufficient async open timeout — `evict_privilege_order_for_ttl` 
failure**
   
   `initialize()` starts a background disk I/O loading thread that sets 
`_async_open_done=true` only on completion. Tests wait only 100ms (100 
iterations × 1ms), which is insufficient under high CPU load:
   
   ```
   [ RUN      ] BlockFileCacheTest.evict_privilege_order_for_ttl
   be/test/io/cache/block_file_cache_test.cpp:6980: Failure
   Expected: cache.get_or_set(key1, offset, 100000, context1) to succeed
     (async open not completed, cache not ready)
   ```
   
   Root cause: the 100ms total timeout is too short when the system is under 
load. The background loading thread needs more time to complete disk I/O and 
set `_async_open_done=true`.
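
A bounded polling wait with a configurable budget might look like the sketch below. It assumes the completion flag can be observed as a `std::atomic<bool>`; `wait_for_flag` and its parameters are hypothetical and do not reflect the actual test helper or how `_async_open_done` is exposed.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Polls `done` until it becomes true or the attempt budget is spent.
// Returns true if the flag was observed set, false on timeout.
// Hypothetical helper: name and signature are illustrative only.
bool wait_for_flag(const std::atomic<bool>& done, int max_attempts,
                   std::chrono::milliseconds interval) {
    for (int i = 0; i < max_attempts; ++i) {
        if (done.load(std::memory_order_acquire)) return true;
        std::this_thread::sleep_for(interval);
    }
    // One last check after the final sleep.
    return done.load(std::memory_order_acquire);
}
```

Calling it with 100 attempts and a 1ms interval reproduces the current 100ms budget. Returning a `bool` lets the test fail with a clear timeout message instead of proceeding against a cache that is not ready.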
   
   ---
   
   ### What You Expected?
   
   Tests should be deterministic and not affected by background threads or 
timing issues.
   
   ### How to Reproduce?
   
   Run `BlockFileCacheTest` repeatedly under CPU load. The flaky failures 
appear intermittently.
   
   ### Anything Else?
   
   All three are test defects, not business code defects. The fix:
   1. For Type 1 & 2: Save the config values, set all background thread 
intervals to 10000000ms during `test_file_cache` and 
`test_file_cache_memory_storage`, and restore the saved values afterwards, so 
that background threads cannot interfere.
   2. For Type 3: Extract a `wait_for_async_open()` helper with 1000 iterations 
× 10ms (10s total timeout) and replace all 91 inline wait loops with it. 
Compared to the original 1ms interval, the 10ms sleep interval also avoids 
adding CPU pressure under load.
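
The save-and-restore step for Type 1 & 2 could be done with a small RAII guard, so the original values come back even if a test returns early. This is a sketch under assumptions: `ScopedOverride`, `interval_inside_guard` and `config::background_interval_ms` are hypothetical names, not actual Doris config entries.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a mutable global config value.
namespace config {
std::int64_t background_interval_ms = 100;
}

// Saves the current value on construction, installs a temporary one,
// and restores the saved value on destruction.
template <typename T>
class ScopedOverride {
public:
    ScopedOverride(T& target, T temporary) : _target(target), _saved(target) {
        _target = temporary;
    }
    ~ScopedOverride() { _target = _saved; }
    ScopedOverride(const ScopedOverride&) = delete;
    ScopedOverride& operator=(const ScopedOverride&) = delete;

private:
    T& _target;
    T _saved;
};

// Returns the interval visible while the guard is alive.
std::int64_t interval_inside_guard() {
    ScopedOverride<std::int64_t> guard(config::background_interval_ms, 10000000);
    return config::background_interval_ms;
}
```

One guard per overridden config value, declared at the top of the test helper, keeps the override scoped to that helper and leaves later tests with the defaults.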
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!

