RockteMQ-AI commented on issue #10543: URL: https://github.com/apache/rocketmq/issues/10543#issuecomment-4777039190
**Issue Evaluation**

Category: `type/bug` | Status: **Confirmed**

The reported race condition in `ServiceThread#wakeup()` / `waitForRunning()` has been verified against the current codebase (`develop` branch, commit `b5bc1ff`).

**Root Cause Analysis:**

The race window exists between the fast-path CAS failure in `waitForRunning()` and the subsequent `waitPoint.reset()`:

1. Thread A (`waitForRunning`): fast-path CAS `hasNotified(true→false)` fails → proceeds to `reset()`
2. Thread B (`wakeup`): CAS `hasNotified(false→true)` succeeds → `waitPoint.countDown()` → latch state 1→0
3. Thread A: `waitPoint.reset()` calls `setState(startCount)`, which **unconditionally resets state to 1**, discarding the `countDown`
4. Thread A: `waitPoint.await(interval)` blocks for the full interval (default 1000ms)

`CountDownLatch2.reset()` uses an unconditional `setState(startCount)` (not CAS), so any prior `countDown()` is lost.

**Impact:** ServiceThread-based components (CommitLog, FlushRealTimeService, etc.) may experience up to `interval` ms latency spikes under concurrent wakeup pressure. While the system self-heals on the next cycle, this can cause periodic tail latency.

**Severity:** Medium. Self-healing, but causes unnecessary latency spikes.

The suggested fix direction (replacing `CountDownLatch2` with `LockSupport.park/unpark`) is sound, as `LockSupport` does not have this reset-vs-countDown race.

An automated fix proposal will be generated. Reply `/approve` to proceed with PR generation.

---

*Automated evaluation by github-manager-bot*
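
For reference, the four-step interleaving from the root cause analysis can be replayed deterministically on a single thread. The sketch below is only an illustration, not RocketMQ code: `ResettableLatch` is a hypothetical stand-in that models just the state counter of `CountDownLatch2`, and `replayInterleaving` and `parkAfterUnparkMillis` are names invented for this example. The second method shows why the `LockSupport` direction looks promising: a permit granted by `unpark` before `park` is retained, so the waiter does not sleep for the full interval. It is not the proposed patch.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

class LostWakeupSketch {

    // Hypothetical stand-in for CountDownLatch2: models only the state counter.
    static final class ResettableLatch {
        private final int startCount;
        private final AtomicInteger state;

        ResettableLatch(int count) {
            this.startCount = count;
            this.state = new AtomicInteger(count);
        }

        // Decrements the counter, never going below zero.
        void countDown() {
            state.updateAndGet(c -> c > 0 ? c - 1 : 0);
        }

        // Unconditional write, mirroring setState(startCount) in the report.
        void reset() {
            state.set(startCount);
        }

        // True if a subsequent await(...) would have to block.
        boolean wouldBlock() {
            return state.get() != 0;
        }
    }

    // Replays steps 1-4 in the problematic order; returns true if the wakeup was lost.
    static boolean replayInterleaving() {
        AtomicBoolean hasNotified = new AtomicBoolean(false);
        ResettableLatch waitPoint = new ResettableLatch(1);

        // Step 1 (Thread A): fast-path CAS fails because no notification is pending yet.
        boolean fastPathHit = hasNotified.compareAndSet(true, false);

        // Step 2 (Thread B): wakeup() sets the flag and counts the latch down, 1 -> 0.
        if (hasNotified.compareAndSet(false, true)) {
            waitPoint.countDown();
        }

        // Step 3 (Thread A): reset() overwrites the state, discarding the countDown.
        waitPoint.reset();

        // Step 4 (Thread A): await(interval) would now block despite the wakeup.
        return !fastPathHit && waitPoint.wouldBlock();
    }

    // Calls unpark before park on the current thread and returns how long park took.
    static long parkAfterUnparkMillis() {
        LockSupport.unpark(Thread.currentThread()); // permit granted first
        long start = System.nanoTime();
        LockSupport.parkNanos(TimeUnit.SECONDS.toNanos(1)); // consumes the permit, returns at once
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        System.out.println("wakeup lost with reset(): " + replayInterleaving());
        System.out.println("park after unpark took ms: " + parkAfterUnparkMillis());
    }
}
```

Any fix would still need to handle the `hasNotified` flag and spurious returns from `park` correctly; the sketch only isolates the difference between an overwritable latch state and a retained permit.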
