janhoy opened a new pull request, #4312: URL: https://github.com/apache/solr/pull/4312
https://issues.apache.org/jira/browse/SOLR-18203

The problem/fix description below and the fix itself were produced by Claude AI. I have limited insight into the actual chaos test and `CloudSolrClient`'s retry logic, so the problem description could be mistaken, and even if the fix seems to work, a more correct fix may exist. Please give your 5 cents.

### Problem

`ChaosMonkeySafeLeaderWithPullReplicasTest` has been failing at 82–100% for months ([fucit report](http://fucit.org/solr-jenkins-reports/failure-report.html)).

<img width="1137" height="242" alt="Skjermbilde 2026-04-21 kl 15 00 00" src="https://github.com/user-attachments/assets/c1ba21d7-3543-4147-bed8-a04bf39ecc65" />

The test asserts zero update exceptions during chaos monkey (random node kills). The failure chain:

1. Chaos monkey kills the shard leader → the node enters **graceful Jetty shutdown** (15s timeout, introduced in [SOLR-17744](https://issues.apache.org/jira/browse/SOLR-17744), April 2025), physically alive but logically dead
2. `CloudSolrClient` routes the next update to the cached (dead) leader → **503 SERVICE_UNAVAILABLE**
3. `requestWithRetryOnStaleState()` sees `RouteException(503)` and retries, but all **5 retries fire within milliseconds** of each other, using the same stale routes
4. A 3-second backoff suppresses the ZK state refresh, so every retry hits the same dead leader → all 503 → update fails → test assertion `assertEquals(0, getFailCount())` fails

Note that although `SOLR_JETTY_GRACEFUL` is `false` by default in production, the test runner always enables it, so this bug would only appear in tests.

### Fix

In the 503 retry path, **wait for a ZooKeeper cluster-state refresh** (blocking, ~100ms) before each retry attempt. This spreads the retries over the leader election period, so that a later retry can route to the newly elected leader. Also bypass the `markMaybeStaleIfOutsideBackoff` 3-second backoff for 503 errors, since the per-retry wait already throttles the retry rate.
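
To make the before/after retry behaviour described above concrete, here is a minimal, self-contained Java sketch. It is a toy simulation on a virtual clock, not the actual `CloudSolrClient` code: `FakeCluster`, `updateWithImmediateRetries`, `updateWithRefreshBeforeRetry`, the node names, the 1 ms retry spacing and the 300 ms election time are all hypothetical. Only the 5 retries, the 3-second backoff and the ~100 ms wait are taken from the description.

```java
public class StaleLeaderRetrySketch {
    static final int OK = 200;
    static final int SERVICE_UNAVAILABLE = 503;
    static final int MAX_RETRIES = 5;             // retry count from the description
    static final long REFRESH_BACKOFF_MS = 3_000; // refresh backoff from the description
    static final long REFRESH_WAIT_MS = 100;      // approximate blocking wait from the description

    /** Hypothetical toy cluster: node1 (old leader) dies at t=0, node2 takes over later. */
    public static final class FakeCluster {
        final long electionDoneAtMs;
        long nowMs = 0; // virtual clock, so the sketch never really sleeps

        public FakeCluster(long electionDoneAtMs) {
            this.electionDoneAtMs = electionDoneAtMs;
        }

        /** What a cluster-state refresh would report as leader at the current virtual time. */
        String leaderInClusterState() {
            return nowMs >= electionDoneAtMs ? "node2" : "node1";
        }

        /** Only the new leader, once elected, accepts updates; everything else answers 503. */
        int send(String node) {
            boolean newLeaderUp = nowMs >= electionDoneAtMs;
            return (newLeaderUp && node.equals("node2")) ? OK : SERVICE_UNAVAILABLE;
        }
    }

    /** Before: retries fire back to back and the backoff suppresses every state refresh. */
    public static boolean updateWithImmediateRetries(FakeCluster cluster) {
        String cachedLeader = "node1";
        long lastRefreshMs = cluster.nowMs; // assume state was refreshed just before the kill
        for (int attempt = 0; attempt <= MAX_RETRIES; attempt++) {
            if (cluster.send(cachedLeader) == OK) {
                return true;
            }
            if (cluster.nowMs - lastRefreshMs >= REFRESH_BACKOFF_MS) {
                cachedLeader = cluster.leaderInClusterState();
                lastRefreshMs = cluster.nowMs;
            }
            cluster.nowMs += 1; // assumed: retries are about a millisecond apart
        }
        return false;
    }

    /** After: on 503, block for a state refresh before each retry and skip the backoff check. */
    public static boolean updateWithRefreshBeforeRetry(FakeCluster cluster) {
        String cachedLeader = "node1";
        for (int attempt = 0; attempt <= MAX_RETRIES; attempt++) {
            if (cluster.send(cachedLeader) == OK) {
                return true;
            }
            cluster.nowMs += REFRESH_WAIT_MS;              // simulated blocking wait
            cachedLeader = cluster.leaderInClusterState(); // refreshed route for the next attempt
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("immediate retries:    " + updateWithImmediateRetries(new FakeCluster(300)));
        System.out.println("refresh before retry: " + updateWithRefreshBeforeRetry(new FakeCluster(300)));
        System.out.println("slow election (10 s): " + updateWithRefreshBeforeRetry(new FakeCluster(10_000)));
    }
}
```

In this toy model the first variant fails, the second succeeds once the virtual clock passes the election time, and the second still fails when the election outlasts the roughly 500 ms retry window. Whether that last case corresponds to any real failure of the test is not established here.
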
The `waitForCollectionRefresh` / `triggerCollectionRefresh` pattern is already used for `INVALID_STATE`/404 stale-state handling; this extends it to the 503 case.

### Testing

Ran `ChaosMonkeySafeLeaderWithPullReplicasTest` several times on both main and this feature branch. On main I got 8 failures out of 20 runs. On this PR branch I got 5 failures out of 20 runs. Thus the fix is not 100% effective, or there may be other code paths in play.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
