janhoy opened a new pull request, #4312:
URL: https://github.com/apache/solr/pull/4312

   https://issues.apache.org/jira/browse/SOLR-18203
   
   Below problem/fix description and the fix is done by Claude AI. I have 
limited insight into the actual chaos test and the CloudSolrClient's retry 
logic. Thus the problem description could be mistaken, and even if the fix 
seems to work, there could exist a more correct fix. Please give your 5 cents.
   
   ### Problem
   
   `ChaosMonkeySafeLeaderWithPullReplicasTest` has been failing at 82–100% for 
months ([fucit 
report](http://fucit.org/solr-jenkins-reports/failure-report.html)).
   <img width="1137" height="242" alt="Skjermbilde 2026-04-21 kl  15 00 00" 
src="https://github.com/user-attachments/assets/c1ba21d7-3543-4147-bed8-a04bf39ecc65";
 />
   
   The test asserts zero update exceptions during chaos monkey (random node 
kills). The failure chain:
   
   1. Chaos monkey kills the shard leader → node enters **graceful Jetty 
shutdown** (15s timeout, introduced in 
[SOLR-17744](https://issues.apache.org/jira/browse/SOLR-17744) April 2025), 
physically alive but logically dead
   2. `CloudSolrClient` routes the next update to the cached (dead) leader → 
**503 SERVICE_UNAVAILABLE**
   3. `requestWithRetryOnStaleState()` sees `RouteException(503)` and retries — 
but all **5 retries fire within milliseconds** of each other, using the same 
stale routes
   4. A 3-second backoff suppresses the ZK state refresh, so every retry hits 
the same dead leader → all 503 → update fails → test assertion `assertEquals(0, 
getFailCount())` fails
   
   Note that even if `SOLR_JETTY_GRACEFUL` is `false` by default in production, 
the Test runner always enables it. So this bug would only appear in tests.
   
   ### Fix
   
   In the 503 retry path, **wait for a ZooKeeper cluster-state refresh** 
(blocking, ~100ms) before each retry attempt. This spreads retries over the 
leader election period so the next retry routes to the newly elected leader.
   
   Also bypass the `markMaybeStaleIfOutsideBackoff` 3-second backoff for 503 
errors, since the per-retry wait already throttles the rate naturally.
   
   The `waitForCollectionRefresh` / `triggerCollectionRefresh` pattern is 
already used for `INVALID_STATE`/404 stale-state handling — this extends it to 
the 503 case.
   
   ### Testing
   
   Ran `ChaosMonkeySafeLeaderWithPullReplicasTest` several times both on main 
and this feature branch. On main I got 8 failures out of 20 runs. On this pr 
branch I got 5 failures for 20 runs. Thus the fix is no 100% or there may be 
other code paths in play.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to