merlimat opened a new pull request, #25826:
URL: https://github.com/apache/pulsar/pull/25826

   ### Motivation
   
   `SameAuthParamsLookupAutoClusterFailover` maintains a per-service-index 
state array (`Healthy`, `PreFail`, `Failed`, `PreRecover`) updated by a 
periodic check loop. The check loop only probes indices 
`0..currentPulsarServiceIndex`:
   
   ```java
   private void checkPulsarServices() {
       for (int i = 0; i <= currentPulsarServiceIndex; i++) {
           ...
       }
   }
   ```
   
   When we recover to a higher-priority service (`currentPulsarServiceIndex` 
decreases), the loop stops probing the higher-indexed services. If any of those 
indices were in a transient state at the moment of recovery (e.g., `PreFail` 
from a single timed-out probe), they get stuck there because nothing ever 
probes them again to flip them back to `Healthy`.
   
   This causes 
`SameAuthParamsLookupAutoClusterFailoverTest.testAutoClusterFailover` to fail 
intermittently:
   
   ```
   Caused by: java.lang.AssertionError: Arrays differ at element [2]: Healthy 
!= PreFail expected [Healthy] but found [PreFail]
   ```
   
   Concretely: while `currentPulsarServiceIndex` is 2 and the test is waiting 
for index 1 to recover, a single transient probe failure on pulsar2 transitions 
`state[2]: Healthy -> PreFail`. A subsequent successful probe at index 1 
reaches `recoverThreshold` and triggers `updateServiceUrl(1)`, dropping 
`currentPulsarServiceIndex` to 1. From that point on the loop only probes 
indices 0 and 1, and `state[2]` stays at `PreFail` forever — the test's 
3-minute await window never sees `state[2] == Healthy`.
   
   Example failure: 
https://scans.gradle.com/s/7pttiiyo6yybc/tests/task/:pulsar-broker:test/details/org.apache.pulsar.broker.SameAuthParamsLookupAutoClusterFailoverTest/testAutoClusterFailover%5B2%5D(true)/1/output
   
   ### Modifications
   
   In `updateServiceUrl`, when recovering (target index < current index), reset 
state of indices above the new target to `Healthy` and zero their counters. The 
state of an unprobed index is not meaningful — resetting it ensures (a) a 
subsequent failover starts from a clean baseline if it needs to consider those 
services again, and (b) we don't leave stale transient state lying around.
   
   ### Verifying this change
   
   This change is already covered by 
`SameAuthParamsLookupAutoClusterFailoverTest.testAutoClusterFailover` (TLS and 
non-TLS variants). Locally I ran the test 3 times with `--rerun-tasks`; all 
passed in ~90s each.
   
   ### Does this pull request potentially affect one of the following parts:
   
   - [ ] Dependencies (add or upgrade a dependency)
   - [ ] The public API
   - [ ] The schema
   - [ ] The default values of configurations
   - [ ] The threading model
   - [ ] The binary protocol
   - [ ] The REST endpoints
   - [ ] The admin CLI options
   - [ ] The metrics
   - [ ] Anything that affects deployment


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to