merlimat opened a new pull request, #25675:
URL: https://github.com/apache/pulsar/pull/25675

   ### Motivation
   
   `SameAuthParamsLookupAutoClusterFailoverTest.testAutoClusterFailover` waits 
for the failover state machine to converge across three phases (failover 0→2, 
recover 2→1, recover 1→0). Each phase used an `Awaitility.untilAsserted(...)` 
lambda that combined three assertions:
   
   1. The per-index `pulsarServiceStateArray` matches the expected states.
   2. `producer.send(...)` succeeds.
   3. `failover.getCurrentPulsarServiceIndex()` returns the expected index.
   
   When the failover state has converged but the producer's underlying 
connection is still being re-established (`updateServiceUrl` calls 
`cnxPool.closeAllConnections()`), the `producer.send(...)` call inside the 
lambda can stall for up to the producer's send timeout (~30s). Each retry of the 
lambda then burns ~30s of the per-phase budget, even though the failover state 
machine itself has already settled. On slow CI agents this exhausts the per-phase 
120s budget at phase 3, and the test fails with `expected [true] but found [false]`.
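   
   To make the budget arithmetic concrete, here is a minimal sketch (the class 
and method names are hypothetical and not part of the test): if every attempt 
of the lambda can block for the full send timeout, only a handful of attempts 
fit into one phase.

```java
// Hypothetical helper, for illustration only; not part of the Pulsar test.
final class RetryBudget {

    // Upper bound on how many lambda attempts fit into the per-phase budget
    // when each attempt may block for blockSeconds before the next poll.
    static long maxAttempts(long budgetSeconds, long blockSeconds) {
        if (blockSeconds <= 0) {
            throw new IllegalArgumentException("blockSeconds must be positive");
        }
        return budgetSeconds / blockSeconds;
    }

    public static void main(String[] args) {
        // 120s per-phase budget, ~30s send timeout per stalled attempt.
        System.out.println(maxAttempts(120, 30)); // prints 4
    }
}
```

   With the values from the description (120s budget, ~30s per stalled send), 
the await loop gets at most about four attempts per phase, regardless of the 
configured poll interval.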
   
   Example failure: 
https://scans.gradle.com/s/xiv7nu4ujnh5c/tests/task/:pulsar-broker:test/details/org.apache.pulsar.broker.SameAuthParamsLookupAutoClusterFailoverTest/testAutoClusterFailover%5B4%5D(false)/1/output
   
   ### Modifications
   
   Split the convergence check from the side checks per phase:
   
   - Wait inside `Awaitility.untilAsserted(...)` only for the per-index state 
and `currentPulsarServiceIndex` (cheap reads on the failover executor).
   - Move `producer.send(...)` outside the await loop so it runs once per phase 
and surfaces send failures directly.
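   
   The split can be sketched roughly as follows. This is not the code from 
the PR: `awaitCondition` is a hand-rolled, boolean-based stand-in for 
`Awaitility.untilAsserted(...)` so the example runs with the JDK alone, and 
`runPhase`, `stateConverged`, and `sendOnce` are hypothetical names.

```java
import java.util.function.BooleanSupplier;

// Illustrative sketch only; names are hypothetical and not taken from the PR.
final class SplitAwaitSketch {

    // Stand-in for Awaitility: poll a cheap condition until it holds or the
    // timeout elapses. Returns whether the condition was met.
    static boolean awaitCondition(BooleanSupplier cheapCheck, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!cheapCheck.getAsBoolean()) {
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    // One phase: wait only on the cheap state check, then run the potentially
    // slow side check exactly once, outside the polling loop.
    static void runPhase(BooleanSupplier stateConverged, Runnable sendOnce, long timeoutMillis) {
        if (!awaitCondition(stateConverged, timeoutMillis, 10)) {
            throw new AssertionError("state did not converge within " + timeoutMillis + " ms");
        }
        sendOnce.run(); // a failure here surfaces directly instead of being retried
    }

    // Simulation: the state converges on the 4th poll. Returns {polls, sends}.
    static int[] demo() {
        int[] counters = new int[2];
        runPhase(() -> ++counters[0] >= 4, () -> counters[1]++, 1_000);
        return counters;
    }

    public static void main(String[] args) {
        int[] c = demo();
        System.out.println("polls=" + c[0] + " sends=" + c[1]);
    }
}
```

   In the simulation the state check is polled four times while the send runs 
once, so a slow send can no longer consume several retries' worth of the 
per-phase budget.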
   
   Also extracted small helpers (`awaitStatesAndIndex`, `assertStatesEqual`) to 
remove the repetitive submit-future-join boilerplate, and bumped the per-phase 
budget to 180s with an overall 12-minute timeout (the probe timeout is 3s and a 
single failed probe during recovery resets `recoverThreshold`, so a phase can 
need up to ~30s of healthy probes to recover).
   
   ### Verifying this change
   
   This change is already covered by existing tests: 
`SameAuthParamsLookupAutoClusterFailoverTest.testAutoClusterFailover` (TLS and 
non-TLS variants).
   
   Locally I ran the test 3 times in a row with fresh Gradle daemons; each run 
took ~17s per variant and all runs passed.
   
   ### Does this pull request potentially affect one of the following parts:
   
   - [ ] Dependencies (add or upgrade a dependency)
   - [ ] The public API
   - [ ] The schema
   - [ ] The default values of configurations
   - [ ] The threading model
   - [ ] The binary protocol
   - [ ] The REST endpoints
   - [ ] The admin CLI options
   - [ ] The metrics
   - [ ] Anything that affects deployment


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
