merlimat opened a new pull request, #25715:
URL: https://github.com/apache/pulsar/pull/25715

   ### Motivation
   
   `ServiceUnitStateChannel.start()` is invoked via 
`pulsar.runWhenReadyForIncomingRequests(...)`, so the broker starts accepting 
HTTP requests before the channel reaches `Started`. During that window, any 
topic lookup reaches `getOwnerAsync`, which fails immediately with:
   
   ```
   java.lang.IllegalStateException: Invalid channel state:LeaderElectionServiceStarted
       at o.a.p.b.l.e.channel.ServiceUnitStateChannelImpl.getOwnerAsync(ServiceUnitStateChannelImpl.java:539)
   ```
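
The fail-fast behavior described above can be modeled in a few lines. This is a minimal sketch, not the actual `ServiceUnitStateChannelImpl` code: the class `ChannelRaceSketch`, the `markStarted()` helper, and the returned owner name are hypothetical, and only the two state names that appear in this description are modeled.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical model of a channel whose lookups fail fast until it is Started.
class ChannelRaceSketch {
    // Only the two states named in this PR description are modeled here.
    enum ChannelState { LeaderElectionServiceStarted, Started }

    private final AtomicReference<ChannelState> state =
            new AtomicReference<>(ChannelState.LeaderElectionServiceStarted);

    // Stands in for the end of channel start-up (assumed helper, not a Pulsar API).
    void markStarted() {
        state.set(ChannelState.Started);
    }

    // Returns an already-failed future instead of waiting for Started,
    // which is the behavior a polling caller runs into during start-up.
    CompletableFuture<String> getOwnerAsync(String serviceUnit) {
        ChannelState current = state.get();
        if (current != ChannelState.Started) {
            return CompletableFuture.failedFuture(
                    new IllegalStateException("Invalid channel state:" + current));
        }
        return CompletableFuture.completedFuture("broker-1"); // placeholder owner
    }

    public static void main(String[] args) {
        ChannelRaceSketch channel = new ChannelRaceSketch();
        CompletableFuture<String> early = channel.getOwnerAsync("bundle-a");
        System.out.println("failed before start: " + early.isCompletedExceptionally());
        channel.markStarted();
        System.out.println("owner after start: " + channel.getOwnerAsync("bundle-a").join());
    }
}
```

Because the failure is immediate, a caller that retries in a tight loop learns nothing new until the state changes.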
   
   The `@BeforeMethod startBroker` previously included a 
`lookupPartitionedTopic` probe. After a broker restart, that probe hit the 
channel-startup race: each poll iteration failed fast with the channel-state 
error, so the loop spun until the 180s budget expired without giving the 
channel time to finish initialization. Under CI resource contention, 
`channel.start()` (driven by `tableview.fill()` loading existing bundles) can 
take 60–90s, which exceeds the Awaitility budget once a follow-on 30s 
`deferGetOwner` timeout is layered on top.
   
   Failure trace observed on CI for cluster `MultiLoadManagerTest-ee82e900-…`:
   
   ```
   21:08:23 WARN  Broker is not ready yet {broker=pulsar-broker-1, yet=
    --- An unexpected error occurred in the server ---
   Message: Invalid channel state:LeaderElectionServiceStarted
   …
   21:11:39 WARN  Broker is not ready yet {broker=pulsar-broker-1, yet=java.util.concurrent.TimeoutException}
   …
   21:13:36 WARN  Broker is not ready yet {broker=pulsar-broker-1, yet=java.lang.InterruptedException}
   ```
   
   ### Modifications
   
   Drop the `createPartitionedTopic` + `lookupPartitionedTopic` probe from 
`@BeforeMethod startBroker`. The remaining 
`getActiveBrokers().size() == NUM_BROKERS` check already verifies the broker 
is reachable and sees the cluster, which is what the `@BeforeMethod` actually 
needs to assert before the next test runs.
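
For illustration, the remaining readiness condition could be expressed as a plain polling loop. This is a sketch under stated assumptions rather than the test's actual code: `ActiveBrokersReadinessSketch`, `awaitAllBrokersActive`, and the `Supplier` standing in for the `getActiveBrokers()` call are hypothetical, and the real test uses Awaitility instead of a hand-written loop.

```java
import java.time.Duration;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for the readiness wait in @BeforeMethod startBroker.
class ActiveBrokersReadinessSketch {

    // Polls until the supplier reports exactly expectedBrokers entries
    // (NUM_BROKERS in the test) or the timeout elapses.
    static boolean awaitAllBrokersActive(Supplier<List<String>> activeBrokers,
                                         int expectedBrokers,
                                         Duration timeout,
                                         Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            try {
                if (activeBrokers.get().size() == expectedBrokers) {
                    return true;
                }
            } catch (RuntimeException e) {
                // Broker not reachable yet; fall through and retry.
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Simulated admin call: the second broker shows up on the third poll.
        int[] calls = {0};
        Supplier<List<String>> simulated = () ->
                ++calls[0] < 3 ? List.of("broker-0") : List.of("broker-0", "broker-1");
        boolean ready = awaitAllBrokersActive(
                simulated, 2, Duration.ofSeconds(2), Duration.ofMillis(10));
        System.out.println("ready=" + ready + " polls=" + calls[0]);
    }
}
```

The check depends only on broker membership, so it does not go through the `getOwnerAsync` path that fails while the channel is still starting.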
   
   The underlying broker-side race (`getOwnerAsync` failing immediately rather 
than waiting for `Started`) is a separate broker-level issue and is 
intentionally out of scope here.
   
   ### Verifying this change
   
   This change is a trivial test fix; the existing tests cover the load-manager 
behavior.
   
   ### Does this pull request potentially affect one of the following parts:
   
   - [ ] Dependencies (add or upgrade a dependency)
   - [ ] The public API
   - [ ] The schema
   - [ ] The default values of configurations
   - [ ] The threading model
   - [ ] The binary protocol
   - [ ] The REST endpoints
   - [ ] The admin CLI options
   - [ ] The metrics
   - [ ] Anything that affects deployment

