yashmayya opened a new pull request, #18268:
URL: https://github.com/apache/pinot/pull/18268

   - **Problem:** Observed in prod — a server node became kernel-dead (kubelet 
dead, pod stuck terminating). The broker's MSE dispatch gRPC channel stayed 
`READY` (no keep-alive configured), `FailureDetector` never fired, and 
`WorkerManager` kept picking the dead server as an intermediate-stage worker on 
unrelated queries, causing cluster-wide MSE failures for ~1h38m until the node 
came back. SSE queries were mostly unaffected because per-table 
`InstanceSelector` already filters by `_routableServers`.
   - **Fix 1 — gRPC keep-alive on broker dispatch channels:** plumb keep-alive 
through `DispatchClient` → `QueryDispatcher` → `MultiStageBrokerRequestHandler` 
via three new configs 
(`pinot.query.multistage.dispatch.channel.keep.alive.{time.seconds,timeout.seconds,without.calls}`).
 Activates the dormant ``channel.getState() != READY → markServerUnhealthy`` 
gate in `QueryDispatcher#processResults` so a silently unreachable peer 
transitions the channel out of `READY` and gets excluded from routing.
   - **Fix 2 — MSE routing honors FailureDetector exclusions:** add 
`RoutingManager#getRoutableServerInstanceMap()` (default method delegates to 
`getEnabledServerInstanceMap()`). `BaseBrokerRoutingManager` maintains a 
`volatile` snapshot of ``enabled ∩ _routableServers``, rebuilt in lockstep with 
`_routableServers` under `_globalLock.writeLock()` (O(1) reads, zero per-call 
allocation). `MultiClusterRoutingManager` overrides with its existing 
aggregation pattern. `WorkerManager` switches three direct-read call sites — 
constant-expression query, `useLeafServerForIntermediateStage` fallback, and 
`getCandidateServersPerTables` — to the new accessor so excluded servers are 
filtered from intermediate-stage worker selection. Leaf-stage partition paths 
unchanged since they already flow through per-table `InstanceSelector`.
   - **Why both fixes:** Fix 1 alone wouldn't reach every MSE code path because 
three `WorkerManager` paths read `_enabledServerInstanceMap` directly, which is 
never affected by exclusions. Fix 2 alone wouldn't help in Helix-blind failure 
modes (zombie JVM with healthy ZK session, broker↔server partition with ZK 
intact) because Helix never updates `ExternalView` / `LiveInstance` in those 
scenarios. Together they cover the full failure surface.
   - **Defaults:** Keep-alive is **on by default** with conservative values 
tuned for Netty's default server-side permits — `keepAliveTime=300s` (matches 
`permitKeepAliveTime=5min`), `keepAliveTimeout=30s`, 
`keepAliveWithoutCalls=false` (Netty default forbids pings without calls). This 
catches persistent / long-running dispatches without risking 
`GOAWAY(ENHANCE_YOUR_CALM)`. Operators can tune down (e.g. 30s / 10s / true) 
for per-query detection after configuring server-side `permitKeepAliveTime` and 
`permitKeepAliveWithoutCalls`. Rationale documented in 
`CommonConstants.MultiStageQueryRunner`.
   - **Backward compatibility:** 
`RoutingManager#getRoutableServerInstanceMap()` has a default implementation 
delegating to `getEnabledServerInstanceMap()`, so all existing implementations 
(including test mocks and `MockRoutingManagerFactory`) work unchanged. Existing 
`QueryDispatcher` and `DispatchClient` constructors preserved.
   - **Tests:** Added regression tests in `BrokerRoutingManagerTest` (asserts 
`getRoutableServerInstanceMap()` excludes the server after 
`excludeServerFromRouting()` and reappears after `includeServerToRouting()`) 
and `MultiClusterRoutingManagerTest` (asserts routable-map aggregation across 
clusters). Full suite status on affected modules: `BrokerRoutingManagerTest` 
5/5, `MultiClusterRoutingManagerTest` 13/13, `WorkerManagerTest` 2/2, 
`QueryDispatcherTest` 69/69. Spotless / Checkstyle / license all clean.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to