[PR] MSE resilience to silently unreachable servers [pinot]

via GitHub Mon, 20 Apr 2026 15:52:20 -0700


yashmayya opened a new pull request, #18268:
URL: https://github.com/apache/pinot/pull/18268

- **Problem:** Observed in prod — a server node became kernel-dead (kubelet
dead, pod stuck terminating). The broker's MSE dispatch gRPC channel stayed
`READY` (no keep-alive configured), `FailureDetector` never fired, and
`WorkerManager` kept picking the dead server as an intermediate-stage worker on
unrelated queries, causing cluster-wide MSE failures for ~1h38m until the node
came back. SSE queries were mostly unaffected because per-table
`InstanceSelector` already filters by `_routableServers`.
- **Fix 1 — gRPC keep-alive on broker dispatch channels:** plumb keep-alive
through `DispatchClient` → `QueryDispatcher` → `MultiStageBrokerRequestHandler`
via three new configs
(`pinot.query.multistage.dispatch.channel.keep.alive.{time.seconds,timeout.seconds,without.calls}`).
Activates the dormant ``channel.getState() != READY → markServerUnhealthy``
gate in `QueryDispatcher#processResults` so a silently unreachable peer
transitions the channel out of `READY` and gets excluded from routing.
- **Fix 2 — MSE routing honors FailureDetector exclusions:** add
`RoutingManager#getRoutableServerInstanceMap()` (default method delegates to
`getEnabledServerInstanceMap()`). `BaseBrokerRoutingManager` maintains a
`volatile` snapshot of ``enabled ∩ _routableServers``, rebuilt in lockstep with
`_routableServers` under `_globalLock.writeLock()` (O(1) reads, zero per-call
allocation). `MultiClusterRoutingManager` overrides with its existing
aggregation pattern. `WorkerManager` switches three direct-read call sites —
constant-expression query, `useLeafServerForIntermediateStage` fallback, and
`getCandidateServersPerTables` — to the new accessor so excluded servers are
filtered from intermediate-stage worker selection. Leaf-stage partition paths
unchanged since they already flow through per-table `InstanceSelector`.
- **Why both fixes:** Fix 1 alone wouldn't reach every MSE code path because
three `WorkerManager` paths read `_enabledServerInstanceMap` directly, which is
never affected by exclusions. Fix 2 alone wouldn't help in Helix-blind failure
modes (zombie JVM with healthy ZK session, broker↔server partition with ZK
intact) because Helix never updates `ExternalView` / `LiveInstance` in those
scenarios. Together they cover the full failure surface.
- **Defaults:** Keep-alive is **on by default** with conservative values
tuned for Netty's default server-side permits — `keepAliveTime=300s` (matches
`permitKeepAliveTime=5min`), `keepAliveTimeout=30s`,
`keepAliveWithoutCalls=false` (Netty default forbids pings without calls). This
catches persistent / long-running dispatches without risking
`GOAWAY(ENHANCE_YOUR_CALM)`. Operators can tune down (e.g. 30s / 10s / true)
for per-query detection after configuring server-side `permitKeepAliveTime` and
`permitKeepAliveWithoutCalls`. Rationale documented in
`CommonConstants.MultiStageQueryRunner`.
- **Backward compatibility:**
`RoutingManager#getRoutableServerInstanceMap()` has a default implementation
delegating to `getEnabledServerInstanceMap()`, so all existing implementations
(including test mocks and `MockRoutingManagerFactory`) work unchanged. Existing
`QueryDispatcher` and `DispatchClient` constructors preserved.
- **Tests:** Added regression tests in `BrokerRoutingManagerTest` (asserts
`getRoutableServerInstanceMap()` excludes the server after
`excludeServerFromRouting()` and reappears after `includeServerToRouting()`)
and `MultiClusterRoutingManagerTest` (asserts routable-map aggregation across
clusters). Full suite status on affected modules: `BrokerRoutingManagerTest`
5/5, `MultiClusterRoutingManagerTest` 13/13, `WorkerManagerTest` 2/2,
`QueryDispatcherTest` 69/69. Spotless / Checkstyle / license all clean.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] MSE resilience to silently unreachable servers [pinot]

Reply via email to