gortiz opened a new issue, #18539: URL: https://github.com/apache/pinot/issues/18539
## Background

PR #18519 introduced a three-layer gRPC mailbox back-pressure design for the multi-stage engine:

1. an application-level `isReady()` gate in `GrpcSendingMailbox`,
2. transport-layer tuning (HTTP/2 flow-control window + Netty `WriteBufferWaterMark`), and
3. manual receiver-side inbound flow control with prefetched credit.

During review, the [comment on `MailboxService.java:217` (id `3269229763`)](https://github.com/apache/pinot/pull/18519#discussion_r3269229763) flagged that the existing TODO comment understates the residual risk of unbounded direct memory at large fan-outs. This issue tracks the missing fourth layer: an application-level **global byte budget** across all outbound peers for a single sender (and ideally across all concurrent queries on the same JVM).

## Why the existing knobs are not enough

The current bounds are per-edge, not global:

| Layer | Per-edge bound | Multiplier | Effective cap per node, per query |
| --- | --- | --- | --- |
| Sender, transport (Netty `WriteBufferWaterMark.high`) | `writeBufferHighWaterMark` | `× #peers` | `writeBufferHighWaterMark × #peers` |
| Receiver, transport (HTTP/2 stream window) | `flowControlWindow` | `× #incoming_streams` | `flowControlWindow × #incoming_streams` |

Both are multiplied again by `#concurrent_queries` on a JVM serving multiple MSE queries in parallel.

## When this becomes a problem

At the new 64 MiB defaults shipped in #18519, a sender in a 50–100 server cluster can pin up to:

- 50 peers: `64 MiB × 50 = 3,200 MiB ≈ 3.1 GiB` direct memory per sender, per concurrent query
- 100 peers: `64 MiB × 100 = 6,400 MiB ≈ 6.3 GiB` direct memory per sender, per concurrent query

A handful of concurrent queries with large fan-outs (e.g. a broadcast join across the whole cluster) can still hit the original `OutOfDirectMemoryError` failure mode that PR #18519 was supposed to bound. The risk has only shifted from "unbounded" to "bounded by a large product".

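
The product bound above is easy to check by hand, but a small sketch makes it simpler to try other cluster sizes. The class and method names below are hypothetical (they are not part of Pinot), and the concurrency level of 4 is only an illustrative stand-in for "a handful of concurrent queries":

```java
/** Hypothetical helper for illustration only; not a Pinot class. */
final class DirectMemoryEstimate {
    private static final long MIB = 1024L * 1024L;
    private static final double GIB = 1024.0 * 1024.0 * 1024.0;

    /** Worst-case pinned bytes: per-edge bound x peers x concurrent queries. */
    static long worstCaseBytes(long perEdgeBytes, int peers, int concurrentQueries) {
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        long perQuery = Math.multiplyExact(perEdgeBytes, (long) peers);
        return Math.multiplyExact(perQuery, (long) concurrentQueries);
    }

    static double toGiB(long bytes) {
        return bytes / GIB;
    }

    public static void main(String[] args) {
        long perEdge = 64 * MIB; // the 64 MiB default mentioned in this issue
        for (int peers : new int[] {50, 100}) {
            for (int queries : new int[] {1, 4}) { // 4 is an assumed concurrency level
                long bytes = worstCaseBytes(perEdge, peers, queries);
                System.out.printf("peers=%d queries=%d -> %.3f GiB%n", peers, queries, toGiB(bytes));
            }
        }
    }
}
```

In binary units, 50 peers come to 3.125 GiB per query and 100 peers to 6.25 GiB, so under the assumed 4 concurrent queries a 50-peer sender could already pin up to 12.5 GiB.
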
## Proposed fix

Add a single configurable **global byte budget** across all peers for the gRPC sender path, implemented as a semaphore-style permit:

- A `Semaphore`-like primitive sized at startup from a new `CommonConstants.MultiStageQueryRunner` key (`pinot.query.runner.grpc.sender.global.byte.budget.bytes` or similar).
- Acquired around each `sendContent` call (best-effort: acquire bytes-of-payload permits before the write; release once the channel reports `isReady()` again or the message is acked).
- Defaulted to **disabled** (`-1` / `0`) to preserve current behaviour; only operators who have observed the failure mode opt in.
- Documented sizing formula on the new key: a reasonable starting point is `0.5 × -XX:MaxDirectMemorySize` minus headroom for the receiver-side `flowControlWindow × #incoming_streams` term and any other direct-memory consumers (Netty pooled byte buffers for non-mailbox RPCs, etc.).

This keeps the existing transport-layer knobs as the first-line bound (which is enough for small/moderate clusters) and adds the global cap only as a hard ceiling for large-fan-out / high-concurrency deployments.

## Trade-offs / open questions

- Per-sender vs per-JVM budget: a per-JVM budget needs a shared `Semaphore` in `MailboxService` (or hoisted higher); a per-sender budget is cheaper but does not bound the multi-query case. The per-JVM scope is the one that matches the original OOM failure mode.
- How to count: payload bytes are a reasonable proxy but ignore Netty allocator overhead and gRPC framing.
- Should this be metric-instrumented (current permits held / waiters) before being recommended for general use? Probably yes.

## References

- PR #18519: https://github.com/apache/pinot/pull/18519
- Review comment that surfaced the residual risk: https://github.com/apache/pinot/pull/18519#discussion_r3269229763 (`MailboxService.java:217`)
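
A minimal sketch of the semaphore-style byte budget proposed in this issue, under stated assumptions: `GlobalByteBudget` and its methods are hypothetical names, not existing Pinot code. It uses a `long` counter behind a lock rather than `java.util.concurrent.Semaphore`, because `Semaphore` permits are `int`-valued and a byte budget may exceed 2 GiB. A request larger than the whole budget is clamped to the budget so that a single oversized message cannot block forever; that clamping is a design choice of this sketch, not something the issue specifies.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch of a global byte budget; not a Pinot class. */
final class GlobalByteBudget {
    private final long capacity; // <= 0 means the budget is disabled
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition released = lock.newCondition();
    private long available;

    GlobalByteBudget(long capacityBytes) {
        this.capacity = capacityBytes;
        this.available = Math.max(0L, capacityBytes);
    }

    boolean isEnabled() {
        return capacity > 0;
    }

    /**
     * Reserves up to payloadBytes. Returns the number of bytes actually reserved
     * (pass it to release later), 0 if the budget is disabled, or -1 if the wait
     * timed out or the thread was interrupted.
     */
    long tryAcquire(long payloadBytes, long timeoutMillis) {
        if (!isEnabled() || payloadBytes <= 0) {
            return 0L;
        }
        long want = Math.min(payloadBytes, capacity); // clamp oversized requests
        long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        lock.lock();
        try {
            while (available < want) {
                if (nanos <= 0L) {
                    return -1L; // timed out
                }
                nanos = released.awaitNanos(nanos);
            }
            available -= want;
            return want;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return -1L;
        } finally {
            lock.unlock();
        }
    }

    /** Returns previously reserved bytes to the budget and wakes waiters. */
    void release(long reservedBytes) {
        if (reservedBytes <= 0) {
            return;
        }
        lock.lock();
        try {
            available = Math.min(capacity, available + reservedBytes);
            released.signalAll();
        } finally {
            lock.unlock();
        }
    }

    long availableBytes() {
        lock.lock();
        try {
            return available;
        } finally {
            lock.unlock();
        }
    }
}
```

In the sender path, one instance would presumably be shared per JVM, with `tryAcquire` called before each `sendContent` write and `release` called once the channel reports `isReady()` again or the message is acked. What to do on a `-1` result (fail the query, retry, or fall back to the transport-layer bound) is left open here, as is metric instrumentation of held bytes and waiters.
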
