gortiz opened a new issue, #18539:
URL: https://github.com/apache/pinot/issues/18539

   ## Background
   
   PR #18519 introduced a three-layer gRPC mailbox back-pressure design for the 
multi-stage engine:
   
   1. an application-level `isReady()` gate in `GrpcSendingMailbox`,
   2. transport-layer tuning (HTTP/2 flow-control window + Netty 
`WriteBufferWaterMark`), and
   3. manual receiver-side inbound flow control with prefetched credit.
   
   During review, the [comment on `MailboxService.java:217` (id `3269229763`)](https://github.com/apache/pinot/pull/18519#discussion_r3269229763) 
flagged that the residual unbounded-direct-memory risk at large fan-outs was 
understated by the existing TODO comment. This issue is filed to track the 
missing fourth layer: an application-level **global byte budget** across all 
outbound peers for a single sender (and ideally, across all concurrent queries 
on the same JVM).
   
   ## Why the existing knobs are not enough
   
   The current bounds are per-edge, not global:
   
   | Layer | Per-edge bound | Multiplier | Effective cap (per sender or receiver, per query) |
   | --- | --- | --- | --- |
   | Sender, transport (Netty `WriteBufferWaterMark.high`) | `writeBufferHighWaterMark` | `× #peers` | `writeBufferHighWaterMark × #peers` |
   | Receiver, transport (HTTP/2 stream window) | `flowControlWindow` | `× #incoming_streams` | `flowControlWindow × #incoming_streams` |
   
   Both are multiplied again by `#concurrent_queries` on a JVM serving multiple 
MSE queries in parallel.
   
   ## When this becomes a problem
   
   At the new 64 MiB defaults shipped in #18519, a 50–100 server cluster can 
pin up to:
   
   - 50 peers: `64 MiB × 50 = 3200 MiB ≈ 3.13 GiB` direct memory per sender, 
per concurrent query
   - 100 peers: `64 MiB × 100 = 6400 MiB = 6.25 GiB` direct memory per sender, 
per concurrent query
   
   A handful of concurrent queries with large fan-outs (e.g. a broadcast join 
across the whole cluster) can therefore still reach the original 
`OutOfDirectMemoryError` failure mode that PR #18519 was meant to bound: the 
risk has only shifted from "unbounded" to "bounded by a large product".
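   
   The product above can be checked with a few lines of arithmetic. The 
following is a minimal sketch; the class and method names are hypothetical 
and nothing here is Pinot code. It only multiplies the per-edge bound by the 
number of peers and the number of concurrent queries.
   
   ```java
   /** Illustrative arithmetic only: worst-case direct memory pinned by per-edge bounds. */
   final class FanOutMemoryEstimate {
     static final long MIB = 1024L * 1024L;
   
     /** perEdgeBytes x peers x concurrentQueries; multiplyExact throws on long overflow. */
     static long worstCaseBytes(long perEdgeBytes, int peers, int concurrentQueries) {
       return Math.multiplyExact(Math.multiplyExact(perEdgeBytes, (long) peers), (long) concurrentQueries);
     }
   
     static double toGiB(long bytes) {
       return bytes / (1024.0 * 1024.0 * 1024.0);
     }
   
     public static void main(String[] args) {
       long perEdge = 64 * MIB; // the 64 MiB default cited in this issue
       for (int peers : new int[] {50, 100}) {
         for (int queries : new int[] {1, 4}) {
           long bytes = worstCaseBytes(perEdge, peers, queries);
           System.out.println("peers=" + peers + " queries=" + queries + " GiB=" + toGiB(bytes));
         }
       }
     }
   }
   ```
   
   With these inputs, 50 peers and 4 concurrent queries already give 12.5 GiB, 
and 100 peers with 4 concurrent queries give 25 GiB.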
   
   ## Proposed fix
   
   Add a single configurable **global byte budget** across all peers for the 
gRPC sender path, implemented as a semaphore-style permit:
   
   - A `Semaphore`-like primitive sized at startup from a new 
`CommonConstants.MultiStageQueryRunner` key 
(`pinot.query.runner.grpc.sender.global.byte.budget.bytes` or similar).
   - Permits are acquired around each `sendContent` call on a best-effort 
basis: acquire as many permits as the payload has bytes before the write, and 
release them once the channel reports `isReady()` again or the message is acked.
   - **Disabled** by default (`-1` / `0`) to preserve current behaviour; only 
operators who have observed the failure mode opt in.
   - Documented sizing formula on the new key: a reasonable starting point is 
`0.5 × -XX:MaxDirectMemorySize` minus headroom for the receiver-side 
`flowControlWindow × #incoming_streams` term and any other direct-memory 
consumers (Netty pooled byte buffers for non-mailbox RPCs, etc.).
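   
   One possible reading of the sizing formula, as a sketch: take half of the 
direct-memory limit, then subtract the receiver-side term and an allowance for 
other consumers. The class, method, and parameter names are hypothetical, and 
the direct-memory limit is passed in by the caller rather than read from the 
JVM.
   
   ```java
   /** Hypothetical helper illustrating the suggested starting point for the budget. */
   final class BudgetSizing {
     /**
      * 0.5 x maxDirectMemory - (flowControlWindow x incomingStreams) - otherDirectBytes,
      * clamped at zero. All arguments are in bytes except incomingStreams.
      */
     static long suggestedBudgetBytes(long maxDirectMemoryBytes, long flowControlWindowBytes,
         int incomingStreams, long otherDirectBytes) {
       long half = maxDirectMemoryBytes / 2;
       long receiverSide = Math.multiplyExact(flowControlWindowBytes, (long) incomingStreams);
       return Math.max(half - receiverSide - otherDirectBytes, 0L);
     }
   }
   ```
   
   For example, a 16 GiB limit, a 64 MiB window, 20 incoming streams, and 
512 MiB reserved for other consumers yield 8 GiB - 1.25 GiB - 0.5 GiB = 
6.25 GiB. A result of zero means the formula leaves no room for a sender 
budget; because `0` is also the proposed "disabled" value, a caller would 
need to treat that case as a misconfiguration rather than pass it through.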
   
   This keeps the existing transport-layer knobs as the first-line bound (which 
is enough for small/moderate clusters) and only adds the global cap as a hard 
ceiling for large-fan-out / high-concurrency deployments.
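   
   A minimal sketch of what such a budget could look like, built on 
`java.util.concurrent.Semaphore`. The class `GlobalByteBudget` and its methods 
are hypothetical, not existing Pinot code. One deviation from the description 
above is an assumption of this sketch: `Semaphore` permits are `int`s, so 
permits are counted in 1 KiB units instead of single bytes to allow budgets 
above 2 GiB.
   
   ```java
   import java.util.concurrent.Semaphore;
   import java.util.concurrent.TimeUnit;
   
   /** Hypothetical process-wide byte budget for outbound mailbox writes. */
   final class GlobalByteBudget {
     // Semaphore permits are ints, so count in 1 KiB units to allow budgets above 2 GiB.
     private static final int UNIT_BYTES = 1024;
   
     private final Semaphore _permits; // null means the budget is disabled
     private final int _totalUnits;
   
     GlobalByteBudget(long budgetBytes) {
       if (budgetBytes <= 0) { // -1 or 0 disables the budget, preserving current behaviour
         _permits = null;
         _totalUnits = 0;
       } else {
         _totalUnits = (int) Math.min(Math.max(1L, budgetBytes / UNIT_BYTES), Integer.MAX_VALUE);
         _permits = new Semaphore(_totalUnits, true); // fair ordering for the timed variant
       }
     }
   
     /** Non-blocking attempt; always succeeds when the budget is disabled. */
     boolean tryAcquire(long payloadBytes) {
       return _permits == null || _permits.tryAcquire(toUnits(payloadBytes));
     }
   
     /** Waits up to the timeout, for callers that can block before writing. */
     boolean tryAcquire(long payloadBytes, long timeout, TimeUnit unit) throws InterruptedException {
       return _permits == null || _permits.tryAcquire(toUnits(payloadBytes), timeout, unit);
     }
   
     /** Must be called with the same payload size that was acquired. */
     void release(long payloadBytes) {
       if (_permits != null) {
         _permits.release(toUnits(payloadBytes));
       }
     }
   
     /** Bytes currently available, or Long.MAX_VALUE when disabled. */
     long availableBytes() {
       return _permits == null ? Long.MAX_VALUE : (long) _permits.availablePermits() * UNIT_BYTES;
     }
   
     // Round up to whole units, and clamp to the total so that a single payload larger
     // than the whole budget can still be sent alone instead of blocking forever.
     private int toUnits(long payloadBytes) {
       long units = (payloadBytes + UNIT_BYTES - 1) / UNIT_BYTES;
       return (int) Math.min(Math.max(units, 1L), _totalUnits);
     }
   }
   ```
   
   In this sketch a sender would call `tryAcquire` before the write and 
`release` with the same size once the channel is ready again or the message is 
acked. What a sender should do when the acquire fails or times out (block, 
fail the query, or fall back to the transport-layer bound) is left open here.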
   
   ## Trade-offs / open questions
   
   - Per-sender vs per-JVM budget: a per-JVM budget needs a shared `Semaphore` 
in `MailboxService` (or hoisted higher); a per-sender budget is cheaper but 
does not bound the multi-query case. The per-JVM scope is the one that matches 
the original OOM failure mode.
   - How to count: payload bytes are a reasonable proxy but ignore Netty 
allocator overhead and the gRPC framing.
   - Should this be metric-instrumented (current permits held / waiters) before 
being recommended for general use? Probably yes.
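   
   On the instrumentation question, `java.util.concurrent.Semaphore` already 
exposes the two numbers mentioned above. The sketch below wraps them as plain 
`LongSupplier` gauges; the class name is hypothetical and the wiring into 
Pinot's metrics registry is deliberately left out.
   
   ```java
   import java.util.concurrent.Semaphore;
   import java.util.function.LongSupplier;
   
   /** Hypothetical gauges over a budget semaphore. */
   final class BudgetGauges {
     private final Semaphore _permits;
     private final int _totalPermits;
   
     BudgetGauges(Semaphore permits, int totalPermits) {
       _permits = permits;
       _totalPermits = totalPermits;
     }
   
     /** Permits currently held by senders (total minus available). */
     LongSupplier permitsHeld() {
       return () -> (long) _totalPermits - _permits.availablePermits();
     }
   
     /** Estimate of threads waiting to acquire; getQueueLength() is documented as an estimate. */
     LongSupplier waiters() {
       return () -> _permits.getQueueLength();
     }
   }
   ```
   
   Both values are snapshots and can change while they are being read, so they 
suit monitoring rather than control decisions.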
   
   ## References
   
   - PR #18519: https://github.com/apache/pinot/pull/18519
   - Review comment that surfaced the residual risk: 
https://github.com/apache/pinot/pull/18519#discussion_r3269229763 
(`MailboxService.java:217`)
   



