Re: [PR] Add probe-side runtime filters for inner joins in the multi-stage engine [pinot]

via GitHub Sun, 28 Jun 2026 09:40:37 -0700


yashmayya commented on PR #18848:
URL: https://github.com/apache/pinot/pull/18848#issuecomment-4826730245


   Thanks for the thorough review, @gortiz — these are all fair. Addressed 
below; the branch is updated.
   
   **1. Exact-`IN` footprint cap.** Fixed. `attachRuntimeFilter` now estimates 
the exact-`IN` serialized footprint (`estimateExactInBytes`, one IN list per 
key over the build rows) and **abandons** the filter when it would exceed the 
same `maxBytes` ceiling the bloom tier honors — so there is now one consistent 
footprint ceiling regardless of tier (exact `IN` mode, multi-key, or AUTO). The 
SEMI-shared `attachDynamicFilter` is deliberately left untouched: a SEMI join 
replaces the join with the leaf `IN`, so it must always emit it. Abandoning is 
correctness-neutral here (the real hash join still runs). Unit tests cover the 
abandon-above and keep-below behavior for every storable key type 
(INT/LONG/FLOAT/DOUBLE/STRING/BYTES/BIG_DECIMAL), the strict footprint boundary 
(`== maxBytes` keeps, one byte over abandons), and the multi-key case.
   
   **2. Sizing knobs not tunable.** Kept as fixed constants for v1, as a 
conscious decision (and thanks for the explicit nudge). They only affect 
selectivity/footprint, never correctness, so I'd rather not ship 
`CONFIG_OF_*`/query-option surface that isn't wired end-to-end (the leaf would 
need them threaded through `RuntimeFilterNode`); a follow-up can add real 
config + thread it. Tying back to (1): `maxBytes` now bounds *every* tier, so 
the one knob that governs footprint applies uniformly.
   
   **3. Benchmark.** Added `BenchmarkRuntimeFilterJoin` in `pinot-perf`: a 
1M-row, unique-key (1:1) self-join that **projects a realistically wide probe 
row** (`intCol, longCol, doubleCol, strCol, str2Col, str3Col`) so the 
probe-side shuffle — the thing the filter reduces — dominates, which is the 
stated motivation (a wide fact table shuffled needlessly). A `WHERE t2.intCol % 
buildMod = 0` controls build selectivity; I compare `filter=off` (baseline, no 
hint) vs `filter=auto` (hint-enabled; cluster default stays off) on INT and 
STRING join keys. `buildMod=10000` (~0.01% build → AUTO picks exact `IN`) is 
the high-selectivity case where the reducer can shrink the probe; `buildMod=1` 
(100% build → AUTO picks bloom) is the no-reduction case the feature must be 
left off for. `countJoinInt` (only the key crosses the network) is a reference 
for the overhead floor. Numbers:
   
   | Workload (1M-row self-join, wide probe projection) | Build side | `off` | 
`auto` | Effect |
   |---|---|--:|--:|---|
   | `SELECT` wide, **STRING** key | 100 rows (0.01%), exact `IN` | 139.1 ± 
16.0 | 8.7 ± 1.4 | **15.9× faster** |
   | `SELECT` wide, **INT** key | 100 rows (0.01%), exact `IN` | 7.8 ± 2.3 | 
8.0 ± 2.1 | break-even (within noise) |
   | `SELECT` wide, **STRING** key | 1M rows (100%), bloom | 1259 ± 86 | 1366 ± 
112 | 1.08× slower |
   | `SELECT` wide, **INT** key | 1M rows (100%), bloom | 1169 ± 93 | 1292 ± 
106 | 1.10× slower |
   | `COUNT(*)`, **INT** key (reference) | 100 rows (0.01%) | 7.8 ± 2.8 | 8.6 ± 
2.6 | 1.10× slower |
   | `COUNT(*)`, **INT** key (reference) | 1M rows (100%) | 75.5 ± 10.3 | 210.5 
± 57.8 | 2.79× slower |
   
   <sub>ms/op, AverageTime, 5×5 iterations, 1 fork, 2 servers, JDK 21; ± is the 
99.9% CI.</sub>
   
   Takeaways:
   - **Big, real win where the probe is genuinely expensive to ship through the 
join** — the wide-row STRING-key join with a 0.01% build: **15.9×** (139 → 8.7 
ms). This is the canonical "fact table shuffled needlessly" case: the reducer 
cuts the probe to the ~100 matching keys before the `hash` exchange.
   - **Break-even where the probe is already cheap** — the INT-key join at 1M 
rows: 7.8 → 8.0 ms, *within the measurement noise*. The baseline is already at 
the `COUNT` floor (the INT probe is cheap to hash/shuffle at this scale), so 
there's little probe-side cost for the reducer to remove and its overhead just 
cancels out. The benefit grows with probe scale and per-row key cost; it does 
not regress here.
   - **Cost when nothing can be reduced** — the 100% build (`buildMod=1`): 
1.08–1.10× slower on the wide selects, and 2.79× on the bare `COUNT` (where 
broadcasting/applying the 1M-key bloom dominates an otherwise ~75 ms query and 
the probe can't shrink because every key matches). This is exactly the hazard 
you flagged in (1), and why the feature is opt-in and off by default.
   
   The plan is additive — `EXPLAIN` for the hinted INT case (trimmed):
   ```
   LogicalJoin(condition=[=($2, $6)], joinType=[inner])
     PinotLogicalExchange(distribution=[hash[2]])
       RuntimeFilterRel(probeKeys=[[2]], buildKeys=[[0]], filterType=[AUTO])
         LogicalProject(...) / PinotLogicalTableScan(...)                 -- 
probe leaf
         PinotLogicalExchange(distribution=[broadcast], 
relExchangeType=[PIPELINE_BREAKER])
           LogicalSort(fetch=[1048577])                                   -- 
maxBuildRows + 1 truncation guard
             LogicalProject(id) / LogicalFilter(=cat,'rare') / TableScan  -- 
build keys
     PinotLogicalExchange(distribution=[hash[0]])                         -- 
real hash join, unchanged
       LogicalProject(id) / LogicalFilter(=cat,'rare') / TableScan
   ```
   The reducer sits on the probe leaf, fed by a pipeline-breaker broadcast of 
the (capped) build keys; the hash join below is untouched, so it remains the 
source of truth and the reduction — not a plan change — is what drives the win.
   
   **4. NaN in exact-`IN`.** Confirmed they agree. Added 
`testNaNKeyMatchesBaseline`: a self-join on a `DOUBLE` column where ids 0–2 are 
`NaN`. The MSE hash join matches `NaN = NaN` (baseline `COUNT(*) = 9`: 3 NaN 
probe × 3 NaN build), and the exact-`IN`, bloom, and AUTO reducers all 
reproduce exactly that — so the leaf `IN` predicate and the join agree on 
`NaN`, no probe-`NaN` row is dropped. (Consistent with the existing SEMI path, 
now pinned by a test.)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Add probe-side runtime filters for inner joins in the multi-stage engine [pinot]

Reply via email to