gortiz opened a new pull request, #18741:
URL: https://github.com/apache/pinot/pull/18741
Contributes to #18740 (umbrella: cost-based optimization for the multi-stage
query engine).
This PR implements **Phase 1** of the umbrella issue: the broker statistics
foundation and a
gated, cost-based join-reorder phase for the multi-stage engine. Everything
is **off by default**
and broker-local (no wire-format or mixed-version impact). The commits are
structured so the PR
can be split into the stacked PRs listed in the umbrella issue if reviewers
prefer.
## What's included
**Statistics foundation**
- Statistics contracts: `PinotStatisticsProvider`, `TableStatistics`,
`ColumnStatistics`,
`StatConfidence` (planner SPI) and `StatsStore` / `ColumnStatsSource`
(broker SPIs).
- SQLite-backed `StatsStore`: off-heap persistence (WAL, prepared
statements, Flyway-migrated
schema), crc-based restart reconciliation, corruption auto-recovery, purge.
- Broker T0 stats collection from the ZK segment metadata the broker already
watches
(`pinot.broker.stats.enabled`, default **false**): per-segment row count,
size and time bounds
via a `SegmentZkMetadataFetchListener`, with no extra ZK reads on the
query path.
- Table-type semantics so stats are never silently wrong: hybrid
OFFLINE+REALTIME merge at the
broker time boundary (no double counting), upsert/dedup row counts marked
LOW confidence
(physical docs over-count logical rows), consuming segments downgrade
confidence. The planner
treats LOW/UNKNOWN confidence as "no stats".
**Planner integration**
- `PinotTable#getStatistic()` (memoized) exposes row counts to Calcite;
statistics provider is
threaded through `QueryEnvironment.Config` (default no-op ⇒ zero behavior
change).
- A chained `RelMetadataProvider` with stats-backed row counts and
selectivity, including
time-range selectivity computed from segment time boundaries (the primary
time column from the
table config, applied only when its unit is epoch millis).
- A rows-dominated `RelOptCost` implementation with a deterministic total
order.
**Cost-based join reordering** (`useJoinReorder` query option, default
**false**)
- Runs between the logical Hep phases and the trait/physical phases; drives
`JOIN_TO_MULTI_JOIN` + `MULTI_JOIN_OPTIMIZE` (with project/filter merge)
off the
statistics-backed row counts.
- Strict gates: inner joins only, no hinted joins, every scan must have a
trusted row count,
configurable join-count cap (`joinReorderMaxJoins`, default 10), and any
failure falls back to
the un-reordered plan (a reorder problem must never fail a query).
**Independent fixes found along the way** (can be extracted to a separate PR
on request)
- `-configFile` was silently dropped by quickstarts overriding
`getConfigOverrides()`
(`MultistageEngineQuickStart`, `RealtimeQuickStart`,
`TimeSeriesEngineQuickStart`).
- Multiple `-bootstrapTableDir` directories were not bootstrapped despite
the option declaring
arity `1..*`.
## Measured impact
TPC-H SF=1 (quickstart, 150 samples per variant, median with bootstrap 95%
CI), queries written
in a poor syntactic join order:
| query | literal SQL order | with `useJoinReorder` | hand-optimized SQL |
|---|---|---|---|
| `customer⋈orders⋈lineitem` (filtered) | 549 ms [516, 570] | 462 ms [438,
479] | 412 ms [390, 430] |
| `region⋈nation⋈supplier⋈lineitem` (filtered) | 253 ms [244, 277] | **80 ms
[80, 81]** | 207 ms [204, 211] |
The second query's reordered plan (reduce the dimension chain first, single
probe over the fact
table) beats even the hand-optimized left-deep SQL by 2.6×. Results are
identical across all
variants; with the option disabled, plans are byte-identical to master
(verified against the
full plan-resource test suite).
## Testing
- ~110 new unit tests across the stats store (incl. corruption recovery and
concurrency
semantics), table-type stat semantics, metadata provider/selectivity
(incl. bound-conversion
regression tests), cost ordering, and plan-level join-reorder tests
(EXPLAIN diffs, hint veto,
join-cap fallback, option plumbing).
- Full `pinot-query-planner` module green (1339 tests), including the 574
resource-based plan
tests confirming default-off changes nothing.
## Notes for reviewers
- Enabling `pinot.broker.stats.enabled` changes cardinality **estimates**
visible to all MSE
planner rules (documented on the config key); it never affects
correctness. The join-reorder
phase is additionally gated by its own query option.
- New broker dependencies: `org.xerial:sqlite-jdbc`,
`org.flywaydb:flyway-core` (8.x — the last
line with built-in SQLite support). `LICENSE-binary` updated.
- `StatsStore` is an interface; SQLite is the default implementation and the
storage engine is
swappable.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]