gortiz opened a new issue, #18740:
URL: https://github.com/apache/pinot/issues/18740
# [Umbrella] Cost-based optimization for the multi-stage query engine
## Motivation
The multi-stage query engine plans queries with Apache Calcite, but
optimization today is purely
rule-based (`HepPlanner`): join order is whatever the SQL says, join
strategies are chosen via
explicit hints, and the planner has no cardinality information (`PinotTable`
does not expose a
Calcite `Statistic`; there is a long-standing `TODO: add support for cost
factory` in
`QueryEnvironment`). Calcite already ships the machinery for cost-based
decisions — what Pinot is
missing is **statistics at the broker** and the integration to consume them.
This issue tracks the work to introduce cost-based optimization
incrementally, gated and off by
default at every step.
## Why it matters (measured)
On TPC-H SF=1 (single-server quickstart, 150 samples per variant, medians
with bootstrap 95% CIs),
for queries deliberately written in a poor syntactic join order (smallest
tables first):
| query | literal SQL order | with cost-based reorder | hand-optimized SQL |
|---|---|---|---|
| `customer⋈orders⋈lineitem` (filtered) | 549 ms [516, 570] | 462 ms [438,
479] | 412 ms [390, 430] |
| `region⋈nation⋈supplier⋈lineitem` (filtered) | 253 ms [244, 277] | **80 ms
[80, 81]** | 207 ms [204, 211] |
In the second query the optimizer found a bushy plan (reduce the dimension
chain first, then a
single probe over the fact table) that is **2.6× faster than the
hand-optimized left-deep SQL**
and 3.2× faster than the literal order. Results are identical in all
variants, and the reorder
phase adds no measurable planning overhead when it decides to change nothing.
## Design overview
Three pillars (detailed design in the linked doc / PR descriptions):
1. **Broker statistics subsystem.** Per-segment stats collected from
ZooKeeper metadata the broker
already watches (row count, size, time boundaries — effectively free),
persisted off-heap in an
embedded SQLite store (bounded heap, warm restart, crc-based
reconciliation). Per-column stats
(NDV, min/max, avg byte size) come later via a bounded server fan-out
behind a swappable
`ColumnStatsSource`. Every stat carries a **confidence** tier so that
table types where raw
doc counts are biased (upsert/dedup, consuming segments, hybrid
time-boundary overlap) degrade
to today's behavior instead of producing silently-wrong plans.
2. **Cost model.** Rows-dominated for logical join ordering; extended later
with columnar byte
sizes for exchange-aware decisions (shuffle/broadcast cost is the
dominant term in MSE).
3. **Planner integration.** `PinotTable#getStatistic()` + a chained
`RelMetadataProvider`
(row counts, selectivity incl. time-range estimation from segment time
boundaries) feed
Calcite's `RelMetadataQuery`; a gated join-reorder phase
(`useJoinReorder` query option,
default off) runs between the logical and physical planning phases with
strict eligibility
gates (inner joins only, no hinted joins, all scans must have trusted row
counts, join-count
cap, fall back to the original plan on any error).
## Phase 1 — statistics foundation + join reordering (implemented, to be
split into PRs)
- [ ] PR: statistics contracts (`PinotStatisticsProvider`,
`TableStatistics`, `ColumnStatistics`,
`StatConfidence`; broker-side `StatsStore` / `ColumnStatsSource` SPIs)
- [ ] PR: SQLite-backed `StatsStore` (WAL, Flyway-migrated schema,
corruption auto-recovery)
- [ ] PR: broker T0 stats collection from ZK segment metadata
(`pinot.broker.stats.enabled`,
default false) + table-type semantics (hybrid merge at the time
boundary, upsert/dedup and
consuming-segment confidence)
- [ ] PR: planner wiring (`PinotTable#getStatistic()`, statistics provider
through
`QueryEnvironment.Config`)
- [ ] PR: `RelMetadataProvider` with stats-backed row counts and selectivity
(incl. time-range
selectivity from segment time boundaries)
- [ ] PR: rows-dominated `RelOptCost` implementation
- [ ] PR: gated cost-based join-reorder phase + guardrails (join cap, hint
veto, error fallback)
and plan-level tests
- [ ] PR (independent): quickstart fixes discovered along the way
(`-configFile` dropped by some
quickstarts; multiple `-bootstrapTableDir` support)
## Phase 2 — column statistics and standalone payoffs
- [ ] Server endpoint audit/extension for lean per-segment column stats
(NDV, min/max, avg bytes)
- [ ] `ColumnStatsSource` broker-pull implementation (rate-limited,
jittered, debounced on
rebalance storms; crc-delta fetch)
- [ ] Selectivity refinement: NDV-based equality selectivity, min/max range
selectivity,
null-sentinel handling (numeric null default pollutes min — track
per-column trust)
- [ ] Broker-side min/max segment pruning for the single-stage engine (the
per-column min/max in
the off-heap store removes the historical reason this lives only on
servers — see the
existing `TODO` in `ColumnValueSegmentPruner`)
- [ ] Stats observability: broker metrics (stats age, store size, fallback
reasons), EXPLAIN
annotations showing the estimates used, admin purge endpoint
## Phase 3 — cost-based physical optimization
The current physical optimizer applies a well-defined sequence of
transformation rules and has
served the engine well; by construction it produces a single plan rather
than comparing
alternatives by cost. Two tracks:
- [ ] Short term: **build-side normalization** — use row counts to place the
smaller input on the
hash-build side of a join. Benchmarks above show the logical reorderer
minimizes
intermediate cardinality but cannot account for the engine's
build-side convention (the
remaining 462 vs 412 ms gap in query 1); this is a small,
well-contained improvement.
- [ ] Longer term: a design proposal for a cost-based physical planning
phase — a two-phase
search (bottom-up enumeration of server/distribution candidates per
operator, top-down
worker assignment) where colocation, broadcast-vs-hash, aggregate
placement and parallelism
become cost-driven decisions sharing the same statistics and cost
model. This deserves its
own design document and discussion with the community before any code.
## Cross-cutting
- [ ] Vendor extensibility: statistics source, cost model and rule sets
pluggable via the existing
`RuleSetCustomizer`-style SPI pattern
- [ ] Documentation (config reference, design notes, operations guide for
the stats store)
## Compatibility and safety
- Everything is **off by default** (`pinot.broker.stats.enabled=false`;
`useJoinReorder` query
option default false) and broker-local — no wire-format or mixed-version
concerns.
- Stats-store failures never fail a query: all read paths degrade to the
no-stats behavior.
- Note for reviewers: enabling stats collection changes the cardinality
*estimates* visible to
all MSE planner rules (never correctness); the join-reorder phase is
additionally gated by its
own option.
- New dependencies: `org.xerial:sqlite-jdbc`, `org.flywaydb:flyway-core`
(broker only,
LICENSE-binary updated).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]