zhuqi-lucas opened a new issue, #22198:
URL: https://github.com/apache/datafusion/issues/22198
Follow-up enhancements to the row-group / file stats-based reorder landed in
#21956. Listed here so they can be picked up independently.
Today `PreparedAccessPlan::reorder_by_statistics` and
`ParquetSource::reorder_files` both:
- only look at `sort_order.first()` — secondary sort keys are ignored
- only fire when the leading expression is a plain `Column` —
function-wrapped sorts (`date_trunc('day', ts) DESC`, `CAST(x AS Date) DESC`,
`ceil(value) DESC`, …) skip the stats-based reorder
Three follow-ups, by priority:
---
### P0 — Multi-column lexicographic reorder by per-column min
For `ORDER BY a ASC, b DESC LIMIT N` where both `a` and `b` are plain
columns in the file schema, lex-sort row groups (and files) by `(min(a) ASC,
min(b) DESC)` instead of just `min(a) ASC`. Parquet stores per-column stats, so
the data is already there — switch from `arrow::compute::sort_to_indices` over
a single `stat_mins` to `lexsort_to_indices` over per-column min arrays, with
per-column `SortOptions`.
Most useful when the leading sort column has low cardinality (e.g. `ORDER BY
country, ts DESC LIMIT 100` — RGs that tie on `country.min` should be broken by
`ts.min`).
Estimated cost: ~30-50 lines + a few unit tests.
### P1 — Function-wrapped reorder via leaf-column extraction
For sort expressions like `date_trunc('day', ts) DESC` or `CAST(x AS Date)
DESC` — single leaf `Column` wrapped in a monotonic function chain. Parquet has
no stats keyed by the function expression, but `min(f(col)) = f(min(col))` when
`f` is monotonic, so `min(col)`'s order is a valid proxy.
Approach:
1. Walk the sort expression's `PhysicalExpr::children()` to find the unique
leaf `Column`.
2. Verify monotonicity through the wrapper chain — `EquivalenceProperties`
already does this in `ordering_satisfy`; the same machinery should be reusable
here.
3. Drive `reorder_by_statistics` off the leaf column's stats.
DataFusion's existing `evaluate_bounds` / monotonicity inference is the
right tool — happy to discuss the cleanest hook point.
### P2 — Compound-expression reorder via interval propagation (probably
defer)
For sorts on expressions involving more than one column (`a + b DESC`, `a *
2 - b DESC`, …) parquet stats give per-column intervals; combining them needs
operator-specific interval arithmetic (`[min(a)+min(b), max(a)+max(b)]` for
`+`, sign-aware for `*`, etc.). DataFusion has the building blocks
(`evaluate_bounds`), but real-world workloads rarely sort by compound
expressions and the integration cost is non-trivial. Listed for completeness;
probably not worth doing unless a concrete benchmark motivates it.
---
cc @adriangb @alamb — wanted to keep the parent PR focused on the core
mechanism + the review feedback you raised; these were good directions that
came out of the design discussion but seemed better as separate issues.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]