[I] Follow-ups for stats-based RG / file reorder (#21956): multi-column, function-wrapped, compound exprs [datafusion]

via GitHub Fri, 15 May 2026 01:46:28 -0700


zhuqi-lucas opened a new issue, #22198:
URL: https://github.com/apache/datafusion/issues/22198


   Follow-up enhancements to the row-group / file stats-based reorder landed in 
#21956. Listed here so they can be picked up independently.
   
   Today `PreparedAccessPlan::reorder_by_statistics` and 
`ParquetSource::reorder_files` both:
   - only look at `sort_order.first()` — secondary sort keys are ignored
   - only fire when the leading expression is a plain `Column` — 
function-wrapped sorts (`date_trunc('day', ts) DESC`, `CAST(x AS Date) DESC`, 
`ceil(value) DESC`, …) skip the stats-based reorder
   
   Three follow-ups, by priority:
   
   ---
   
   ### P0 — Multi-column lexicographic reorder by per-column min
   
   For `ORDER BY a ASC, b DESC LIMIT N` where both `a` and `b` are plain 
columns in the file schema, lex-sort row groups (and files) by `(min(a) ASC, 
min(b) DESC)` instead of just `min(a) ASC`. Parquet stores per-column stats, so 
the data is already there — switch from `arrow::compute::sort_to_indices` over 
a single `stat_mins` to `lexsort_to_indices` over per-column min arrays, with 
per-column `SortOptions`.
   
   Most useful when the leading sort column has low cardinality (e.g. `ORDER BY 
country, ts DESC LIMIT 100` — RGs that tie on `country.min` should be broken by 
`ts.min`).
   
   Estimated cost: ~30-50 lines + a few unit tests.
   
   ### P1 — Function-wrapped reorder via leaf-column extraction
   
   For sort expressions like `date_trunc('day', ts) DESC` or `CAST(x AS Date) 
DESC` — single leaf `Column` wrapped in a monotonic function chain. Parquet has 
no stats keyed by the function expression, but `min(f(col)) = f(min(col))` when 
`f` is monotonic, so `min(col)`'s order is a valid proxy.
   
   Approach:
   1. Walk the sort expression's `PhysicalExpr::children()` to find the unique 
leaf `Column`.
   2. Verify monotonicity through the wrapper chain — `EquivalenceProperties` 
already does this in `ordering_satisfy`; the same machinery should be reusable 
here.
   3. Drive `reorder_by_statistics` off the leaf column's stats.
   
   DataFusion's existing `evaluate_bounds` / monotonicity inference is the 
right tool — happy to discuss the cleanest hook point.
   
   ### P2 — Compound-expression reorder via interval propagation (probably 
defer)
   
   For sorts on expressions involving more than one column (`a + b DESC`, `a * 
2 - b DESC`, …) parquet stats give per-column intervals; combining them needs 
operator-specific interval arithmetic (`[min(a)+min(b), max(a)+max(b)]` for 
`+`, sign-aware for `*`, etc.). DataFusion has the building blocks 
(`evaluate_bounds`), but real-world workloads rarely sort by compound 
expressions and the integration cost is non-trivial. Listed for completeness; 
probably not worth doing unless a concrete benchmark motivates it.
   
   ---
   
   cc @adriangb @alamb — wanted to keep the parent PR focused on the core 
mechanism + the review feedback you raised; these were good directions that 
came out of the design discussion but seemed better as separate issues.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] Follow-ups for stats-based RG / file reorder (#21956): multi-column, function-wrapped, compound exprs [datafusion]

Reply via email to