zhuqi-lucas commented on issue #17348: URL: https://github.com/apache/datafusion/issues/17348#issuecomment-4136167492
I've submitted a PR that implements the core optimizations described in this EPIC:

**PR: https://github.com/apache/datafusion/pull/21182**

### What's implemented:

1. **Sort elimination (Exact)**: When files are non-overlapping and internally sorted (via Parquet `sorting_columns` metadata or `WITH ORDER`), `SortExec` is completely removed
2. **Statistics-based file reordering**: Files within each partition are sorted by min/max statistics to approximate the requested order; this benefits TopK/LIMIT via better dynamic filter pruning
3. **Multi-partition support**: Per-partition sort elimination with `SortPreservingMergeExec` for cheap O(n) merge across partitions; files redistributed consecutively to avoid bin-packing interleaving
4. **Automatic ordering inference**: Works with Parquet `sorting_columns` metadata, so no `WITH ORDER` is needed for sorted Parquet files

### Benchmark results (sort elimination):

| Query | Description | Baseline | Optimized | Speedup |
|-------|-------------|----------|-----------|---------|
| Q1 | `ORDER BY ASC` full scan | 159ms | 91ms | **43%** |
| Q2 | `ORDER BY ASC LIMIT 100` | 36ms | 12ms | **67%** |
| Q3 | `SELECT * ORDER BY ASC` | 487ms | 333ms | **31%** |
| Q4 | `SELECT * ORDER BY ASC LIMIT 100` | 119ms | 30ms | **74%** |

Building on the earlier work by @adriangb in #20304 and the sort pushdown infrastructure from #17337.
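
For readers unfamiliar with the idea behind items 1 and 2, here is a minimal Python sketch of statistics-based file reordering and the non-overlap check that makes sort elimination safe. It is not the code from the PR. `FileStats`, `reorder_by_stats`, and `is_non_overlapping` are hypothetical names, the statistics are plain integers, and ranges that merely touch are treated as non-overlapping, which is an assumption the real implementation may not share.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file min/max statistics for the sort column."""
    path: str
    min_value: int
    max_value: int


def reorder_by_stats(files, descending=False):
    """Order files so that scanning them in sequence approximates the requested order."""
    if descending:
        # For DESC, the file holding the largest values should be read first.
        return sorted(files, key=lambda f: f.max_value, reverse=True)
    # For ASC, the file holding the smallest values should be read first.
    return sorted(files, key=lambda f: f.min_value)


def is_non_overlapping(ordered, descending=False):
    """Return True when each file's value range ends before the next one begins.

    Assumption: touching ranges (prev.max == next.min) count as non-overlapping.
    """
    for prev, nxt in zip(ordered, ordered[1:]):
        if descending:
            if prev.min_value < nxt.max_value:
                return False
        elif prev.max_value > nxt.min_value:
            return False
    return True


# Three internally sorted files listed out of order.
files = [
    FileStats("c.parquet", 200, 299),
    FileStats("a.parquet", 0, 99),
    FileStats("b.parquet", 100, 199),
]
ordered = reorder_by_stats(files)
ordered_paths = [f.path for f in ordered]
sort_can_be_removed = is_non_overlapping(ordered)

# Two files whose ranges overlap: reordering still helps, but a sort is still needed.
overlapping = reorder_by_stats(
    [FileStats("y.parquet", 100, 199), FileStats("x.parquet", 0, 150)]
)
overlap_sort_can_be_removed = is_non_overlapping(overlapping)
```

In the first case the reordered files can be concatenated without a sort, provided each file is itself sorted. In the second case the reordering only approximates the requested order, which is the situation where the TopK/LIMIT benefit from item 2 would apply.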
