zhuqi-lucas commented on issue #17348:
URL: https://github.com/apache/datafusion/issues/17348#issuecomment-4136167492

   I've submitted a PR that implements the core optimizations described in this 
EPIC:
   
   **PR: https://github.com/apache/datafusion/pull/21182**
   
   ### What's implemented:
   
   1. **Sort elimination (Exact)**: When files are non-overlapping and 
internally sorted (via Parquet `sorting_columns` metadata or `WITH ORDER`), 
`SortExec` is completely removed
   2. **Statistics-based file reordering**: Files within each partition are 
sorted by min/max statistics to approximate the requested order — benefits 
TopK/LIMIT via better dynamic filter pruning
   3. **Multi-partition support**: Per-partition sort elimination with 
`SortPreservingMergeExec` for a cheap O(n) merge across partitions; files are 
redistributed to partitions consecutively to avoid the interleaving that 
bin-packing would cause
   4. **Automatic ordering inference**: Works with Parquet `sorting_columns` 
metadata — no `WITH ORDER` needed for sorted Parquet files
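
To make items 1 to 3 concrete, here is a minimal, standalone sketch of the idea. It is not the code from the PR and does not use DataFusion's API: `FileRange`, `reorder_by_stats`, `is_non_overlapping`, and `split_consecutive` are hypothetical names, and it assumes a single ascending `i64` sort key whose min/max come from file statistics.

```rust
/// Hypothetical per-file summary: min/max of the sort key taken from file statistics.
#[derive(Debug, Clone, PartialEq)]
struct FileRange {
    name: String,
    min: i64,
    max: i64,
}

/// Order files by minimum key (ties broken by maximum) to approximate an ascending scan.
fn reorder_by_stats(files: &mut [FileRange]) {
    files.sort_by(|a, b| a.min.cmp(&b.min).then(a.max.cmp(&b.max)));
}

/// True when every file ends at or before the start of the next one.
/// Assumption: if each file is also internally sorted, reading the files
/// back to back already yields ascending output, so no sort is needed.
fn is_non_overlapping(files: &[FileRange]) -> bool {
    files.windows(2).all(|w| w[0].max <= w[1].min)
}

/// Assign files to partitions in consecutive runs, so each partition
/// covers one contiguous key range instead of interleaved ranges.
fn split_consecutive(files: &[FileRange], partitions: usize) -> Vec<Vec<FileRange>> {
    if files.is_empty() || partitions == 0 {
        return Vec::new();
    }
    // Ceiling division: the largest number of files any partition receives.
    let chunk = (files.len() + partitions - 1) / partitions;
    files.chunks(chunk).map(|c| c.to_vec()).collect()
}

fn main() {
    let mut files = vec![
        FileRange { name: "c.parquet".to_string(), min: 200, max: 299 },
        FileRange { name: "a.parquet".to_string(), min: 0, max: 99 },
        FileRange { name: "b.parquet".to_string(), min: 100, max: 199 },
    ];

    reorder_by_stats(&mut files);
    println!("sort can be skipped: {}", is_non_overlapping(&files));

    for (i, group) in split_consecutive(&files, 2).iter().enumerate() {
        let names: Vec<&str> = group.iter().map(|f| f.name.as_str()).collect();
        println!("partition {i}: {names:?}");
    }
}
```

When the ranges do overlap, the reordering still helps as an approximation: a TopK/LIMIT query reads the most promising files first, which is the case item 2 describes.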
   
   ### Benchmark results (sort elimination):
   
   | Query | Description | Baseline | Optimized | Time reduction |
   |-------|-------------|----------|-----------|---------|
   | Q1 | `ORDER BY ASC` full scan | 159ms | 91ms | **43%** |
   | Q2 | `ORDER BY ASC LIMIT 100` | 36ms | 12ms | **67%** |
   | Q3 | `SELECT * ORDER BY ASC` | 487ms | 333ms | **31%** |
   | Q4 | `SELECT * ORDER BY ASC LIMIT 100` | 119ms | 30ms | **74%** |
   
   This builds on the earlier work by @adriangb in #20304 and the sort pushdown 
infrastructure from #17337.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

