zhuqi-lucas opened a new pull request, #21266: URL: https://github.com/apache/datafusion/pull/21266
## Which issue does this PR close?

Related to https://github.com/apache/datafusion/issues/17348

Precursor to https://github.com/apache/datafusion/pull/21182

## Rationale for this change

The sort pushdown benchmark (#21213) uses TPC-H data where file names happen to match sort key order, so the optimization shows no difference vs. main ([comment](https://github.com/apache/datafusion/pull/21182#issuecomment-4158740710)).

This PR generates custom benchmark data with **reversed file names** so the optimization is required to achieve sort elimination:

```
c_high.parquet: l_orderkey 1-200k     (c sorts last alphabetically, but has lowest keys)
b_mid.parquet:  l_orderkey 200k-400k
a_low.parquet:  l_orderkey 400k+      (a sorts first alphabetically, but has highest keys)
```

**On main (without optimization)**:
- Alphabetical file order: `[a_low(400k+), b_mid(200k-400k), c_high(1-200k)]`
- `validated_output_ordering()` sees files out of order → strips ordering
- SortExec stays → slower

**With optimization (#21182)**:
- `sort_files_within_groups_by_statistics()` reorders to `[c_high, b_mid, a_low]`
- Files non-overlapping → ordering valid → SortExec eliminated → faster

## What changes are included in this PR?

- New `data_sort_pushdown` function in `bench.sh` that uses `datafusion-cli` to split TPC-H lineitem data into 3 sorted parquet files with reversed naming
- Updated `run_sort_pushdown` / `run_sort_pushdown_sorted` to use the custom data path

## Test plan

- [x] `cargo clippy -p datafusion-benchmarks`: 0 warnings
- [x] Local benchmark shows sort elimination with optimization PR

🤖 Generated with [Claude Code](https://claude.com/claude-code)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
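To illustrate the rationale above, here is a minimal Python sketch of why the reversed file names matter. It is not DataFusion's implementation: `FileStats`, `ordering_is_valid`, and `sort_by_statistics` are hypothetical stand-ins for the idea behind `validated_output_ordering()` and `sort_files_within_groups_by_statistics()`, and the exact key boundaries (including the upper bound for `a_low.parquet`) are made-up placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file min/max statistics for the sort key (l_orderkey)."""
    name: str
    min_key: int
    max_key: int


def ordering_is_valid(files):
    """Return True if reading the files in this sequence yields ascending,
    non-overlapping key ranges, so no extra sort would be needed."""
    return all(prev.max_key < nxt.min_key for prev, nxt in zip(files, files[1:]))


def sort_by_statistics(files):
    """Reorder files by their minimum key instead of by file name."""
    return sorted(files, key=lambda f: f.min_key)


# Illustrative ranges only; the real boundaries depend on the generated data.
files = [
    FileStats("c_high.parquet", 1, 200_000),
    FileStats("b_mid.parquet", 200_001, 400_000),
    FileStats("a_low.parquet", 400_001, 600_000),  # upper bound is a placeholder
]

# Without the optimization: files are listed alphabetically by name.
alphabetical = sorted(files, key=lambda f: f.name)
print([f.name for f in alphabetical], ordering_is_valid(alphabetical))

# With the optimization: files are reordered by their statistics.
by_stats = sort_by_statistics(files)
print([f.name for f in by_stats], ordering_is_valid(by_stats))
```

In this sketch the alphabetical sequence fails the check because `a_low.parquet` holds the highest keys but comes first, which corresponds to the case where the ordering is stripped and SortExec stays. The statistics-based sequence passes, which corresponds to the case where SortExec can be eliminated.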
