adriangb opened a new pull request, #20329:
URL: https://github.com/apache/datafusion/pull/20329

   ## Summary
   
   - **`eq_properties()` in `FileScanConfig` blindly trusted 
`output_ordering`** (set from Parquet `sorting_columns` metadata) without 
verifying that files within a group are in the correct inter-file order
   - `EnforceSorting` then removed `SortExec` based on this unvalidated 
ordering, producing **wrong results** when filesystem order didn't match data 
order
   - Added `validated_output_ordering()` that filters orderings using 
`MinMaxStatistics::new_from_files()` + `is_sorted()` to verify inter-file sort 
order before reporting them to the optimizer
   
   ## Changes
   
   ### `datafusion/datasource/src/file_scan_config.rs`
   - Added `validated_output_ordering()` method on `FileScanConfig` that 
validates each output ordering against actual file group statistics
   - Changed `eq_properties()` to call `self.validated_output_ordering()` 
instead of `self.output_ordering.clone()`
   
   ### `datafusion/sqllogictest/test_files/sort_pushdown.slt`
   Added 8 new regression tests (Tests 4-11):
   
   | Test | Scenario | Key assertion |
   |------|----------|---------------|
   | **4** | Reversed filesystem order (inferred ordering) | SortExec retained 
— wrong inter-file order detected |
   | **5** | Overlapping file ranges (inferred ordering) | SortExec retained — 
overlapping ranges detected |
   | **6** | `WITH ORDER` + reversed filesystem order | SortExec retained 
despite explicit ordering |
   | **7** | Correctly ordered multi-file group (positive) | SortExec 
eliminated — validation passes |
   | **8** | DESC ordering with wrong inter-file DESC order | SortExec retained 
for DESC direction |
   | **9** | Multi-column sort key (overlapping vs non-overlapping) | 
Conservative rejection with overlapping stats; passes with clean boundaries |
   | **10** | Correctly ordered + `WITH ORDER` (positive) | SortExec eliminated 
— both ordering and stats agree |
   | **11** | Multiple partitions (one file per group) | 
`SortPreservingMergeExec` merges; no per-partition sort needed |
   
   ## Test plan
   
   - [x] `cargo test --test sqllogictests -- sort_pushdown` — all new + 
existing tests pass
   - [x] `cargo test -p datafusion-datasource` — 97 unit tests + 6 doc tests 
pass
   - [x] Existing Test 1 (single-file sort pushdown with `WITH ORDER`) still 
eliminates SortExec (no regression)
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to