friendlymatthew opened a new pull request, #17129: URL: https://github.com/apache/datafusion/pull/17129
## Which issue does this PR close?

- Closes https://github.com/apache/datafusion/issues/17077

## Rationale for this change

This PR modifies `DataSourceExec::try_swapping_with_projection` to preserve equivalence properties when creating a new `DataSourceExec`.

DataFusion was losing equivalence properties when projection pushdown occurred after filter pushdown. Consider this example:

```sql
COPY (
  SELECT
    '00000000000000000000000000000001' AS trace_id,
    '2023-10-01 00:00:00'::timestamptz AS start_timestamp,
    'prod' AS deployment_environment
) TO 'data/1.parquet';

COPY (
  SELECT
    '00000000000000000000000000000002' AS trace_id,
    '2024-10-01 00:00:00'::timestamptz AS start_timestamp,
    'staging' AS deployment_environment
) TO 'data/2.parquet';

CREATE EXTERNAL TABLE t1 STORED AS PARQUET LOCATION 'data/';

SET datafusion.execution.parquet.pushdown_filters = true;

SELECT deployment_environment
FROM t1
WHERE trace_id = '00000000000000000000000000000002'
ORDER BY start_timestamp, trace_id;

/*
SanityCheckPlan
caused by
Error during planning: Plan: ["SortPreservingMergeExec: [start_timestamp@1 ASC NULLS LAST, trace_id@2 ASC NULLS LAST]",
 " SortExec: expr=[start_timestamp@1 ASC NULLS LAST], preserve_partitioning=[true]",
 " DataSourceExec: file_groups={2 groups: [[Users/adriangb/GitHub/datafusion/data/1.parquet], [Users/adriangb/GitHub/datafusion/data/2.parquet]]}, projection=[deployment_environment, start_timestamp, trace_id], file_type=parquet, predicate=trace_id@0 = 00000000000000000000000000000002, pruning_predicate=trace_id_null_count@2 != row_count@3 AND trace_id_min@0 <= 00000000000000000000000000000002 AND 00000000000000000000000000000002 <= trace_id_max@1, required_guarantees=[trace_id in (00000000000000000000000000000002)]"]
does not satisfy order requirements: [start_timestamp@1 ASC NULLS LAST, trace_id@2 ASC NULLS LAST]. Child-0 order: [[start_timestamp@1 ASC NULLS LAST]]
*/
```

1. Filter pushdown would create equivalence properties indicating that `trace_id` is a constant.
2. Projection pushdown would create a new `DataSourceExec` that lost this cached information.
3. This caused `SanityCheckPlan` to fail because it could no longer determine that the ordering requirements were satisfied.
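To illustrate why losing the constant breaks the ordering check, here is a minimal toy model of the three steps above. It is a sketch under stated assumptions, not DataFusion's implementation: `ToyProperties`, `satisfies`, `project_losing_constants`, `project_preserving_constants`, and `filtered_scan_properties` are all hypothetical names invented for this example, and DataFusion's real equivalence properties are considerably richer.

```rust
use std::collections::HashSet;

/// Toy stand-in for a plan node's cached properties (hypothetical, NOT DataFusion's API).
#[derive(Clone, Debug)]
struct ToyProperties {
    /// Columns the output is known to be sorted by, most significant first.
    ordering: Vec<String>,
    /// Columns known to hold a single value, e.g. after `trace_id = '...'` is pushed down.
    constants: HashSet<String>,
}

impl ToyProperties {
    /// A required ordering is satisfied if, once constant columns are skipped,
    /// what remains is a prefix of the known ordering.
    fn satisfies(&self, required: &[&str]) -> bool {
        let needed: Vec<&str> = required
            .iter()
            .copied()
            .filter(|c| !self.constants.contains(*c))
            .collect();
        needed.len() <= self.ordering.len()
            && needed.iter().zip(&self.ordering).all(|(r, o)| *r == o.as_str())
    }
}

/// Keep the longest ordering prefix whose columns all survive the projection.
fn project_ordering(ordering: &[String], keep: &[&str]) -> Vec<String> {
    ordering
        .iter()
        .take_while(|c| keep.contains(&c.as_str()))
        .cloned()
        .collect()
}

/// Models the bug: the new node is rebuilt without the cached constants.
fn project_losing_constants(p: &ToyProperties, keep: &[&str]) -> ToyProperties {
    ToyProperties {
        ordering: project_ordering(&p.ordering, keep),
        constants: HashSet::new(),
    }
}

/// Models the fix: constants on projected columns are carried over.
fn project_preserving_constants(p: &ToyProperties, keep: &[&str]) -> ToyProperties {
    ToyProperties {
        ordering: project_ordering(&p.ordering, keep),
        constants: p
            .constants
            .iter()
            .filter(|c| keep.contains(&c.as_str()))
            .cloned()
            .collect(),
    }
}

/// Properties after the filter is pushed down and the data is sorted by
/// `start_timestamp` only: `trace_id` is recorded as a constant.
fn filtered_scan_properties() -> ToyProperties {
    ToyProperties {
        ordering: vec!["start_timestamp".to_string()],
        constants: HashSet::from(["trace_id".to_string()]),
    }
}
```

In this model, `filtered_scan_properties().satisfies(&["start_timestamp", "trace_id"])` holds because `trace_id` is skipped as a constant. After `project_losing_constants` the same requirement fails, which mirrors the `Child-0 order: [[start_timestamp@1 ASC NULLS LAST]]` mismatch in the error, while `project_preserving_constants` keeps it satisfied.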