friendlymatthew opened a new pull request, #17129:
URL: https://github.com/apache/datafusion/pull/17129

   ## Which issue does this PR close?
   
   - Closes https://github.com/apache/datafusion/issues/17077
   
   ## Rationale for this change
   
   This PR modifies `DataSourceExec::try_swapping_with_projection` to preserve 
equivalence properties when creating a new `DataSourceExec`.
   
   DataFusion was losing equivalence properties when projection pushdown 
ran after filter pushdown. 
   
   Consider this example: 
   ```sql
   COPY (
       SELECT
           '00000000000000000000000000000001' AS trace_id,
           '2023-10-01 00:00:00'::timestamptz AS start_timestamp,
           'prod' as deployment_environment
   )
   TO 'data/1.parquet';
   
   COPY (
       SELECT
           '00000000000000000000000000000002' AS trace_id,
           '2024-10-01 00:00:00'::timestamptz AS start_timestamp,
           'staging' as deployment_environment
   )
   TO 'data/2.parquet';
   
   CREATE EXTERNAL TABLE t1 STORED AS PARQUET LOCATION 'data/';
   
   SET datafusion.execution.parquet.pushdown_filters = true;
   
   SELECT deployment_environment
   FROM t1
   WHERE trace_id = '00000000000000000000000000000002'
   ORDER BY start_timestamp, trace_id;
   
   /*
   SanityCheckPlan
   caused by
   Error during planning: 
   Plan: ["SortPreservingMergeExec: [start_timestamp@1 ASC NULLS LAST, 
trace_id@2 ASC NULLS LAST]", "  SortExec: expr=[start_timestamp@1 ASC NULLS 
LAST], preserve_partitioning=[true]", "    DataSourceExec: file_groups={2 
groups: [[Users/adriangb/GitHub/datafusion/data/1.parquet], 
[Users/adriangb/GitHub/datafusion/data/2.parquet]]}, 
projection=[deployment_environment, start_timestamp, trace_id], 
file_type=parquet, predicate=trace_id@0 = 00000000000000000000000000000002, 
pruning_predicate=trace_id_null_count@2 != row_count@3 AND trace_id_min@0 <= 
00000000000000000000000000000002 AND 00000000000000000000000000000002 <= 
trace_id_max@1, required_guarantees=[trace_id in 
(00000000000000000000000000000002)]"] does not satisfy order requirements: 
[start_timestamp@1 ASC NULLS LAST, trace_id@2 ASC NULLS LAST]. Child-0 order: 
[[start_timestamp@1 ASC NULLS LAST]]
   */
   ```
   
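   The failure unfolds in three steps:

   1. Filter pushdown creates equivalence properties recording that `trace_id` 
is a constant, since the predicate pins it to a single value.
   2. Projection pushdown then creates a new `DataSourceExec`, which drops this 
cached information.
   3. Without the constant, `SanityCheckPlan` fails: the child is ordered only by 
`start_timestamp`, and the check can no longer tell that this also satisfies 
`[start_timestamp, trace_id]`.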
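
To illustrate why the lost constant matters, here is a minimal, self-contained sketch of the ordering check. It is a toy model, not DataFusion's actual `EquivalenceProperties` API: `ToyProperties`, `satisfies`, and `pr_example` are hypothetical names, and sort direction and null ordering are ignored. The only idea it models is that a column known to hold a single value can be skipped when comparing orderings.

```rust
use std::collections::HashSet;

/// Toy stand-in for a plan node's equivalence properties (hypothetical type,
/// not DataFusion's): one known output ordering plus the set of columns
/// known to hold a single constant value.
struct ToyProperties {
    ordering: Vec<&'static str>,
    constants: HashSet<&'static str>,
}

impl ToyProperties {
    /// Returns true if data sorted by `self.ordering` is also sorted by `required`.
    /// Constant columns are skipped on both sides, because sorting by a column
    /// that holds a single value cannot change the row order.
    fn satisfies(&self, required: &[&'static str]) -> bool {
        let mut existing = self
            .ordering
            .iter()
            .filter(|c| !self.constants.contains(*c));
        required
            .iter()
            .filter(|c| !self.constants.contains(*c))
            .all(|c| existing.next() == Some(c))
    }
}

/// Mirrors the query above: the child is sorted by `start_timestamp`, while the
/// parent requires `[start_timestamp, trace_id]`.
/// Returns (result with the constant kept, result with the constant lost).
fn pr_example() -> (bool, bool) {
    let required = ["start_timestamp", "trace_id"];

    // After filter pushdown: `trace_id` is known to be constant.
    let with_constant = ToyProperties {
        ordering: vec!["start_timestamp"],
        constants: HashSet::from(["trace_id"]),
    };

    // After projection pushdown rebuilt the node without that knowledge.
    let constant_lost = ToyProperties {
        ordering: vec!["start_timestamp"],
        constants: HashSet::new(),
    };

    (
        with_constant.satisfies(&required),
        constant_lost.satisfies(&required),
    )
}
```

In this model, `pr_example()` reports the requirement as satisfied while the constant is known and as unsatisfied once it is dropped, which corresponds to the `SanityCheckPlan` error shown above.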


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

