suremarc opened a new issue, #8838: URL: https://github.com/apache/arrow-datafusion/issues/8838
### Describe the bug DataFusion is unable to eliminate multi-column sorts when the major column is a one-to-one and monotonic expression of a sorted input column: * In certain cases when the expression is not one-to-one, eliminating the sort is invalid (e.g. sorting by `floor(x), y` is not equivalent to sorting by `x, y`). * However, casting from `Int32` to `Int64` is one-to-one and monotonic, so DataFusion should be able to avoid sorting in such a case. See below. ### To Reproduce Data acquired from [`parquet_testing/data/delta_encoding_required_column_expect.csv`](https://github.com/apache/parquet-testing/blob/4cb3cff24c965fb329cdae763eabce47395a68a0/data/delta_encoding_required_column_expect.csv) (I couldn't get `datafusion-cli` to work with the parquet file for some reason). ```sql CREATE EXTERNAL TABLE delta_encoding_required_column ( c_customer_sk INT NOT NULL, c_current_cdemo_sk INT NOT NULL ) STORED AS CSV WITH HEADER ROW WITH ORDER ( c_customer_sk DESC, c_current_cdemo_sk DESC ) LOCATION 'delta_encoding_required_column_expect.csv'; ``` ```sql -- sort required here EXPLAIN SELECT CAST(c_customer_sk AS BIGINT) AS c_customer_sk_big, c_current_cdemo_sk FROM delta_encoding_required_column ORDER BY c_customer_sk_big DESC, c_current_cdemo_sk DESC; ``` Resulting physical plan has a `SortExec`: ``` +---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | plan_type | plan | +---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | logical_plan | Sort: c_customer_sk_big DESC NULLS FIRST, delta_encoding_required_column.c_current_cdemo_sk DESC NULLS FIRST | | | Projection: CAST(delta_encoding_required_column.c_customer_sk AS Int64) AS c_customer_sk_big, delta_encoding_required_column.c_current_cdemo_sk | | | TableScan: delta_encoding_required_column projection=[c_customer_sk, c_current_cdemo_sk] | | physical_plan | SortPreservingMergeExec: [c_customer_sk_big@0 DESC,c_current_cdemo_sk@1 DESC] | | | SortExec: expr=[c_customer_sk_big@0 DESC,c_current_cdemo_sk@1 DESC] | | | ProjectionExec: expr=[CAST(c_customer_sk@0 AS Int64) as c_customer_sk_big, c_current_cdemo_sk@1 as c_current_cdemo_sk] | | | RepartitionExec: partitioning=RoundRobinBatch(32), input_partitions=1 | | | CsvExec: file_groups={1 group: [[home/suremarc/rust-app-atlas/delta_encoding_required_column.csv]]}, projection=[c_customer_sk, c_current_cdemo_sk], output_ordering=[c_customer_sk@0 DESC, c_current_cdemo_sk@1 DESC], has_header=true | | | | +---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` ```sql -- sort NOT required here -- notice we have a single sorting column EXPLAIN SELECT CAST(c_customer_sk AS BIGINT) AS c_customer_sk_big FROM delta_encoding_required_column ORDER BY c_customer_sk_big DESC; ``` Resulting physical plan has no `SortExec`: ``` +---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | plan_type | plan | +---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | logical_plan | Sort: c_customer_sk_big DESC NULLS FIRST | | | Projection: CAST(delta_encoding_required_column.c_customer_sk AS Int64) AS c_customer_sk_big | | | TableScan: delta_encoding_required_column projection=[c_customer_sk] | | physical_plan | SortPreservingMergeExec: [c_customer_sk_big@0 DESC] | | | ProjectionExec: expr=[CAST(c_customer_sk@0 AS Int64) as c_customer_sk_big] | | | RepartitionExec: partitioning=RoundRobinBatch(32), input_partitions=1 | | | CsvExec: file_groups={1 group: [[home/suremarc/rust-app-atlas/delta_encoding_required_column.csv]]}, projection=[c_customer_sk], output_ordering=[c_customer_sk@0 DESC], has_header=true | | | | +---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` ### Expected behavior The first query provided above should not require a sort ### Additional context I encountered this bug when trying to re-cast the timezone of a table ordered by both `timestamp` and a secondary `ticker` column. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
