suremarc opened a new issue, #8838:
URL: https://github.com/apache/arrow-datafusion/issues/8838

   ### Describe the bug
   
   DataFusion is unable to eliminate multi-column sorts when the major column 
is a one-to-one and monotonic expression of a sorted input column:
   * In certain cases when the expression is not one-to-one, eliminating the 
sort is invalid (e.g. sorting by `floor(x), y` is not equivalent to sorting by 
`x, y`). 
   * However, casting from `Int32` to `Int64` is one-to-one and monotonic, so 
DataFusion should be able to avoid sorting in such a case. See below. 
   
   ### To Reproduce
   
   Data acquired from 
[`parquet_testing/data/delta_encoding_required_column_expect.csv`](https://github.com/apache/parquet-testing/blob/4cb3cff24c965fb329cdae763eabce47395a68a0/data/delta_encoding_required_column_expect.csv)
 (I couldn't get `datafusion-cli` to work with the parquet file for some 
reason).
   
   ```sql
   CREATE EXTERNAL TABLE delta_encoding_required_column (
       c_customer_sk INT NOT NULL,
       c_current_cdemo_sk INT NOT NULL
   )
   STORED AS CSV WITH HEADER ROW
   WITH ORDER (
       c_customer_sk DESC,
       c_current_cdemo_sk DESC
   )
   LOCATION 'delta_encoding_required_column_expect.csv';
   ```
   
   ```sql
   -- sort required here
   EXPLAIN
   SELECT 
       CAST(c_customer_sk AS BIGINT) AS c_customer_sk_big,
       c_current_cdemo_sk
   FROM delta_encoding_required_column
   ORDER BY c_customer_sk_big DESC, c_current_cdemo_sk DESC;
   ```
   Resulting physical plan has a `SortExec`:
   ```
   
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                       
                                                                                
                                                                                
                     |
   
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Sort: c_customer_sk_big DESC NULLS FIRST, 
delta_encoding_required_column.c_current_cdemo_sk DESC NULLS FIRST              
                                                                                
                                      |
   |               |   Projection: 
CAST(delta_encoding_required_column.c_customer_sk AS Int64) AS 
c_customer_sk_big, delta_encoding_required_column.c_current_cdemo_sk            
                                                                                
   |
   |               |     TableScan: delta_encoding_required_column 
projection=[c_customer_sk, c_current_cdemo_sk]                                  
                                                                                
                                  |
   | physical_plan | SortPreservingMergeExec: [c_customer_sk_big@0 
DESC,c_current_cdemo_sk@1 DESC]                                                 
                                                                                
                                  |
   |               |   SortExec: expr=[c_customer_sk_big@0 
DESC,c_current_cdemo_sk@1 DESC]                                                 
                                                                                
                                          |
   |               |     ProjectionExec: expr=[CAST(c_customer_sk@0 AS Int64) 
as c_customer_sk_big, c_current_cdemo_sk@1 as c_current_cdemo_sk]               
                                                                                
                       |
   |               |       RepartitionExec: partitioning=RoundRobinBatch(32), 
input_partitions=1                                                              
                                                                                
                       |
   |               |         CsvExec: file_groups={1 group: 
[[home/suremarc/rust-app-atlas/delta_encoding_required_column.csv]]}, 
projection=[c_customer_sk, c_current_cdemo_sk], 
output_ordering=[c_customer_sk@0 DESC, c_current_cdemo_sk@1 DESC], 
has_header=true |
   |               |                                                            
                                                                                
                                                                                
                     |
   
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   ```
   ```sql
   -- sort NOT required here
   -- notice we have a single sorting column
   EXPLAIN
   SELECT 
       CAST(c_customer_sk AS BIGINT) AS c_customer_sk_big
   FROM delta_encoding_required_column
   ORDER BY c_customer_sk_big DESC;
   ```
   Resulting physical plan has no `SortExec`:
   ```
   
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                       
                                                                                
                                                    |
   
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Sort: c_customer_sk_big DESC NULLS FIRST                   
                                                                                
                                                    |
   |               |   Projection: 
CAST(delta_encoding_required_column.c_customer_sk AS Int64) AS 
c_customer_sk_big                                                               
                                  |
   |               |     TableScan: delta_encoding_required_column 
projection=[c_customer_sk]                                                      
                                                                 |
   | physical_plan | SortPreservingMergeExec: [c_customer_sk_big@0 DESC]        
                                                                                
                                                    |
   |               |   ProjectionExec: expr=[CAST(c_customer_sk@0 AS Int64) as 
c_customer_sk_big]                                                              
                                                     |
   |               |     RepartitionExec: partitioning=RoundRobinBatch(32), 
input_partitions=1                                                              
                                                        |
   |               |       CsvExec: file_groups={1 group: 
[[home/suremarc/rust-app-atlas/delta_encoding_required_column.csv]]}, 
projection=[c_customer_sk], output_ordering=[c_customer_sk@0 DESC], 
has_header=true |
   |               |                                                            
                                                                                
                                                    |
   
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   ```
   
   
   ### Expected behavior
   
   The first query provided above should not require a sort
   
   ### Additional context
   
   I encountered this bug when trying to re-cast the timezone of a table 
ordered by both `timestamp` and a secondary `ticker` column. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to