alamb opened a new issue, #16899:
URL: https://github.com/apache/datafusion/issues/16899

   ### Describe the bug
   
   When data is sorted on a prefix, but not all, of the input columns I expect 
DataFusion to use the faster / more memory efficient operator 
`PartialSortExec`: 
https://github.com/apache/datafusion/blob/07516aa4a369aceb2ef2438fa81e5a87691f76a6/datafusion/physical-plan/src/sorts/partial_sort.rs#L80
   
   However, this does not appear to be happening
   
   ### To Reproduce
   
   ```
   > copy (values(1, 'a'), (2,'b'), (3,'d'), (4,'a')) to '/tmp/order.csv';
   +-------+
   | count |
   +-------+
   | 4     |
   +-------+
   1 row(s) fetched.
   Elapsed 0.005 seconds.
   
   
   > create external table order stored as csv location '/tmp/order.csv' with 
order (column1 asc);
   0 row(s) fetched.
   Elapsed 0.002 seconds.
   ```
   
   When ordering by just `column1` (which is the declared table order)  we can 
see the plan correctly avoids sorting 🎉 
   
   ```sql
   > explain select * from order ORDER BY column1;
   +---------------+-------------------------------+
   | plan_type     | plan                          |
   +---------------+-------------------------------+
   | physical_plan | ┌───────────────────────────┐ |
   |               | │       DataSourceExec      │ |
   |               | │    --------------------   │ |
   |               | │          files: 1         │ |
   |               | │        format: csv        │ |
   |               | └───────────────────────────┘ |
   |               |                               |
   +---------------+-------------------------------+
   1 row(s) fetched.
   Elapsed 0.009 seconds.
   ```
   
   However, when ordering by `column1, column2` (the table is a prefix of the 
declared table order)  we can see the plan uses a SortExec (which resorts the 
entire input, rather than just sorting within batches):
   
   ```sql
   > explain select * from order ORDER BY column1, column2;
   +---------------+-------------------------------+
   | plan_type     | plan                          |
   +---------------+-------------------------------+
   | physical_plan | ┌───────────────────────────┐ |
   |               | │          SortExec         │ |
   |               | │    --------------------   │ |
   |               | │ column1@0 ASC NULLS LAST, │ |
   |               | │  column2@1 ASC NULLS LAST │ |
   |               | └─────────────┬─────────────┘ |
   |               | ┌─────────────┴─────────────┐ |
   |               | │       DataSourceExec      │ |
   |               | │    --------------------   │ |
   |               | │          files: 1         │ |
   |               | │        format: csv        │ |
   |               | └───────────────────────────┘ |
   |               |                               |
   +---------------+-------------------------------+
   ```
   
   
   ### Expected behavior
   
   I expect  the existing sort order to be used
   
   Specifically, in the query above I expect that `PartialSortExec` is used 
instead:
   
   ```sql
   > explain select * from order ORDER BY column1, column2;
   ```
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to