rgehan opened a new issue, #18380:
URL: https://github.com/apache/datafusion/issues/18380

   ### Is your feature request related to a problem or challenge?
   
   When you have a `UNION` over mostly sorted inputs and explicitly add sorts 
to the unsorted ones, the `enforce_sorting` optimizer removes those targeted 
sorts and moves the sort to the top level instead.
   
   Given a plan like:
   ```
   UnionExec
     ParquetExec (already sorted by col1, col2)
     SortExec (col1, col2)
       ParquetExec (unsorted)
     ParquetExec (already sorted by col1, col2)
   ```
   
   The optimizer produces:
   ```
   SortExec (col1, col2)
     UnionExec
       ParquetExec (sorted)
       ParquetExec (unsorted)
       ParquetExec (sorted)
   ```
   
   This re-sorts all of the data instead of just the unsorted partition, which 
prevents the use of streaming operators (e.g. `SortPreservingMergeExec`) 
and significantly increases memory usage and spilling.
   
   This turns what should be a small parallel sort into a memory-intensive / 
spilling sort of the entire dataset.
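
To make the cost difference concrete, here is a minimal standalone sketch in plain Rust. It is not DataFusion code: rows are modeled as `(col1, col2)` tuples, and the function names (`merge_sorted`, `sort_unsorted_then_merge`, `union_then_sort`) are hypothetical. It only illustrates that sorting the one unsorted input and then merging yields the same result as sorting everything, while the merge step keeps a single candidate row per input in its heap.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// K-way merge of inputs that are each already sorted.
/// The heap holds at most one candidate row per input.
fn merge_sorted(inputs: Vec<Vec<(i32, i32)>>) -> Vec<(i32, i32)> {
    let mut iters: Vec<_> = inputs.into_iter().map(|v| v.into_iter()).collect();
    let mut heap = BinaryHeap::new();
    // Seed the heap with the first row of every non-empty input.
    for (idx, it) in iters.iter_mut().enumerate() {
        if let Some(row) = it.next() {
            heap.push(Reverse((row, idx)));
        }
    }
    let mut out = Vec::new();
    // Pop the smallest row, then refill from the input it came from.
    while let Some(Reverse((row, idx))) = heap.pop() {
        out.push(row);
        if let Some(next) = iters[idx].next() {
            heap.push(Reverse((next, idx)));
        }
    }
    out
}

/// Targeted strategy: sort only the inputs flagged as unsorted, then merge.
/// The `bool` in each pair is an assumed "already sorted" flag.
fn sort_unsorted_then_merge(inputs: Vec<(Vec<(i32, i32)>, bool)>) -> Vec<(i32, i32)> {
    let prepared: Vec<Vec<(i32, i32)>> = inputs
        .into_iter()
        .map(|(mut rows, already_sorted)| {
            if !already_sorted {
                rows.sort();
            }
            rows
        })
        .collect();
    merge_sorted(prepared)
}

/// Top-level strategy: concatenate every input and sort the whole thing.
fn union_then_sort(inputs: Vec<(Vec<(i32, i32)>, bool)>) -> Vec<(i32, i32)> {
    let mut all: Vec<(i32, i32)> = inputs.into_iter().flat_map(|(rows, _)| rows).collect();
    all.sort();
    all
}
```

Both functions return the same rows in the same order; the difference is that the targeted strategy only ever sorts the rows of the unsorted input.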
   
   
   ### Describe the solution you'd like
   
   Sorts below a `UnionExec` should be preferred over a top-level sort.
   
   In #9867, @NGA-TRAN proposed explicitly implementing 
`required_input_ordering` in `UnionExec`, which seems to fix the reproducer 
tests I added in #18352. However, it breaks other unit tests.
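
As a rough illustration of the idea (not the actual change in #9867, and not DataFusion's `ExecutionPlan` API), the toy model below assumes a union passes an ordering requirement down to each child, so that only children lacking the ordering get a sort. The `Plan` enum, `is_sorted`, and `enforce_ordering` are hypothetical names made up for this sketch, and a single implicit sort key stands in for `(col1, col2)`.

```rust
/// Toy plan node; a stand-in for illustration, NOT DataFusion's `ExecutionPlan`.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan { name: &'static str, sorted: bool },
    Sort(Box<Plan>),
    Union(Vec<Plan>),
}

/// Whether the node's output is ordered on the single, implicit sort key.
fn is_sorted(plan: &Plan) -> bool {
    match plan {
        Plan::Scan { sorted, .. } => *sorted,
        Plan::Sort(_) => true,
        // Assumption: a union counts as ordered when every child is ordered,
        // which is enough for a merge placed above it.
        Plan::Union(children) => children.iter().all(is_sorted),
    }
}

/// Satisfy an ordering requirement by pushing it through unions:
/// children that already provide the ordering are left untouched,
/// and only the remaining ones are wrapped in a sort.
fn enforce_ordering(plan: Plan) -> Plan {
    match plan {
        Plan::Union(children) => {
            Plan::Union(children.into_iter().map(enforce_ordering).collect())
        }
        other if is_sorted(&other) => other,
        other => Plan::Sort(Box::new(other)),
    }
}
```

Applied to a union of a sorted scan, an unsorted scan, and another sorted scan, this model wraps only the middle child in a sort, which is the plan shape this issue asks for.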
   
   ### Describe alternatives you've considered
   
   - Pre-sorting all data before feeding it to `datafusion`
   - Implementing a custom sort operator that wouldn't get optimized out
   
   While these are viable workarounds, they are not ideal, and I believe 
`datafusion` should be able to handle this case.
   
   
   ### Additional context
   
   Reproducer tests in PR #18352.
   
   Related to issue #9898 and its corresponding PR #9867.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]