rgehan opened a new issue, #18380:
URL: https://github.com/apache/datafusion/issues/18380
### Is your feature request related to a problem or challenge?
When you have a `UNION` over mostly sorted inputs and explicitly add sorts
to the unsorted ones, the `enforce_sorting` optimizer removes those targeted
sorts and moves the sort to the top level instead.
Given a plan like:
```
UnionExec
  ParquetExec (already sorted by col1, col2)
  SortExec (col1, col2)
    ParquetExec (unsorted)
  ParquetExec (already sorted by col1, col2)
```
The optimizer produces:
```
SortExec (col1, col2)
  UnionExec
    ParquetExec (sorted)
    ParquetExec (unsorted)
    ParquetExec (sorted)
```
This re-sorts all of the data instead of just the unsorted partition, which
prevents the use of streaming operators (e.g. `SortPreservingMergeExec`) and
significantly increases memory usage and spilling.
It turns what should be a small parallel sort into a memory-intensive,
spilling sort of the entire dataset.
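
To make the difference concrete, here is a minimal, standalone sketch of the two strategies. It is a toy model and not DataFusion code: `Row`, `sort_one_input`, `merge_sorted`, and `sort_everything` are hypothetical names, and rows are plain `(col1, col2)` tuples. The first strategy sorts only the unsorted input and then does a streaming k-way merge, roughly what a sort-preserving merge does conceptually. The second concatenates everything and sorts it all, which is what a top-level sort amounts to.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical stand-in for a row keyed by (col1, col2).
type Row = (i32, i32);

/// Sorts a single input; models a sort placed under the union,
/// applied only to the input that is not already ordered.
fn sort_one_input(mut rows: Vec<Row>) -> Vec<Row> {
    rows.sort();
    rows
}

/// K-way merge of inputs that are each already sorted.
/// The heap holds at most one pending row per input, so the merge
/// itself never needs to buffer a whole input.
fn merge_sorted(inputs: Vec<Vec<Row>>) -> Vec<Row> {
    let mut iters: Vec<_> = inputs.into_iter().map(|v| v.into_iter()).collect();
    let mut heap = BinaryHeap::new();

    // Seed the heap with the first row of every non-empty input.
    for (idx, it) in iters.iter_mut().enumerate() {
        if let Some(row) = it.next() {
            heap.push(Reverse((row, idx)));
        }
    }

    let mut out = Vec::new();
    // Pop the smallest pending row, then refill from the same input.
    while let Some(Reverse((row, idx))) = heap.pop() {
        out.push(row);
        if let Some(next) = iters[idx].next() {
            heap.push(Reverse((next, idx)));
        }
    }
    out
}

/// Models the top-level sort: gather every row, then sort them all.
fn sort_everything(inputs: Vec<Vec<Row>>) -> Vec<Row> {
    let mut all: Vec<Row> = inputs.into_iter().flatten().collect();
    all.sort();
    all
}
```

Calling `merge_sorted(vec![a, sort_one_input(b), c])` with sorted `a` and `c` yields the same rows as `sort_everything(vec![a, b, c])`, but only `b` is ever sorted as a whole, and the merge step keeps one pending row per input.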
### Describe the solution you'd like
Sorts below a `UnionExec` should be preferred over a top-level sort.
In #9867, @NGA-TRAN proposed explicitly implementing
`required_input_ordering` in `UnionExec`, which seems to fix the reproducer
tests I added in #18352. However, it breaks other unit tests.
### Describe alternatives you've considered
- Pre-sorting all data before feeding it to `datafusion`
- Implementing a custom sort operator that wouldn't get optimized out
While these are viable workarounds, they are not ideal, and I believe
`datafusion` should be able to handle this case.
### Additional context
Reproducer tests in PR #18352.
Related to issue #9898 and its corresponding PR #9867.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]