2010YOUY01 commented on issue #15375:
URL: https://github.com/apache/datafusion/issues/15375#issuecomment-2747654525
This is a great observation, and the POC optimization has a high ROI. Here
are some additional thoughts:
This is just intuition, but I think the sorting phase should be faster than
the sort-preserving merge phase. Because the sorting implementation is much
simpler, it can rely on the existing optimized arrow row format and also quick
sort implementation from the standard library. On the contrary, merging phase
we have an in-house implementation of a loser tree heap, I think it's a bit
complex so maybe it is also hard to optimize manually.
As a result, perhaps it's preferred to put more work to be done during
sorting, and leave minimal work to be done during merging (however, merging is
still needed inside partial sort, because it can keep all the cores busy when
partial sort is all finished)
## Example
There is a sort query to run in 4 partitions, and each partition will
process 100 input batches
- Current implementation:
- Inside partial sort, each batch will be sorted, and a 100-way merge
will be contructed.
- In the final stage, a 4-way merge will produce the final result.
- Putting most work in sort strategy: (we can control the partial sort and
final sort to have the same merge degree, so they can process data at around
the same speed)
- Inside partial sort, sort 25 batches into a single sorted run at once,
and after that do a 4-way merge as the output of partial sort
- In the final stage, there is the same 4-way merge to produce the final
result.
## Implementation
The POC will copy the batches with `cocnat()` than do a bigger sort, an
alternative to try to avoid copies is: first sort all elements' indices
(2-level index consists of `(batch_idx, row_idx)`), and get a permutation array.
Use the interleave kernel to construct the final result
https://docs.rs/arrow/latest/arrow/compute/kernels/interleave/fn.interleave.html
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]