zhuqi-lucas commented on PR #15380:
URL: https://github.com/apache/datafusion/pull/15380#issuecomment-2804582604
Thank you @2010YOUY01 @Dandandan, this is very interesting. I am thinking:
1. Since the sum of all batch sizes is fixed, we can first calculate the compute
size of each partition; call it partition_cal_size.
2. Then we set a min_sort_size and a max_sort_size, and use them to determine
the final_merged_batch_size:
```rust
// Clamp the per-partition size into [min_sort_size, max_sort_size].
let final_merged_batch_size = if partition_cal_size < min_sort_size {
    min_sort_size
} else if partition_cal_size > max_sort_size {
    max_sort_size
} else {
    partition_cal_size
};
```
This prevents creating too many small batches (which can fragment merge
tasks) or overly large batches.
This looks like a first version of the heuristic.
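
To make the two steps above concrete, here is a minimal sketch. It assumes partition_cal_size is simply the total batch size divided by the number of partitions; that formula, the function name, and the parameter names are illustrative assumptions on my side, not existing DataFusion APIs.

```rust
/// Hypothetical helper (not a DataFusion API): derive the merged batch size
/// for one partition and clamp it into [min_sort_size, max_sort_size].
fn final_merged_batch_size(
    total_batch_size: usize,
    partition_count: usize,
    min_sort_size: usize,
    max_sort_size: usize,
) -> usize {
    assert!(min_sort_size <= max_sort_size, "min must not exceed max");
    // Assumption: every partition gets an equal share of the fixed total.
    // `max(1)` guards against a division by zero for an empty plan.
    let partition_cal_size = total_batch_size / partition_count.max(1);
    // `clamp` is equivalent to the if / else-if / else chain.
    partition_cal_size.clamp(min_sort_size, max_sort_size)
}
```

For example, with a total of 80_000, 10 partitions, and bounds of 1_024 and 8_192, the computed size 8_000 is inside the bounds and is used unchanged; with 100 partitions the computed 800 would be raised to 1_024.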
> > > I think for `ExternalSorter` we don't want any additional parallelism
> > > as the sort is already executed per partition (so additional parallelism is
> > > likely to hurt rather than help).
> >
> >
> > In this case, the final merging might become the bottleneck, because SPM
> > does not have internal parallelism either; during the final merge only 1 core
> > is busy. I think 2 stages of sort-preserving merge are still needed, because
> > `ExternalSorter` is blocking but `SPM` is not, so this setup can keep all the
> > cores busy after the partial sort is finished. We just have to ensure they
> > don't have a merge degree so large that it becomes slow (with optimizations
> > like this PR).
>
> Yes, to be clear, I don't argue to remove SortPreservingMergeExec or
> sorting in two phases altogether or something similar; I was just reacting to
> the idea of adding more parallelism in `in_mem_sort_stream`, which probably
> won't help much.
>
> ```
> SortPreserveMergeExec <= Does k-way merging based on input streams, with minimal memory overhead, maximizing input parallelism
> SortExec partitions[1,2,3,4,5,6,7,8,9,10] <= Performs in memory *sorting* if possible, for each input partition in parallel, only resorting to spill/merge when does not fit into memory
> ```
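
For readers less familiar with the k-way merging mentioned in the quoted plan, here is a small sketch over in-memory sorted runs using a min-heap. It only illustrates the general technique; it is not how DataFusion implements the merge operator (which works on streams of record batches), and the function name and the `i64` element type are assumptions for the example.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Illustrative k-way merge: combine already-sorted runs into one sorted
/// vector. Each heap entry is (value, run index, offset within that run).
fn k_way_merge(runs: &[Vec<i64>]) -> Vec<i64> {
    let mut heap = BinaryHeap::new();
    // Seed the heap with the first element of every non-empty run.
    for (run_idx, run) in runs.iter().enumerate() {
        if let Some(&value) = run.first() {
            // `Reverse` turns the max-heap into a min-heap.
            heap.push(Reverse((value, run_idx, 0usize)));
        }
    }
    let mut merged = Vec::with_capacity(runs.iter().map(Vec::len).sum());
    // Repeatedly take the smallest head and refill from the same run.
    while let Some(Reverse((value, run_idx, offset))) = heap.pop() {
        merged.push(value);
        if let Some(&next) = runs[run_idx].get(offset + 1) {
            heap.push(Reverse((next, run_idx, offset + 1)));
        }
    }
    merged
}
```

Each output row costs one heap pop and at most one push, so the per-row work grows with the logarithm of the number of runs and stays on a single thread, which is why a very large merge degree in the final merge can become the slow part.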