rluvaton commented on code in PR #15610:
URL: https://github.com/apache/datafusion/pull/15610#discussion_r2041170126
##########
datafusion/common/src/config.rs:
##########
@@ -337,6 +337,13 @@ config_namespace! {
/// batches and merged.
pub sort_in_place_threshold_bytes: usize, default = 1024 * 1024
+ /// When doing external sorting, the maximum number of spilled files to
+ /// read back at once. Those read files in the same merge step will be
+ /// sort-preserving-merged and re-spilled, and the step will be repeated
+ /// to reduce the number of spilled files in multiple passes, until a
+ /// final sorted run can be produced.
+ pub sort_max_spill_merge_degree: usize, default = 16
Review Comment:
I have a concern about this: there can still be a memory issue if the
batches read from each stream together exceed the memory limit.
I have an implementation for this that is completely memory safe and will
try to create a PR for it, for inspiration.
The way to decide on the degree is actually to store, for each spill file,
the largest amount of memory a single record batch has taken, and then,
when deciding on the degree, simply grow it until you can no longer stay
within the memory limit.
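A minimal sketch of the idea, assuming each spill file tracks the peak memory size of any single batch it contains (the `SpillFile` struct and `choose_merge_degree` function here are hypothetical, not DataFusion's actual API):

```rust
/// Hypothetical per-spill-file metadata: the largest memory footprint
/// (in bytes) of any one record batch written to this file.
struct SpillFile {
    max_batch_bytes: usize,
}

/// Grow the merge degree file by file, stopping as soon as adding the
/// next file's largest batch would exceed the memory budget. This bounds
/// the worst-case memory of holding one batch per merged stream.
fn choose_merge_degree(files: &[SpillFile], memory_budget_bytes: usize) -> usize {
    let mut used = 0usize;
    let mut degree = 0usize;
    for file in files {
        match used.checked_add(file.max_batch_bytes) {
            Some(total) if total <= memory_budget_bytes => {
                used = total;
                degree += 1;
            }
            // Growing further would exceed the budget (or overflow).
            _ => break,
        }
    }
    degree
}
```

A real implementation would also need to guarantee a degree of at least 2 so each merge pass makes progress, spilling or erroring otherwise.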
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]