rluvaton commented on code in PR #15700:
URL: https://github.com/apache/datafusion/pull/15700#discussion_r2213289055
##########
datafusion/physical-plan/src/aggregates/row_hash.rs:
##########
@@ -1067,14 +1074,13 @@ impl GroupedHashAggregateStream {
sort_batch(&batch, &expr, None)
})),
)));
- for spill in self.spill_state.spills.drain(..) {
- let stream = self.spill_state.spill_manager.read_spill_as_stream(spill)?;
- streams.push(stream);
- }
+
self.spill_state.is_stream_merging = true;
self.input = StreamingMergeBuilder::new()
.with_streams(streams)
.with_schema(schema)
+ .with_spill_manager(self.spill_state.spill_manager.clone())
+ .with_sorted_spill_files(std::mem::take(&mut self.spill_state.spills))
Review Comment:
> The issue is, without special handling, it's possible that in-mem batches will take most of the available memory budget, and leave only a very small memory part for multi-pass spilling to continue. This can cause slowdowns or even prevent some cases from finishing.
Actually, this case is already handled. Say I have 4 in-memory streams and 5 spill files, and there isn't enough memory to merge them together with any spill file: we merge the in-memory streams, spill the result to disk, and the next iteration behaves as if we had started with 6 spill files.
This is the same behavior as the previous spilling code, just optimized.
See:
https://github.com/rluvaton/datafusion/blob/af6b5c52afe316127cd007cd8464d07c3038a0f9/datafusion/physical-plan/src/sorts/multi_level_merge.rs#L178-L203
We first `take` the `in_memory_streams` (and never add them back):
https://github.com/rluvaton/datafusion/blob/af6b5c52afe316127cd007cd8464d07c3038a0f9/datafusion/physical-plan/src/sorts/multi_level_merge.rs#L180
Then we try to pick spill files to merge, but the minimum number of spill files required is only `2 - in_memory_stream.len()`:
https://github.com/rluvaton/datafusion/blob/af6b5c52afe316127cd007cd8464d07c3038a0f9/datafusion/physical-plan/src/sorts/multi_level_merge.rs#L185-L186
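To make the control flow concrete, here is a minimal sketch in plain Rust. Every name in it (`InMemStream`, `Spill`, `merge_inputs`, `spills_fitting_in_budget`, `multi_pass_merge`) is a made-up stand-in rather than the real types and functions in `multi_level_merge.rs`; it only illustrates the loop described above: the in-memory streams are taken exactly once, each pass pulls in as many spill files as the memory budget allows (with a minimum of `2 - in_memory_streams.len()`), and when none fit on the first pass the in-memory streams are merged alone and spilled, so the next pass sees only spill files.

```rust
// Illustration only: stand-in types, not the real DataFusion ones.
struct InMemStream; // a sorted in-memory stream
struct Spill;       // a sorted spill file on disk

// Hypothetical helper: merge the inputs and (if this is not the final
// pass) write the result back to disk as a new spill file.
fn merge_inputs(_in_mem: Vec<InMemStream>, _spills: Vec<Spill>) -> Spill {
    Spill
}

fn multi_pass_merge(
    mut in_mem: Vec<InMemStream>,
    mut spills: Vec<Spill>,
    // Hypothetical memory model: how many spill files fit in the budget.
    mut spills_fitting_in_budget: impl FnMut(&[Spill]) -> usize,
) -> Result<Spill, String> {
    loop {
        // The in-memory streams are taken once and never added back,
        // so from the second pass on this is empty.
        let in_mem_now = std::mem::take(&mut in_mem);

        // A merge needs at least two inputs, so the minimum number of
        // spill files we must be able to read is `2 - in_mem_now.len()`.
        let min_spills = 2usize
            .saturating_sub(in_mem_now.len())
            .min(spills.len());
        let fitting = spills_fitting_in_budget(&spills).min(spills.len());

        if fitting < min_spills {
            // Stand-in for however the real code handles "cannot even
            // read the minimum number of spill files".
            return Err("not enough memory to merge two inputs".into());
        }

        // Merge the in-memory streams (first pass only) with as many
        // spill files as fit. In the "4 in-memory streams + 5 spill
        // files, no memory for any spill file" case, `fitting == 0`,
        // so the 4 streams are merged on their own.
        let picked: Vec<Spill> = spills.drain(..fitting).collect();
        let merged = merge_inputs(in_mem_now, picked);

        if spills.is_empty() {
            // Nothing left to merge with: this is the final output.
            return Ok(merged);
        }

        // Otherwise the merged result is spilled and the next pass
        // treats it as just another spill file (5 + 1 = 6 spills).
        spills.push(merged);
    }
}

fn main() {
    // The scenario from above: 4 in-memory streams and 5 spill files,
    // no memory for any spill file on the first pass, some memory later.
    let mut first_pass = true;
    let result = multi_pass_merge(
        vec![InMemStream, InMemStream, InMemStream, InMemStream],
        vec![Spill, Spill, Spill, Spill, Spill],
        move |_| {
            if first_pass {
                first_pass = false;
                0 // the 4 in-memory streams get merged alone and spilled
            } else {
                2 // later passes can read two spill files at a time
            }
        },
    );
    assert!(result.is_ok());
}
```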