alamb commented on code in PR #15469:
URL: https://github.com/apache/datafusion/pull/15469#discussion_r2020125627
##########
datafusion/physical-plan/src/sorts/sort.rs:
##########
@@ -397,16 +393,13 @@ impl ExternalSorter {
self.metrics.spill_metrics.spill_file_count.value()
}
- /// When calling, all `in_mem_batches` must be sorted (*), and then all of
them will
- /// be appended to the in-progress spill file.
- ///
- /// (*) 'Sorted' here means globally sorted for all buffered batches when
the
- /// memory limit is reached, instead of partially sorted within the batch.
- async fn spill_append(&mut self) -> Result<()> {
- assert!(self.in_mem_batches_sorted);
-
- // we could always get a chance to free some memory as long as we are
holding some
- if self.in_mem_batches.is_empty() {
+ /// Appending globally sorted batches to the in-progress spill file, and
clears
+ /// the `globally_sorted_batches` (also its memory reservation) afterwards.
+ async fn consume_and_spill_append(
+ &mut self,
+ globally_sorted_batches: &mut Vec<RecordBatch>,
Review Comment:
thank you for the name here -- much better
##########
datafusion/physical-plan/src/sorts/sort.rs:
##########
@@ -216,10 +216,8 @@ struct ExternalSorter {
// STATE BUFFERS:
// Fields that hold intermediate data during sorting
// ========================================================================
- /// Potentially unsorted in memory buffer
+ /// Unsorted input batches stored in the memory buffer
Review Comment:
Removing this field is an improvement 👍 thanks @2010YOUY01 and @comphead
Given there are still three fields whose relationship is tied, I wonder if
it would improve the code if we encoded that relationship in an actual rust
`enum` -- for example
```rust
enum SortState {
/// All data is in memory
Memory {
batches: Vec<RecordBatch>,
},
/// intermediate data is spilling to disk
Spilling {
batches: Vec<RecordBatch>,
in_progress_spill_file: InProgressSpillFile,
}
...
}
```
🤔
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]