2010YOUY01 commented on code in PR #17429:
URL: https://github.com/apache/datafusion/pull/17429#discussion_r2324056495


##########
datafusion/physical-plan/src/joins/sort_merge_join/stream.rs:
##########
@@ -185,11 +185,12 @@ impl StreamedBatch {
 }
 
 /// A buffered batch that contains contiguous rows with same join key
+///
+/// `BufferedBatch` can exist as either an in-memory `RecordBatch` or a 
`RefCountedTempFile` on disk.
 #[derive(Debug)]
 pub(super) struct BufferedBatch {
-    /// The buffered record batch
-    /// None if the batch spilled to disk th
-    pub batch: Option<RecordBatch>,
+    /// Represents in memory or spilled record batch
+    pub batch: BufferedBatchState,

Review Comment:
   I think this `BufferedBatchState` enum should include all intermediate data 
into the states, for example `join_arrays`. When the main batch is spilled, the 
join key array should either be also spilled or throw away, otherwise it will 
be leaking memory.
   
   Perhaps we can leave a TODO in the comment now? This will be a big change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to