ding-young commented on code in PR #17315:
URL: https://github.com/apache/datafusion/pull/17315#discussion_r2312291785
##########
datafusion/physical-plan/src/spill/spill_manager.rs:
##########
@@ -194,7 +195,84 @@ impl GetSlicedSize for RecordBatch {
for array in self.columns() {
let data = array.to_data();
total += data.get_slice_memory_size()?;
+
+ // While StringViewArray holds large data buffer for non inlined
string, the Arrow layout (BufferSpec)
+ // does not include any data buffers. Currently,
ArrayData::get_slice_memory_size()
+ // under-counts memory size by accounting only views buffer
although data buffer is cloned during slice()
+ //
+ // Therefore, we manually add the sum of the lengths used by all
non inlined views
+ // on top of the sliced size for views buffer. This matches the
intended semantics of
+ // "bytes needed if we materialized exactly this slice into fresh
buffers".
+ // Note: if multiple arrays share the same data buffers, we may
double count each StringViewArray.
Review Comment:
I think it would not be an issue for most cases, so I only updated the
comment.
> IIRC, now we always `gc()` on StringView columns before spilling
However it seems like we don't `gc()` on StringView if it spills on multi
level merge. Maybe we should find out whether we need to `gc` for specific
cases (i.e. when in-memory streams are involved) or update this calculation
logic similar to `get_record_batch_memory_size`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]