Dandandan commented on code in PR #20500:
URL: https://github.com/apache/datafusion/pull/20500#discussion_r2994638863
##########
datafusion/physical-plan/src/repartition/mod.rs:
##########
@@ -617,8 +618,14 @@ impl BatchPartitioner {
batch.schema(),
columns,
&options,
- )
- .unwrap();
+ )?;
+
+ // When `StringViewArray`s are present, the `take_arrays` call above
+ // re-uses data buffers from the original array. This causes the memory
+ // pool to count the same data buffers multiple times, once for each
+ // consumer of the repartition.
+ // So we gc the output arrays, which creates new data buffers.
+ let batch = gc_stringview_arrays(batch)?;
Review Comment:
I think it would be best to use the coalesce kernels (which I think
already do the GC) before sending the batches to the upstream partitions.
I got mixed performance results from that before, but I think the
upcoming morsel / work-stealing changes might improve this (as there
would then be no benefit from pushing the copying work over to a new,
possibly idle, thread).
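
To make the double counting from the code comment concrete, here is a toy model of it using only `std`. Everything in it is a hypothetical stand-in: `ToyViewArray`, `take`, `gc` and `reported_bytes` are not arrow's `StringViewArray`, `take_arrays`, or the PR's `gc_stringview_arrays`, and the sketch ignores that real view arrays inline short strings. It only shows that a take-style selection keeps pointing at the original buffer, so every output partition gets charged for the full buffer, while a gc-style copy leaves each output with just the bytes it references.

```rust
use std::rc::Rc;

/// Toy stand-in for a string-view array (hypothetical, not arrow's type).
/// Each view is (buffer index, offset, length) into a shared data buffer.
struct ToyViewArray {
    buffers: Vec<Rc<Vec<u8>>>,
    views: Vec<(usize, usize, usize)>,
}

impl ToyViewArray {
    /// Packs all values into a single data buffer.
    fn from_strs(values: &[&str]) -> Self {
        let mut data = Vec::new();
        let mut views = Vec::new();
        for v in values {
            views.push((0, data.len(), v.len()));
            data.extend_from_slice(v.as_bytes());
        }
        Self { buffers: vec![Rc::new(data)], views }
    }

    /// Take-style selection: picks views but keeps the original buffers.
    fn take(&self, indices: &[usize]) -> Self {
        Self {
            buffers: self.buffers.clone(), // clones the Rc pointers, not the bytes
            views: indices.iter().map(|&i| self.views[i]).collect(),
        }
    }

    /// Gc-style compaction: copies only the referenced bytes into a new buffer.
    fn gc(&self) -> Self {
        let mut data = Vec::new();
        let mut views = Vec::new();
        for &(b, off, len) in &self.views {
            views.push((0, data.len(), len));
            data.extend_from_slice(&self.buffers[b][off..off + len]);
        }
        Self { buffers: vec![Rc::new(data)], views }
    }

    /// What a naive per-consumer accounting would charge:
    /// the full size of every buffer this array references.
    fn reported_bytes(&self) -> usize {
        self.buffers.iter().map(|b| b.len()).sum()
    }

    fn value(&self, i: usize) -> &str {
        let (b, off, len) = self.views[i];
        std::str::from_utf8(&self.buffers[b][off..off + len]).unwrap()
    }
}

fn main() {
    let input = ToyViewArray::from_strs(&["alpha", "beta", "gamma", "delta"]);

    // "Repartition" the input into two outputs.
    let parts = [input.take(&[0, 2]), input.take(&[1, 3])];

    // Without gc, both outputs are charged for the whole input buffer.
    let shared: usize = parts.iter().map(|p| p.reported_bytes()).sum();
    // With gc, each output is charged only for the bytes it references.
    let compacted: usize = parts.iter().map(|p| p.gc().reported_bytes()).sum();

    println!("first value of partition 1: {}", parts[1].value(0));
    println!(
        "input={} shared={} compacted={}",
        input.reported_bytes(),
        shared,
        compacted
    );
}
```

The copy in `gc` is the work in question: it has to happen somewhere, and the toy says nothing about which thread should pay for it.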
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.