kosiew commented on code in PR #20500:
URL: https://github.com/apache/datafusion/pull/20500#discussion_r2944875606
##########
datafusion/physical-plan/src/repartition/mod.rs:
##########
@@ -2602,6 +2642,31 @@ mod tests {
Ok(())
}
+
+ #[tokio::test]
+ async fn hash_repartition_string_view_compaction() -> Result<()> {
Review Comment:
This test does not actually exercise the regression it is meant to cover:
it only checks that `repartition` returns all rows, which would also pass
before the `gc()` change was introduced.
As a result, we still have no test that would catch the over-counting bug
if this logic regresses.
Please add an assertion that observes the compaction or accounting behavior
directly. For example:
* Check that the total `get_array_memory_size()` across the repartitioned
outputs stays close to that of the original batch, instead of scaling with
the number of output partitions.
* Test spill behavior under a tight memory limit (e.g., assert on the
spilled byte count).
* Verify `StringViewArray` buffer ownership after repartition, so that the
outputs no longer all retain the original shared payload buffer.
##########
datafusion/physical-plan/src/repartition/mod.rs:
##########
@@ -611,11 +611,24 @@ impl BatchPartitioner {
                let columns =
                    take_arrays(batch.columns(), &indices_array, None)?;
+               let new_columns = columns
+                   .into_iter()
+                   .map(|col| {
+                       if let Some(sv) =
+                           col.as_any().downcast_ref::<StringViewArray>()
+                       {
+                           Arc::new(sv.gc())
Review Comment:
The new `StringViewArray::gc()` pass in repartition is very similar to the
existing `organize_stringview_arrays` logic in
`datafusion/physical-plan/src/sorts/sort.rs`.
A small shared helper would keep the workaround in one place and reduce the
chance that sort/repartition diverge when Arrow view handling changes again.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]