Re: [PR] fix: Incorrect memory accounting in `array_agg` function [datafusion]

2025-09-08 Thread via GitHub


github-actions[bot] closed pull request #16519: fix: Incorrect memory 
accounting in `array_agg` function
URL: https://github.com/apache/datafusion/pull/16519


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: Incorrect memory accounting in `array_agg` function [datafusion]

2025-08-29 Thread via GitHub


github-actions[bot] commented on PR #16519:
URL: https://github.com/apache/datafusion/pull/16519#issuecomment-3238855651

   Thank you for your contribution. Unfortunately, this pull request is stale 
because it has been open 60 days with no activity. Please remove the stale 
label or comment or this will be closed in 7 days.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: Incorrect memory accounting in `array_agg` function [datafusion]

2025-06-30 Thread via GitHub


sfluor commented on code in PR #16519:
URL: https://github.com/apache/datafusion/pull/16519#discussion_r2174894499


##
datafusion/functions-aggregate/src/array_agg.rs:
##
@@ -341,12 +341,20 @@ impl Accumulator for ArrayAggAccumulator {
 Some(values) => {
 // Make sure we don't insert empty lists
 if !values.is_empty() {
-self.values.push(values);
+// The ArrayRef might be holding a reference to its 
original input buffer, so
+// storing it here directly copied/compacted avoids over 
accounting memory
+// not used here.
+self.values
+.push(make_array(copy_array_data(&values.to_data(;
 }

Review Comment:
   Maybe to clarify, I did try this approach and the benchmarks shows on-par 
performance but I'm wondering if maintenability wise it will be alright given 
the multitude of arrow types we would need to support (I only did it for 
`int64array` since this is what the benchmarks are using):
   
   ```
   Gnuplot not found, using plotters backend
   array_agg i64 merge_batch no nulls
   time:   [59.457 ns 60.325 ns 61.260 ns]
   change: [+7.3648% +8.6679% +10.200%] (p = 0.00 < 
0.05)
   Performance has regressed.
   
   Benchmarking array_agg i64 merge_batch all nulls, 100% of nulls point to a 
zero length array: Collecting 100 samples in estimated 5. s (135
   array_agg i64 merge_batch all nulls, 100% of nulls point to a zero length 
array
   time:   [36.705 ns 37.023 ns 37.408 ns]
   change: [+2.9532% +3.9617% +5.0337%] (p = 0.00 < 
0.05)
   Performance has regressed.
   Found 2 outliers among 100 measurements (2.00%)
 2 (2.00%) high severe
   
   Benchmarking array_agg i64 merge_batch all nulls, 90% of nulls point to a 
zero length array: Collecting 100 samples in estimated 5. s (135M
   array_agg i64 merge_batch all nulls, 90% of nulls point to a zero length 
array
   time:   [36.260 ns 36.470 ns 36.725 ns]
   change: [+1.7526% +2.6531% +3.4398%] (p = 0.00 < 
0.05)
   Performance has regressed.
   Found 1 outliers among 100 measurements (1.00%)
 1 (1.00%) high mild
   
   Benchmarking array_agg i64 merge_batch 30% nulls, 100% of nulls point to a 
zero length array: Collecting 100 samples in estimated 5.0134 s (1.4
   array_agg i64 merge_batch 30% nulls, 100% of nulls point to a zero length 
array
   time:   [3.6204 µs 3.6433 µs 3.6793 µs]
   change: [+0.4207% +0.9048% +1.4450%] (p = 0.00 < 
0.05)
   Change within noise threshold.
   Found 23 outliers among 100 measurements (23.00%)
 2 (2.00%) low severe
 4 (4.00%) low mild
 8 (8.00%) high mild
 9 (9.00%) high severe
   
   Benchmarking array_agg i64 merge_batch 70% nulls, 100% of nulls point to a 
zero length array: Collecting 100 samples in estimated 5.0176 s (1.4
   array_agg i64 merge_batch 70% nulls, 100% of nulls point to a zero length 
array
   time:   [3.5311 µs 3.5430 µs 3.5560 µs]
   change: [-3.9977% -2.2613% -0.8169%] (p = 0.00 < 
0.05)
   Change within noise threshold.
   Found 6 outliers among 100 measurements (6.00%)
 6 (6.00%) high mild
   
   Benchmarking array_agg i64 merge_batch 30% nulls, 99% of nulls point to a 
zero length array: Collecting 100 samples in estimated 5.0411 s (1800
   array_agg i64 merge_batch 30% nulls, 99% of nulls point to a zero length 
array
   time:   [2.8977 ms 2.9160 ms 2.9351 ms]
   change: [-4.2235% -1.5538% +0.6217%] (p = 0.25 > 
0.05)
   No change in performance detected.
   
   Benchmarking array_agg i64 merge_batch 70% nulls, 99% of nulls point to a 
zero length array: Warming up for 3. s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase 
target time to 6.0s, enable flat sampling, or reduce sample count to 60.
   Benchmarking array_agg i64 merge_batch 70% nulls, 99% of nulls point to a 
zero length array: Collecting 100 samples in estimated 6.0400 s (5050
   array_agg i64 merge_batch 70% nulls, 99% of nulls point to a zero length 
array
   time:   [1.2365 ms 1.2409 ms 1.2454 ms]
   change: [-0.2990% +0.4915% +1.1750%] (p = 0.20 > 
0.05)
   No change in performance detected.
   Found 2 outliers among 100 measurements (2.00%)
 1 (1.00%) high mild
 1 (1.00%) high severe
   
   Benchmarking array_agg i64 merge_batch 30% nulls, 90% of nulls point to a 
zero length array: Collecting 100 samples in estimated 5.0041 s (1800
   array_agg i64 merg

Re: [PR] fix: Incorrect memory accounting in `array_agg` function [datafusion]

2025-06-25 Thread via GitHub


alamb commented on PR #16519:
URL: https://github.com/apache/datafusion/pull/16519#issuecomment-3005370005

   There appears to be an array_agg benchmark -- I will run that on this PR to 
see what it shows


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: Incorrect memory accounting in `array_agg` function [datafusion]

2025-06-25 Thread via GitHub


alamb commented on PR #16519:
URL: https://github.com/apache/datafusion/pull/16519#issuecomment-3005374575

   🤖 `./gh_compare_branch.sh` [Benchmark 
Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh)
 Running
   Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubuntu SMP Thu Apr 24 20:41:05 UTC 
2025 x86_64 x86_64 x86_64 GNU/Linux
   Comparing sami/fix-overaccounting-of-memory-in-array-agg 
(1da3e04be30c13ffa85e56f5241e211e7f7d13f2) to 
b6c8cc57760686fffe4878e69c1be27e4d9f5e68 
[diff](https://github.com/apache/datafusion/compare/b6c8cc57760686fffe4878e69c1be27e4d9f5e68..1da3e04be30c13ffa85e56f5241e211e7f7d13f2)
   Benchmarks: array_agg
   Results will be posted here when complete
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: Incorrect memory accounting in `array_agg` function [datafusion]

2025-06-25 Thread via GitHub


alamb commented on PR #16519:
URL: https://github.com/apache/datafusion/pull/16519#issuecomment-3005932928

   🤖: Benchmark completed
   
   Details
   
   
   
   ```
   group
  main   
sami_fix-overaccounting-of-memory-in-array-agg
   -
     
--
   array_agg i64 merge_batch 30% nulls, 0% of nulls point to a zero length 
array  1.00547.4±7.12µs? ?/sec9.08  5.0±0.03ms? 
?/sec
   array_agg i64 merge_batch 30% nulls, 100% of nulls point to a zero length 
array1.00  6.1±0.03µs? ?/sec7.26 44.5±0.85µs? 
?/sec
   array_agg i64 merge_batch 30% nulls, 50% of nulls point to a zero length 
array 1.00547.7±1.36µs? ?/sec9.28  5.1±0.02ms? 
?/sec
   array_agg i64 merge_batch 30% nulls, 90% of nulls point to a zero length 
array 1.00548.2±0.87µs? ?/sec9.38  5.1±0.02ms? 
?/sec
   array_agg i64 merge_batch 30% nulls, 99% of nulls point to a zero length 
array 1.00561.4±4.24µs? ?/sec9.12  5.1±0.06ms? 
?/sec
   array_agg i64 merge_batch 70% nulls, 0% of nulls point to a zero length 
array  1.00243.2±1.64µs? ?/sec8.13  1977.2±10.35µs? 
?/sec
   array_agg i64 merge_batch 70% nulls, 100% of nulls point to a zero length 
array1.00  5.9±0.02µs? ?/sec3.96 23.2±0.43µs? 
?/sec
   array_agg i64 merge_batch 70% nulls, 50% of nulls point to a zero length 
array 1.00242.1±0.26µs? ?/sec8.39  2.0±0.02ms? 
?/sec
   array_agg i64 merge_batch 70% nulls, 90% of nulls point to a zero length 
array 1.00243.5±0.44µs? ?/sec8.08  1968.8±15.78µs? 
?/sec
   array_agg i64 merge_batch 70% nulls, 99% of nulls point to a zero length 
array 1.00243.0±0.62µs? ?/sec8.26  2.0±0.01ms? 
?/sec
   array_agg i64 merge_batch all nulls, 100% of nulls point to a zero length 
array1.00 86.9±0.11ns? ?/sec1.01 87.6±0.14ns? 
?/sec
   array_agg i64 merge_batch all nulls, 90% of nulls point to a zero length 
array 1.00 86.9±0.09ns? ?/sec1.01 87.6±0.15ns? 
?/sec
   array_agg i64 merge_batch no nulls   
  1.00100.8±0.12ns? ?/sec544.1454.8±3.04µs? 
?/sec
   ```
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: Incorrect memory accounting in `array_agg` function [datafusion]

2025-06-25 Thread via GitHub


alamb commented on PR #16519:
URL: https://github.com/apache/datafusion/pull/16519#issuecomment-3005913242

   🤖 `./gh_compare_branch_bench.sh` [Benchmark 
Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch_bench.sh)
 Running
   Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubuntu SMP Thu Apr 24 20:41:05 UTC 
2025 x86_64 x86_64 x86_64 GNU/Linux
   Comparing sami/fix-overaccounting-of-memory-in-array-agg 
(1da3e04be30c13ffa85e56f5241e211e7f7d13f2) to 
b6c8cc57760686fffe4878e69c1be27e4d9f5e68 
[diff](https://github.com/apache/datafusion/compare/b6c8cc57760686fffe4878e69c1be27e4d9f5e68..1da3e04be30c13ffa85e56f5241e211e7f7d13f2)
   BENCH_NAME=array_agg
   BENCH_COMMAND=cargo bench --bench array_agg
   BENCH_FILTER=
   BENCH_BRANCH_NAME=sami_fix-overaccounting-of-memory-in-array-agg
   Results will be posted here when complete
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: Incorrect memory accounting in `array_agg` function [datafusion]

2025-06-25 Thread via GitHub


alamb commented on PR #16519:
URL: https://github.com/apache/datafusion/pull/16519#issuecomment-3005911800

   🤖 `./gh_compare_branch_bench.sh` [Benchmark 
Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch_bench.sh)
 Running
   Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubuntu SMP Thu Apr 24 20:41:05 UTC 
2025 x86_64 x86_64 x86_64 GNU/Linux
   Comparing sami/fix-overaccounting-of-memory-in-array-agg 
(1da3e04be30c13ffa85e56f5241e211e7f7d13f2) to 
b6c8cc57760686fffe4878e69c1be27e4d9f5e68 
[diff](https://github.com/apache/datafusion/compare/b6c8cc57760686fffe4878e69c1be27e4d9f5e68..1da3e04be30c13ffa85e56f5241e211e7f7d13f2)
   BENCH_NAME=sql_planner
   BENCH_COMMAND=cargo bench --bench sql_planner
   BENCH_FILTER=
   BENCH_BRANCH_NAME=sami_fix-overaccounting-of-memory-in-array-agg
   Results will be posted here when complete
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: Incorrect memory accounting in `array_agg` function [datafusion]

2025-06-25 Thread via GitHub


alamb commented on PR #16519:
URL: https://github.com/apache/datafusion/pull/16519#issuecomment-3005910021

   🤖 `./gh_compare_branch.sh` [Benchmark 
Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh)
 Running
   Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubuntu SMP Thu Apr 24 20:41:05 UTC 
2025 x86_64 x86_64 x86_64 GNU/Linux
   Comparing sami/fix-overaccounting-of-memory-in-array-agg 
(1da3e04be30c13ffa85e56f5241e211e7f7d13f2) to 
b6c8cc57760686fffe4878e69c1be27e4d9f5e68 
[diff](https://github.com/apache/datafusion/compare/b6c8cc57760686fffe4878e69c1be27e4d9f5e68..1da3e04be30c13ffa85e56f5241e211e7f7d13f2)
   Benchmarks: array_agg
   Results will be posted here when complete
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: Incorrect memory accounting in `array_agg` function [datafusion]

2025-06-25 Thread via GitHub


gabotechs commented on PR #16519:
URL: https://github.com/apache/datafusion/pull/16519#issuecomment-3004986202

   EDIT: I'm seeing that there's cases where the `merge_batch` method is used 
for something other than merging states:
   
   
https://github.com/apache/datafusion/blob/9278233e9fe34f7712370f2fa583ba94663a0678/datafusion/physical-plan/src/aggregates/no_grouping.rs#L232-L239
   
   Which means that probably we do want compaction to happen also in the 
`merge_batch` function. I think this is good to go then 👍 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: Incorrect memory accounting in `array_agg` function [datafusion]

2025-06-25 Thread via GitHub


alamb commented on PR #16519:
URL: https://github.com/apache/datafusion/pull/16519#issuecomment-3005367832

   Thanks for the code and review @gabotechs and @sfluor 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: Incorrect memory accounting in `array_agg` function [datafusion]

2025-06-25 Thread via GitHub


alamb commented on code in PR #16519:
URL: https://github.com/apache/datafusion/pull/16519#discussion_r2167112157


##
datafusion/functions-aggregate/src/array_agg.rs:
##
@@ -341,12 +341,20 @@ impl Accumulator for ArrayAggAccumulator {
 Some(values) => {
 // Make sure we don't insert empty lists
 if !values.is_empty() {
-self.values.push(values);
+// The ArrayRef might be holding a reference to its 
original input buffer, so
+// storing it here directly copied/compacted avoids over 
accounting memory
+// not used here.
+self.values
+.push(make_array(copy_array_data(&values.to_data(;
 }

Review Comment:
   I wonder if we should add a special case to copy_array_data to avoid copying 
the data when it already is only a single row / has no offset 🤔 
   
   Right now it seems to copy the data unconditionally which is a non trivial 
overhead on each row 🤔 
   
   
https://github.com/apache/datafusion/blob/a87d6f21c96ea592c6e9f2a3f566d0c9862031e5/datafusion/common/src/scalar/mod.rs#L3564-L3567
   
   Perhaps we can do that as a follow of PR
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: Incorrect memory accounting in `array_agg` function [datafusion]

2025-06-25 Thread via GitHub


gabotechs commented on code in PR #16519:
URL: https://github.com/apache/datafusion/pull/16519#discussion_r2166863158


##
datafusion/functions-aggregate/src/array_agg.rs:
##
@@ -341,12 +341,20 @@ impl Accumulator for ArrayAggAccumulator {
 Some(values) => {
 // Make sure we don't insert empty lists
 if !values.is_empty() {
-self.values.push(values);
+// The ArrayRef might be holding a reference to its 
original input buffer, so
+// storing it here directly copied/compacted avoids over 
accounting memory
+// not used here.
+self.values
+.push(make_array(copy_array_data(&values.to_data(;
 }

Review Comment:
   https://github.com/apache/datafusion/pull/16519#issuecomment-3004986202



##
datafusion/functions-aggregate/src/array_agg.rs:
##
@@ -994,6 +1002,34 @@ mod tests {
 Ok(())
 }
 
+#[test]
+fn does_not_over_account_memory_for_merge() -> Result<()> {
+let (mut acc1, mut acc2) = 
ArrayAggAccumulatorBuilder::string().build_two()?;
+
+let a1 = ListArray::from_iter_primitive::(vec![
+Some(vec![Some(0), Some(1), Some(2)]),
+Some(vec![Some(3)]),
+None,
+Some(vec![Some(4)]),
+]);
+let a2 = ListArray::from_iter_primitive::(vec![
+Some(vec![Some(0), Some(1), Some(2)]),
+Some(vec![Some(3)]),
+None,
+Some(vec![Some(4)]),
+]);
+
+acc1.merge_batch(&[Arc::new(a1.slice(0, 1))])?;
+acc2.merge_batch(&[Arc::new(a2.slice(0, 1))])?;
+
+acc1 = merge(acc1, acc2)?;

Review Comment:
   https://github.com/apache/datafusion/pull/16519#issuecomment-3004986202



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: Incorrect memory accounting in `array_agg` function [datafusion]

2025-06-25 Thread via GitHub


gabotechs commented on code in PR #16519:
URL: https://github.com/apache/datafusion/pull/16519#discussion_r2165952424


##
datafusion/functions-aggregate/src/array_agg.rs:
##
@@ -994,6 +1002,34 @@ mod tests {
 Ok(())
 }
 
+#[test]
+fn does_not_over_account_memory_for_merge() -> Result<()> {
+let (mut acc1, mut acc2) = 
ArrayAggAccumulatorBuilder::string().build_two()?;
+
+let a1 = ListArray::from_iter_primitive::(vec![
+Some(vec![Some(0), Some(1), Some(2)]),
+Some(vec![Some(3)]),
+None,
+Some(vec![Some(4)]),
+]);
+let a2 = ListArray::from_iter_primitive::(vec![
+Some(vec![Some(0), Some(1), Some(2)]),
+Some(vec![Some(3)]),
+None,
+Some(vec![Some(4)]),
+]);
+
+acc1.merge_batch(&[Arc::new(a1.slice(0, 1))])?;
+acc2.merge_batch(&[Arc::new(a2.slice(0, 1))])?;
+
+acc1 = merge(acc1, acc2)?;

Review Comment:
   The `merge_batch` functions do not receive arbitrary data, it receives the 
results of calling `state()` in other accumulators. A fairer test would be to 
do something like:
   
   ```suggestion
   acc1.update_batch(&[Arc::clone(a1.values())])?;
   acc2.update_batch(&[Arc::clone(a2.values())])?;
   
   acc1 = merge(acc1, acc2)?;
   ```
   
   If you run this, you would notice that the test actually passes without your 
changes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: Incorrect memory accounting in `array_agg` function [datafusion]

2025-06-25 Thread via GitHub


gabotechs commented on code in PR #16519:
URL: https://github.com/apache/datafusion/pull/16519#discussion_r2165952424


##
datafusion/functions-aggregate/src/array_agg.rs:
##
@@ -994,6 +1002,34 @@ mod tests {
 Ok(())
 }
 
+#[test]
+fn does_not_over_account_memory_for_merge() -> Result<()> {
+let (mut acc1, mut acc2) = 
ArrayAggAccumulatorBuilder::string().build_two()?;
+
+let a1 = ListArray::from_iter_primitive::(vec![
+Some(vec![Some(0), Some(1), Some(2)]),
+Some(vec![Some(3)]),
+None,
+Some(vec![Some(4)]),
+]);
+let a2 = ListArray::from_iter_primitive::(vec![
+Some(vec![Some(0), Some(1), Some(2)]),
+Some(vec![Some(3)]),
+None,
+Some(vec![Some(4)]),
+]);
+
+acc1.merge_batch(&[Arc::new(a1.slice(0, 1))])?;
+acc2.merge_batch(&[Arc::new(a2.slice(0, 1))])?;
+
+acc1 = merge(acc1, acc2)?;

Review Comment:
   The `merge_batch` functions do not receive arbitrary data, it receives the 
results of calling `state()` in other accumulators. A fairer test would be to 
do something like:
   
   ```suggestion
   acc1.update_batch(&[Arc::new(a1.slice(0, 1))])?;
   acc2.update_batch(&[Arc::new(a2.slice(0, 1))])?;
   
   acc1 = merge(acc1, acc2)?;
   ```
   
   With this, you would notice that the test result is the same regardless of 
the changes in this PR



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: Incorrect memory accounting in `array_agg` function [datafusion]

2025-06-25 Thread via GitHub


gabotechs commented on code in PR #16519:
URL: https://github.com/apache/datafusion/pull/16519#discussion_r2165946809


##
datafusion/functions-aggregate/src/array_agg.rs:
##
@@ -341,12 +341,20 @@ impl Accumulator for ArrayAggAccumulator {
 Some(values) => {
 // Make sure we don't insert empty lists
 if !values.is_empty() {
-self.values.push(values);
+// The ArrayRef might be holding a reference to its 
original input buffer, so
+// storing it here directly copied/compacted avoids over 
accounting memory
+// not used here.
+self.values
+.push(make_array(copy_array_data(&values.to_data(;
 }

Review Comment:
   🤔 I'm not sure if this will solve the issue. Keep in mind that the 
`merge_batch` method argument receives the states of other accumulators, which 
already hold "compacted" data, so I'd expect this compaction here to be 
unnecessary.



##
datafusion/functions-aggregate/src/array_agg.rs:
##
@@ -994,6 +1002,34 @@ mod tests {
 Ok(())
 }
 
+#[test]
+fn does_not_over_account_memory_for_merge() -> Result<()> {
+let (mut acc1, mut acc2) = 
ArrayAggAccumulatorBuilder::string().build_two()?;
+
+let a1 = ListArray::from_iter_primitive::(vec![
+Some(vec![Some(0), Some(1), Some(2)]),
+Some(vec![Some(3)]),
+None,
+Some(vec![Some(4)]),
+]);
+let a2 = ListArray::from_iter_primitive::(vec![
+Some(vec![Some(0), Some(1), Some(2)]),
+Some(vec![Some(3)]),
+None,
+Some(vec![Some(4)]),
+]);
+
+acc1.merge_batch(&[Arc::new(a1.slice(0, 1))])?;
+acc2.merge_batch(&[Arc::new(a2.slice(0, 1))])?;
+
+acc1 = merge(acc1, acc2)?;

Review Comment:
   The `merge_batch` functions do not receive arbitrary data, they receive the 
results of calling `state()` in other accumulators. A fairer test would be to 
do something like:
   
   ```suggestion
   acc1.update_batch(&[Arc::clone(a1.values())])?;
   acc2.update_batch(&[Arc::clone(a2.values())])?;
   
   acc1 = merge(acc1, acc2)?;
   ```
   
   If you run this, you would notice that the test actually passes without your 
changes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]