sfluor opened a new issue, #16517:
URL: https://github.com/apache/datafusion/issues/16517
### Describe the bug
When aggregating with `array_agg` using a limited memory pool we hit
ResourceExhausted errors in the `array_agg` accumulator.
After investigation this seems to come from over-accounting memory from
shared arrow buffers. We might keep a single value from a 8k rows arrow array
but will account the memory taken by the whole arrow array rather than just the
value we kept.
### To Reproduce
Run `datafusion-cli` with a memory limit `datafusion-cli -m 100m`
Create a table:
```sql
CREATE table logs AS
SELECT
('Chrome-' || (random() * 1000)::int) as user_agent,
(n.value % 40000 )::text as client_id,
(n.value % 21 ==0) as is_internal
FROM
generate_series(1, 300000) n;
```
Try running a simple `array_agg` query:
```
WITH
user_agents AS (
SELECT
DISTINCT user_agent, client_id
FROM
logs
WHERE
NOT is_internal
)
SELECT
client_id,
ARRAY_AGG(user_agent) AS user_agent
FROM
user_agents
GROUP BY
client_id
```
Fails with:
```
Resources exhausted: Additional allocation failed with top memory consumers
(across reservations) as:
GroupedHashAggregateStream[6] (array_agg(user_agents.user_agent))#44(can
spill: true) consumed 2.9 MB,
GroupedHashAggregateStream[2] (array_agg(user_agents.user_agent))#36(can
spill: true) consumed 2.2 MB,
GroupedHashAggregateStream[3] (array_agg(user_agents.user_agent))#38(can
spill: true) consumed 2.2 MB.
Error: Failed to allocate additional 609.0 MB for
GroupedHashAggregateStream[0] (array_agg(user_agents.user_agent)) with 0.0 B
already allocated for this reservation - 75.3 MB remain available
for the total pool
```
Further inspection (adding debug prints) shows that we over-report memory in
the `array_agg` `merge_batch`
[function](https://github.com/apache/datafusion/blob/main/datafusion/functions-aggregate/src/array_agg.rs#L344):
```
Addr:0x418911c18b0 Data type:Utf8 Length: 1, Size of first value: 9, Mem
size: 162362
[X][Utf8] 1 Size before: 56 + mem_size:162362 -> 9
[X][Utf8] 1 Size after: 162482
```
### Expected behavior
Query should succeed within the memory limits
### Additional context
Similar to: https://github.com/apache/datafusion/issues/16055
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]