[PR] Respect memory pool size when in GroupedHashAggregateStream when spilling is not possible [datafusion]

via GitHub Thu, 11 Dec 2025 07:44:26 -0800


pepijnve opened a new pull request, #19287:
URL: https://github.com/apache/datafusion/pull/19287


   ## Which issue does this PR close?
   
   - Closes #19286.
   
   ## Rationale for this change
   
   GroupedHashAggregateStream currently always reports to the memory tracking 
subsystem that it can spill, even though whether it actually can depends on the 
aggregation mode and the grouping order.
   The optimistic logic in `group_aggregate_batch` also does not correctly take 
these conditions into account.
   
   ## What changes are included in this PR?
   
   - Correctly set `MemoryConsumer::can_spill` to reflect actual spilling 
behaviour
   - Align behaviour of `group_aggregate_batch` and 
`spill_previous_if_necessary`
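
A minimal sketch of the idea behind both bullets, not the actual patch: derive the spill capability once from the aggregation mode and the grouping order, and consult the same predicate wherever a failed reservation is handled. The names `AggMode`, `GroupOrder`, `can_spill` and `on_reservation_failure` are hypothetical, and the exact condition (no spilling in partial mode or when the input has a group ordering) is an assumption about the stream's behaviour.

```rust
/// Hypothetical stand-in for the aggregation mode (not DataFusion's type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum AggMode {
    Partial,
    Final,
}

/// Hypothetical stand-in for the grouping order of the input.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum GroupOrder {
    Unordered,
    PartiallyOrdered,
    FullyOrdered,
}

/// What the stream does when a memory reservation fails.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum PressureAction {
    Spill,
    ReportExhaustion,
}

/// Assumption: spilling is only possible outside partial mode and only
/// when the input carries no group ordering.
fn can_spill(mode: AggMode, order: GroupOrder) -> bool {
    mode != AggMode::Partial && order == GroupOrder::Unordered
}

/// Every code path that reacts to memory pressure goes through the same
/// predicate, so the flag reported to the memory pool and the runtime
/// behaviour cannot drift apart.
fn on_reservation_failure(mode: AggMode, order: GroupOrder) -> PressureAction {
    if can_spill(mode, order) {
        PressureAction::Spill
    } else {
        PressureAction::ReportExhaustion
    }
}
```

With this shape, the value passed to `MemoryConsumer::can_spill` and the decision taken on a failed reservation come from one function rather than from two independently maintained conditions.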
   
   ## Are these changes tested?
   
   Added a test case that demonstrates the problem. It may not actually be 
necessary, since other existing tests started failing as well. I am still 
working on correcting those.
   
   ## Are there any user-facing changes?
   
   Yes, memory exhaustion may be reported much earlier in the query pipeline 
than is currently the case. In my local tests with a per-consumer memory limit 
of 32MiB, grouped aggregation would consume 480MiB in practice. The exhaustion 
was only reported later by ExternalSortExec, which failed when trying to 
reserve that much memory.
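
To illustrate the user-facing effect, here is a toy model of a per-consumer limit, assuming a consumer that cannot spill and therefore has to surface the failed reservation itself. `ToyPool`, `try_grow` and `reserve_in_steps` are hypothetical and are not DataFusion's memory pool API. The 32MiB limit and the 480MiB target come from the local test described above; the 16MiB step size is made up.

```rust
const MIB: usize = 1024 * 1024;

/// Toy per-consumer memory budget (hypothetical, for illustration only).
struct ToyPool {
    limit: usize,
    used: usize,
}

impl ToyPool {
    fn new(limit: usize) -> Self {
        ToyPool { limit, used: 0 }
    }

    /// Reserve `bytes` more, or fail without changing the reservation.
    fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
        let wanted = self.used.saturating_add(bytes);
        if wanted > self.limit {
            return Err(format!(
                "cannot reserve {} bytes: {} of {} already in use",
                bytes, self.used, self.limit
            ));
        }
        self.used = wanted;
        Ok(())
    }
}

/// Grow the reservation in fixed steps until `target` is reached.
/// A consumer that cannot spill propagates the first failure instead of
/// continuing to allocate untracked memory. `step` must be non-zero.
fn reserve_in_steps(pool: &mut ToyPool, target: usize, step: usize) -> Result<(), String> {
    while pool.used < target {
        pool.try_grow(step)?;
    }
    Ok(())
}
```

Calling `reserve_in_steps(&mut ToyPool::new(32 * MIB), 480 * MIB, 16 * MIB)` returns an error as soon as the reservation would exceed 32MiB, which corresponds to the aggregation reporting exhaustion itself rather than a downstream operator discovering it.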


