Nachiket-Roy opened a new pull request, #19751:
URL: https://github.com/apache/datafusion/pull/19751

   ## Which issue does this PR close?
   
   Related to #19679
   Related to #19695
   
   ## Rationale for this change
   
   It was observed that downstream operators (such as `SortExec`) can receive 
extremely large `RecordBatch` outputs from `AggregateExec`, violating the 
implicit `batch_size` convention used throughout the physical plan.
   
   An initial hypothesis was that normalizing batch sizes at the 
`AggregateExec` -> `SortExec` boundary would be sufficient. However, while 
implementing and testing this, it became clear that:
   - `GroupedHashAggregateStream` allocates group keys and accumulator state 
before memory reservation checks are enforced
   - By the time emission, spilling, or streaming merge logic runs, memory 
pressure may already be unavoidable
   - Even when output batches are sliced to` batch_size`, internal state can 
exceed memory limits
   - As a result, downstream operators still observe oversized memory usage
   
   This indicates that the issue is not a simple emission bug, but a design 
boundary between:
   - aggregation state growth
   - memory reservation timing
   - `batch_size` as an output convention vs. a hard invariant
   
   This PR aims to:
   - reduce memory pressure in specific transitions 
   - make failure modes more predictable
   - provide a basis for further discussion
   
   - ## What changes are included in this PR?
   
   - Releases hash-aggregation memory before switching to streaming merge
   - Ensures memory reservations are explicitly reset when transitioning 
execution phases
   - Avoids retaining aggregation state after it is no longer needed
   - Tightens execution-state transitions to prevent double emission and late 
memory checks
   - Improves internal invariants around spill and merge behavior
   
   Importantly, this PR does not claim to fully enforce `batch_size` at all 
stages, nor does it prevent all oversized intermediate states. Instead, it 
reduces memory pressure and clarifies current limitations.
   
   ## Are these changes tested?
   
   Yes, but edge cases revealed design limitations rather than simple bugs
   The PR is submitted primarily for review and discussion rather than as a 
final fix.
   
   ## Are there any user-facing changes?
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to