EmilyMatt opened a new issue, #19481:
URL: https://github.com/apache/datafusion/issues/19481

   ### Is your feature request related to a problem or challenge?
   
   In many operators, when the input is exhausted, the output is emitted all at once. This leads to huge batches, which can be inefficient in some cases and cause OOM in others.
   
   For example:
   Using a fair pool implementation where each operator gets 1GB of memory, suppose an aggregate's total size is 0.99GB.
   The aggregate will `Emit::All` and output the 0.99GB batch, and a following sort will immediately try to allocate memory for the sorting (say, 2x the batch size).
   The memory pool has no idea that the aggregate is done at this point, as its reservation still exists, so it will keep maintaining 1GB per operator.
   
   The sort now attempts to reserve about 1.98GB, almost twice what it is allocated.
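   The accounting above can be sketched with a toy pool. This is not DataFusion's actual `MemoryPool` API; `FairPool` and its methods are hypothetical stand-ins that only illustrate why the sort's request fails while the aggregate's reservation is still alive:
   
   ```rust
   const GB: usize = 1 << 30;
   
   /// Hypothetical fair pool: a fixed capacity shared by its consumers.
   struct FairPool {
       capacity: usize,
       reserved: usize,
   }
   
   impl FairPool {
       fn new(capacity: usize) -> Self {
           Self { capacity, reserved: 0 }
       }
   
       /// Try to grow a reservation by `bytes`; fail if the pool is exhausted.
       fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
           if self.reserved + bytes > self.capacity {
               Err(format!(
                   "cannot reserve {} bytes, only {} remain",
                   bytes,
                   self.capacity - self.reserved
               ))
           } else {
               self.reserved += bytes;
               Ok(())
           }
       }
   
       /// Release `bytes` back to the pool.
       fn shrink(&mut self, bytes: usize) {
           self.reserved -= bytes;
       }
   }
   
   fn main() {
       // 2GB total, i.e. 1GB "per operator" for two operators.
       let mut pool = FairPool::new(2 * GB);
   
       // The aggregate still holds its 0.99GB reservation...
       let agg_bytes = (0.99 * GB as f64) as usize;
       pool.try_grow(agg_bytes).unwrap();
   
       // ...so the sort's ~1.98GB request cannot be satisfied.
       let sort_bytes = 2 * agg_bytes;
       assert!(pool.try_grow(sort_bytes).is_err());
   
       // Once the aggregate's reservation is released, the same request succeeds.
       pool.shrink(agg_bytes);
       assert!(pool.try_grow(sort_bytes).is_ok());
   }
   ```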
   
   ### Describe the solution you'd like
   
   If the aggregate (and any other operator, for that matter) respected `batch_size`, it would output smaller batches that are easily handled by `ExternalSorter`, which spills when it needs to. At some point, when poll returns `Poll::Ready(None)`, we can drop the input stream; its reservation will then be released, and the sort will get even more memory.
   That would make every application much more resilient under any memory constraint.
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

