isidentical commented on issue #3579: URL: https://github.com/apache/arrow-datafusion/issues/3579#issuecomment-1255596028
I've been looking into this and noticed that there are currently two places where this needs to be handled (and from what I've understood, they are actually separate optimizations).

The first place is the one you mentioned in this issue: during the spill process we sort everything we have and limit the actual output. This means that if we have 10 in-memory batches with a limit of 50, and we can easily hold 100 records, we won't need to spill at all, since after sorting everything in place the resulting batch would contain at most 50 rows: https://github.com/apache/arrow-datafusion/blob/add10a67c8e16aca0a683957ddbea29a2a3a4156/datafusion/core/src/physical_plan/sorts/sort.rs#L281-L282

The second place is where we make the initial allocation. We calculate the size of the incoming batch, and then further down we assume that the batch produced by the sort is the same size as the input batch: https://github.com/apache/arrow-datafusion/blob/add10a67c8e16aca0a683957ddbea29a2a3a4156/datafusion/core/src/physical_plan/sorts/sort.rs#L119-L121

But thanks to the TopK PR, this is no longer the case. The sort may actually shrink the batch while we are still inserting it, so we could in theory free the redundant memory right away. That would help a lot, since the next `insert_batch` would then have much more free space to operate on.

In a very rough benchmark (on a very small subset of the ClickHouse benchmark data), just adjusting the used memory size properly saves around ~320 spills (it goes from 320 spills to no spills).
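To illustrate the second optimization, here is a minimal, self-contained sketch of the memory-accounting idea. The `MemoryTracker` and `sort_with_limit` names, the byte sizes, and the fixed row width are all hypothetical stand-ins, not DataFusion's actual `MemoryManager` or sort APIs; the point is only that after a sort-with-limit shrinks a batch, the reservation can be shrunk by the difference instead of staying at the full input size.

```rust
/// Toy memory tracker standing in for a memory pool: `try_grow` fails
/// (i.e. a spill would be needed) when the reservation would exceed
/// the pool size, and `shrink` releases bytes back to the pool.
struct MemoryTracker {
    used: usize,
    pool_size: usize,
}

impl MemoryTracker {
    fn new(pool_size: usize) -> Self {
        Self { used: 0, pool_size }
    }

    /// Try to reserve `bytes`; returns false when it would overflow
    /// the pool (the point at which a real sort would spill).
    fn try_grow(&mut self, bytes: usize) -> bool {
        if self.used + bytes > self.pool_size {
            return false;
        }
        self.used += bytes;
        true
    }

    fn shrink(&mut self, bytes: usize) {
        self.used -= bytes;
    }
}

/// Stand-in for a TopK-style sort: sort the batch and keep only the
/// first `fetch` rows, so the output can be much smaller than the input.
fn sort_with_limit(mut batch: Vec<i64>, fetch: usize) -> Vec<i64> {
    batch.sort_unstable();
    batch.truncate(fetch);
    batch
}

fn main() {
    const ROW_SIZE: usize = 8; // bytes per i64 row in this toy model
    let mut tracker = MemoryTracker::new(1024);

    // Reserve memory for the raw (unsorted) input batch.
    let input: Vec<i64> = (0..100).rev().collect();
    let input_size = input.len() * ROW_SIZE;
    assert!(tracker.try_grow(input_size));

    // Sorting with a limit of 10 shrinks the batch from 100 to 10 rows...
    let sorted = sort_with_limit(input, 10);
    let sorted_size = sorted.len() * ROW_SIZE;

    // ...so release the difference instead of keeping the full input
    // size reserved; the next insert_batch then has far more headroom.
    tracker.shrink(input_size - sorted_size);

    assert_eq!(tracker.used, sorted_size);
    println!("used after adjustment: {} bytes", tracker.used);
}
```

With 100-row batches and a limit of 10, keeping the full input size reserved exhausts this toy 1024-byte pool after the second batch, while shrinking to the sorted size leaves room for many more inserts, which mirrors the spill reduction described above.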
