isidentical commented on issue #3579: URL: https://github.com/apache/arrow-datafusion/issues/3579#issuecomment-1255596028
I've been looking into this and noticed that there are currently two places where this needs to be handled (and from what I've understood, they are actually separate optimizations).

The first place is the one you mentioned in this issue: during the spill process we sort everything we have and limit the actual output. This means that if we have 10 in-memory batches with a limit of 50, and we can easily hold 100 records, we won't need to spill at all, since after sorting everything in place the resulting batch would contain at most 50 rows: https://github.com/apache/arrow-datafusion/blob/add10a67c8e16aca0a683957ddbea29a2a3a4156/datafusion/core/src/physical_plan/sorts/sort.rs#L281-L282

The second place is where we make the initial allocation. We calculate the size of the incoming batch, and then further down we assume that the batch produced by the sort is the same size as the input batch: https://github.com/apache/arrow-datafusion/blob/add10a67c8e16aca0a683957ddbea29a2a3a4156/datafusion/core/src/physical_plan/sorts/sort.rs#L119-L121

But thanks to the TopK PR, this is no longer the case. The sort may actually shrink the batch while we are still inserting it, so we could in theory free the redundant memory right away. That would help a lot, since the next `insert_batch` would then have much more free space to operate on.

In a very rough benchmark (on a very small subset of the ClickHouse benchmark data), just adjusting the used memory size properly saves around ~320 spills (it goes from 320 spills to no spills).
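To illustrate the second optimization, here is a minimal, self-contained sketch of the memory-accounting idea. The `MemoryTracker` and `sort_with_limit` names, the byte sizes, and the fixed row width are all hypothetical stand-ins, not DataFusion's actual `MemoryManager` or sort APIs; the point is only that after a sort-with-limit shrinks a batch, the reservation can be shrunk by the difference instead of staying at the full input size.

```rust
/// Toy memory tracker standing in for a memory pool: `try_grow` fails
/// (i.e. a spill would be needed) when the reservation would exceed
/// the pool size, and `shrink` releases bytes back to the pool.
struct MemoryTracker {
    used: usize,
    pool_size: usize,
}

impl MemoryTracker {
    fn new(pool_size: usize) -> Self {
        Self { used: 0, pool_size }
    }

    /// Try to reserve `bytes`; returns false when it would overflow
    /// the pool (the point at which a real sort would spill).
    fn try_grow(&mut self, bytes: usize) -> bool {
        if self.used + bytes > self.pool_size {
            return false;
        }
        self.used += bytes;
        true
    }

    fn shrink(&mut self, bytes: usize) {
        self.used -= bytes;
    }
}

/// Stand-in for a TopK-style sort: sort the batch and keep only the
/// first `fetch` rows, so the output can be much smaller than the input.
fn sort_with_limit(mut batch: Vec<i64>, fetch: usize) -> Vec<i64> {
    batch.sort_unstable();
    batch.truncate(fetch);
    batch
}

fn main() {
    const ROW_SIZE: usize = 8; // bytes per i64 row in this toy model
    let mut tracker = MemoryTracker::new(1024);

    // Reserve memory for the raw (unsorted) input batch.
    let input: Vec<i64> = (0..100).rev().collect();
    let input_size = input.len() * ROW_SIZE;
    assert!(tracker.try_grow(input_size));

    // Sorting with a limit of 10 shrinks the batch from 100 to 10 rows...
    let sorted = sort_with_limit(input, 10);
    let sorted_size = sorted.len() * ROW_SIZE;

    // ...so release the difference instead of keeping the full input
    // size reserved; the next insert_batch then has far more headroom.
    tracker.shrink(input_size - sorted_size);

    assert_eq!(tracker.used, sorted_size);
    println!("used after adjustment: {} bytes", tracker.used);
}
```

With 100-row batches and a limit of 10, keeping the full input size reserved exhausts this toy 1024-byte pool after the second batch, while shrinking to the sorted size leaves room for many more inserts, which mirrors the spill reduction described above.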
