alamb commented on issue #16841: URL: https://github.com/apache/datafusion/issues/16841#issuecomment-3113393092
> Some imply copying/compacting just the necessary slice of data from the underlying buffer (ScalarValue::compact) so that it's actually owned by the consumer, but in certain cases that could take a hit to performance. > The solution of copying can be less efficient in terms of CPU but is the best in terms of retained memory (it keeps the goodies of the streaming model even in nasty corner cases). I think you have identified the key tradeoff: It is often faster CPU wise to reuse the existing allocation but that risks holding on to more memory than necessary. Copying only the data needed results in less memory use but consumes CPU time. I think there are several places in DataFusion that deal with a similar issue, that perhaps we can take inspiration from. I think at a high level they all basically employ a heuristic to decide when copying is better than reusing the allocation 1. TopK: [`maybe_compact`](https://github.com/apache/datafusion/blob/9bb309c88ae94d495b487a97cb635c52225689e8/datafusion/physical-plan/src/topk/mod.rs#L695-L694) 2. `StringView` (and calling "gc" on it), for example in [`BatchCoalescer`](https://github.com/apache/datafusion/blob/9d2f04996604e709ee440b65f41e7b882f50b788/datafusion/physical-plan/src/coalesce/mod.rs#L202-L201) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org