alamb commented on issue #16841:
URL: https://github.com/apache/datafusion/issues/16841#issuecomment-3113393092

   > Some imply copying/compacting just the necessary slice of data from the 
underlying buffer (ScalarValue::compact) so that it's actually owned by the 
consumer, but in certain cases that could take a hit to performance.
   
   > The solution of copying can be less efficient in terms of CPU but is the 
best in terms of retained memory (it keeps the goodies of the streaming model 
even in nasty corner cases). 
   
   I think you have identified the key tradeoff:  It is often faster CPU wise 
to reuse the existing allocation but that risks holding on to more memory than 
necessary. Copying only the data needed results in less memory use but consumes 
CPU time. 
   
   I think there are several places in DataFusion that deal with a similar 
issue, that perhaps we can take inspiration from. I think at a high level they 
all basically employ a heuristic to decide when copying is better than reusing 
the allocation
   
   1. TopK: 
[`maybe_compact`](https://github.com/apache/datafusion/blob/9bb309c88ae94d495b487a97cb635c52225689e8/datafusion/physical-plan/src/topk/mod.rs#L695-L694)
   2. `StringView` (and calling "gc" on it), for example in 
[`BatchCoalescer`](https://github.com/apache/datafusion/blob/9d2f04996604e709ee440b65f41e7b882f50b788/datafusion/physical-plan/src/coalesce/mod.rs#L202-L201)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to