notfilippo commented on issue #16841: URL: https://github.com/apache/datafusion/issues/16841#issuecomment-3102449595
According to the docs in [`MemoryPool`](https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/trait.MemoryPool.html#memory-management-design): > DataFusion’s design ONLY limits operators that require “large” amounts of memory (proportional to number of input rows), such as GroupByHashExec. It does NOT track and limit memory used internally by other operators such as DataSourceExec or the RecordBatches that flow between operators. Furthermore, operators should not reserve memory for the batches they produce. Instead, if a parent operator needs to hold batches from its children in memory for an extended period, it is the parent operator’s responsibility to reserve the necessary memory for those batches. I would argue that when accounting anything that is not owned (like `ArrayRef`, `ScalarValue::List(...)`, ...), the memory reserved should be equal to the bytes occupied by the reference itself (`size_of::<Arc>()` or potentially even 0) and not the size of referenced object. The caveat I see with this approach is that in all the places where we actually copy data (via something like `ScalarValue::compact`) we need to be able to correctly account the size of the new copy in addition to the size of the reference. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org