notfilippo commented on issue #16841:
URL: https://github.com/apache/datafusion/issues/16841#issuecomment-3102449595

   According to the docs in 
[`MemoryPool`](https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/trait.MemoryPool.html#memory-management-design):
   
   > DataFusion’s design ONLY limits operators that require “large” amounts of 
memory (proportional to number of input rows), such as GroupByHashExec. It does 
NOT track and limit memory used internally by other operators such as 
DataSourceExec or the RecordBatches that flow between operators. Furthermore, 
operators should not reserve memory for the batches they produce. Instead, if a 
parent operator needs to hold batches from its children in memory for an 
extended period, it is the parent operator’s responsibility to reserve the 
necessary memory for those batches.
   
   I would argue that when accounting anything that is not owned (like 
`ArrayRef`, `ScalarValue::List(...)`, ...), the memory reserved should be equal 
to the bytes occupied by the reference itself (`size_of::<Arc>()` or 
potentially even 0) and not the size of referenced object.
   
   The caveat I see with this approach is that in all the places where we 
actually copy data (via something like `ScalarValue::compact`) we need to be 
able to correctly account the size of the new copy in addition to the size of 
the reference.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to