gabotechs opened a new issue, #16841:
URL: https://github.com/apache/datafusion/issues/16841

   The current model used in DataFusion for measuring memory consumption 
assumes that the different entities that can consume memory (accumulators, 
joins, etc.) own the data they hold. This does not exactly match 
how memory is managed in arrow-rs, where underlying data buffers may be 
referenced more than once from different parts of DataFusion.
   
   For example: `ArrayAggAccumulator` accumulates data by storing `ArrayRef`s, 
so even if it's accumulating and retaining data, the actual memory was 
allocated before `ArrayAggAccumulator` came into play, and the accumulator is 
only adding a reference to it:
   
   
https://github.com/apache/datafusion/blob/ac407a19e030bfee092a0992093b886bde86d97e/datafusion/functions-aggregate/src/array_agg.rs#L214
   
   For that case, one could argue `ArrayAggAccumulator` is not really consuming 
any memory, as it's not performing new allocations, and the underlying data was 
there potentially even before the `ArrayAggAccumulator` was instantiated.
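
   To make the mismatch concrete, here is a minimal sketch that uses only the Rust standard library. It assumes `Arc<Vec<u8>>` as a stand-in for an arrow-rs buffer, and `RefAccumulator` is a hypothetical type, not DataFusion's `ArrayAggAccumulator`. It shows that retaining a reference copies no bytes, yet a consumer that reports the full size of everything it references can make the total reported memory exceed what was actually allocated:

```rust
use std::sync::Arc;

/// Stand-in for an Arrow data buffer: bytes behind a reference-counted pointer.
/// (Assumption for illustration only; arrow-rs has its own buffer type.)
type SharedBuffer = Arc<Vec<u8>>;

/// Hypothetical accumulator that retains references to incoming data,
/// similar in spirit to an accumulator that stores `ArrayRef`s.
struct RefAccumulator {
    retained: Vec<SharedBuffer>,
}

impl RefAccumulator {
    fn new() -> Self {
        Self { retained: Vec::new() }
    }

    /// Retaining a batch only bumps a reference count; no bytes are copied.
    fn update(&mut self, batch: &SharedBuffer) {
        self.retained.push(Arc::clone(batch));
    }

    /// Naive accounting: report the full length of every referenced buffer,
    /// as if this accumulator owned it.
    fn reported_size(&self) -> usize {
        self.retained.iter().map(|b| b.len()).sum()
    }
}

/// Two accumulators retain the same 1 KiB buffer.
/// Returns (bytes reported by both accumulators, bytes actually allocated).
fn total_reported_for_shared_buffer() -> (usize, usize) {
    let buffer: SharedBuffer = Arc::new(vec![0u8; 1024]);

    let mut a = RefAccumulator::new();
    let mut b = RefAccumulator::new();
    a.update(&buffer);
    b.update(&buffer);

    let reported = a.reported_size() + b.reported_size();
    let allocated = buffer.len();
    (reported, allocated)
}
```

   In this sketch a single allocation is reported once per consumer that holds a reference to it, which is the gap between "memory consumed" and "memory retained" discussed in this issue.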
   
   There have been several past attempts to address memory 
accounting issues in the DataFusion code:
   - https://github.com/apache/datafusion/pull/15924
   - https://github.com/apache/datafusion/pull/16346
   - https://github.com/apache/datafusion/pull/16519
   - https://github.com/apache/datafusion/pull/16816
   
   Some of them involve copying/compacting just the necessary slice of data 
out of the underlying buffer (`ScalarValue::compact`) so that it's actually 
owned by the consumer, but in certain cases that could hurt performance.
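
   The trade-off can be sketched with the standard library alone. `SliceView` and its methods are hypothetical names for illustration, and `Arc<Vec<u8>>` again stands in for an arrow-rs buffer. The `compact` method here only loosely mirrors the idea behind `ScalarValue::compact`: it copies the referenced window into a fresh allocation so the consumer owns exactly what it reports, at the cost of a copy.

```rust
use std::sync::Arc;

/// Hypothetical zero-copy view: a window into a shared buffer.
struct SliceView {
    buffer: Arc<Vec<u8>>,
    offset: usize,
    len: usize,
}

impl SliceView {
    /// Bytes kept alive by this view: the whole underlying buffer.
    fn retained_bytes(&self) -> usize {
        self.buffer.len()
    }

    /// Bytes logically referenced by this view: just the window.
    fn slice_bytes(&self) -> usize {
        self.len
    }

    /// Copy only the referenced window into a new, exclusively owned buffer.
    /// This costs an allocation and a copy proportional to the window size.
    fn compact(&self) -> SliceView {
        let owned = self.buffer[self.offset..self.offset + self.len].to_vec();
        let len = owned.len();
        SliceView { buffer: Arc::new(owned), offset: 0, len }
    }
}

/// A 16-byte window into a 4 KiB buffer, before and after compaction.
/// Returns (retained before, window size, retained after compaction).
fn sizes_before_and_after_compaction() -> (usize, usize, usize) {
    let big = Arc::new(vec![7u8; 4096]);
    let view = SliceView { buffer: Arc::clone(&big), offset: 100, len: 16 };
    let compacted = view.compact();
    (view.retained_bytes(), view.slice_bytes(), compacted.retained_bytes())
}
```

   Before compaction, the view keeps 4096 bytes alive while referencing only 16 of them; after compaction, the retained and referenced sizes agree, but the data has been copied.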
   
   ---
   
   The main point of this issue is to start a conversation about what the 
ideal approach for memory counting could be:
   - Is copying/compacting accumulated data and calling 
`get_array_memory_size()`, or storing array references and calling 
`get_slice_memory_size()`, acceptable for measuring memory consumption?
   - Should the memory counting model in DataFusion be expanded so that it 
takes into account not just memory consumed, but also memory retained because of 
references to shared buffers?



