gabotechs opened a new issue, #16841: URL: https://github.com/apache/datafusion/issues/16841
The current model used in DataFusion for measuring memory consumption assumes that the entities that can consume memory (accumulators, joins, etc.) own the data they hold. This does not exactly match how memory is managed in arrow-rs, where an underlying data buffer might be referenced more than once in different parts of DataFusion.

For example, `ArrayAggAccumulator` accumulates data by storing `ArrayRef`s, so even though it accumulates and retains data, the actual memory was allocated before `ArrayAggAccumulator` came into play, and the accumulator only adds a reference to it:

https://github.com/apache/datafusion/blob/ac407a19e030bfee092a0992093b886bde86d97e/datafusion/functions-aggregate/src/array_agg.rs#L214

In that case, one could argue that `ArrayAggAccumulator` is not really consuming any memory: it performs no new allocations, and the underlying data potentially existed even before the `ArrayAggAccumulator` was instantiated.

There have been several attempts in the past to address memory accounting issues in DataFusion code:

- https://github.com/apache/datafusion/pull/15924
- https://github.com/apache/datafusion/pull/16346
- https://github.com/apache/datafusion/pull/16519
- https://github.com/apache/datafusion/pull/16816

Some of them involve copying/compacting just the necessary slice of data from the underlying buffer (`ScalarValue::compact`) so that it is actually owned by the consumer, but in certain cases that can hurt performance.

---

The main point of this issue is to start a conversation about what the ideal approach for memory counting could be:

- Is copying/compacting accumulated data and calling `get_array_memory_size()`, or storing array references and calling `get_slice_memory_size()`, acceptable for measuring memory consumption?
- Should the memory counting model in DataFusion be expanded so that it accounts not just for memory consumed, but also for memory retained through references to shared buffers?
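
To make the trade-off concrete, here is a minimal, std-only sketch of the accounting problem. It does not use arrow-rs: `Slice` and the three `*_total` helpers are hypothetical stand-ins invented for illustration. `retained_size` and `slice_size` only loosely mirror the difference between counting a whole shared buffer and counting just the referenced range.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical stand-in for an Arrow array: a view into a shared buffer.
#[derive(Clone)]
struct Slice {
    buffer: Arc<Vec<u8>>,
    offset: usize,
    len: usize,
}

impl Slice {
    /// Size of the whole buffer this slice keeps alive.
    fn retained_size(&self) -> usize {
        self.buffer.len()
    }

    /// Number of bytes the slice actually points at.
    fn slice_size(&self) -> usize {
        self.len
    }

    /// Copy only the referenced bytes into a fresh, exclusively owned buffer.
    fn compact(&self) -> Slice {
        let data = self.buffer[self.offset..self.offset + self.len].to_vec();
        Slice { len: data.len(), offset: 0, buffer: Arc::new(data) }
    }
}

/// Every consumer reports the full buffer: shared buffers are counted repeatedly.
fn naive_total(slices: &[Slice]) -> usize {
    slices.iter().map(Slice::retained_size).sum()
}

/// Every consumer reports only its own range: retained bytes are undercounted.
fn slice_total(slices: &[Slice]) -> usize {
    slices.iter().map(Slice::slice_size).sum()
}

/// Count each distinct buffer once, identified by its allocation pointer.
fn deduplicated_total(slices: &[Slice]) -> usize {
    let mut seen = HashSet::new();
    slices
        .iter()
        .filter(|s| seen.insert(Arc::as_ptr(&s.buffer)))
        .map(Slice::retained_size)
        .sum()
}

fn main() {
    // One 1024-byte "batch" allocated upstream of any consumer.
    let batch = Arc::new(vec![0u8; 1024]);
    // Two consumers each hold a 16-byte slice of that batch.
    let held = vec![
        Slice { buffer: Arc::clone(&batch), offset: 0, len: 16 },
        Slice { buffer: Arc::clone(&batch), offset: 16, len: 16 },
    ];
    drop(batch); // the slices alone keep the buffer alive

    println!("naive:        {}", naive_total(&held)); // 2048
    println!("slice only:   {}", slice_total(&held)); // 32
    println!("deduplicated: {}", deduplicated_total(&held)); // 1024

    // Compacting trades a copy for exact ownership of the retained bytes.
    let compacted: Vec<Slice> = held.iter().map(Slice::compact).collect();
    drop(held); // the original 1024-byte buffer can now be freed
    println!("compacted:    {}", deduplicated_total(&compacted)); // 32
}
```

In this toy model, neither per-consumer number matches what is actually retained (1024 bytes): counting whole buffers overstates it, counting slices understates it, and only compaction makes the reported and retained sizes agree, at the cost of a copy.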