gabotechs commented on issue #16841:
URL: https://github.com/apache/datafusion/issues/16841#issuecomment-3112665952

   One idea @notfilippo mentioned is that arrow-rs could offer some kind of 
API for tracking allocations. Since arrow-rs is the one that knows when a buffer 
is allocated, when its reference count increases, and when it is freed, it may 
be a good candidate for offering an API that lets consumers track memory 
usage. I'm just floating the idea in case something sticks, but I imagine that 
going down that path would require a fair amount of non-trivial work.
   
   A more immediate problem, for which it would be nice to have a shorter-term 
solution, is double-counting memory. There are several places in the 
codebase that call `.get_array_memory_size()` on arrays backed by 
shared buffers, which can result in over-accounting. Some places I have 
identified:
   
   
https://github.com/apache/datafusion/blob/386985788a35e474b829d46c9dccdfd7c5117d98/datafusion/common/src/scalar/mod.rs#L3384-L3386
   
   
https://github.com/apache/datafusion/blob/47f75ef1205c0f6abb75add01888817b7270ede0/datafusion/functions-aggregate/src/array_agg.rs#L381
   
   
https://github.com/apache/datafusion/blob/d3cacac0181c2d235abeae71123cd27eb9e6976a/datafusion/physical-plan/src/joins/nested_loop_join.rs#L625
   
   
https://github.com/apache/datafusion/blob/d3cacac0181c2d235abeae71123cd27eb9e6976a/datafusion/physical-plan/src/joins/cross_join.rs#L202
   
   
https://github.com/apache/datafusion/blob/d3cacac0181c2d235abeae71123cd27eb9e6976a/datafusion/physical-plan/src/joins/sort_merge_join.rs#L818-L821
   
   
https://github.com/apache/datafusion/blob/d3cacac0181c2d235abeae71123cd27eb9e6976a/datafusion/physical-plan/src/joins/symmetric_hash_join.rs#L1158
   
   
https://github.com/apache/datafusion/blob/4bc66c80d560581f527dee5774fb4f0479786d3e/datafusion/physical-plan/src/repartition/mod.rs#L950
   
   Probably there are some more.
   
   I wonder if an acceptable short-term solution could be to account for the 
memory occupied by an `ArrayRef` using `.get_slice_memory_size()` rather than 
`.get_array_memory_size()`. It's still not going to be correct, but it might 
be less wrong than double-counting memory.
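
   To make the trade-off concrete, here is a minimal std-only sketch. It 
does not use arrow-rs: `SliceView`, `naive_total`, `slice_total`, 
`deduplicated_total` and `example_views` are hypothetical names that only 
model the idea of several array slices sharing one reference-counted buffer. 
Counting the whole buffer per slice stands in for `.get_array_memory_size()`, 
and counting only the viewed bytes stands in for `.get_slice_memory_size()`.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical stand-in for an array slice: a window over a shared buffer.
struct SliceView {
    buffer: Arc<Vec<u8>>,
    offset: usize,
    len: usize,
}

impl SliceView {
    /// The bytes this view actually covers.
    fn bytes(&self) -> &[u8] {
        &self.buffer[self.offset..self.offset + self.len]
    }
}

/// Models whole-buffer accounting: every view reports the full backing
/// buffer, so a buffer shared by N views is counted N times.
fn naive_total(views: &[SliceView]) -> usize {
    views.iter().map(|v| v.buffer.len()).sum()
}

/// Models slice-based accounting: every view reports only the bytes it covers.
fn slice_total(views: &[SliceView]) -> usize {
    views.iter().map(|v| v.bytes().len()).sum()
}

/// Models allocation tracking: each distinct buffer is counted exactly once,
/// identified by the address of the shared allocation.
fn deduplicated_total(views: &[SliceView]) -> usize {
    let mut seen = HashSet::new();
    views
        .iter()
        .filter(|v| seen.insert(Arc::as_ptr(&v.buffer)))
        .map(|v| v.buffer.len())
        .sum()
}

/// Four non-overlapping 256-byte views over one 1024-byte buffer.
fn example_views() -> Vec<SliceView> {
    let buffer = Arc::new(vec![0u8; 1024]);
    (0..4)
        .map(|i| SliceView {
            buffer: Arc::clone(&buffer),
            offset: i * 256,
            len: 256,
        })
        .collect()
}

fn main() {
    let views = example_views();
    println!("whole buffer per view: {}", naive_total(&views));
    println!("slice bytes per view:  {}", slice_total(&views));
    println!("each buffer once:      {}", deduplicated_total(&views));

    // Slice-based accounting can under-count: one surviving view still
    // keeps the entire 1024-byte buffer alive.
    let last = &views[3..];
    println!("one view, slice bytes: {}", slice_total(last));
    println!("one view, real buffer: {}", deduplicated_total(last));
}
```

   In this model the whole-buffer approach reports four times the real 
allocation. The slice-based approach matches it only while all four views 
are alive, and under-counts once some of them are dropped. That is the 
"still not correct, but less wrong" behaviour described above.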


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


