Re: [PR] [PoC] Add API for tracking distinct buffers in `MemoryPool` by reference count [datafusion]

via GitHub Tue, 10 Jun 2025 22:18:43 -0700


Dandandan commented on PR #16359:
URL: https://github.com/apache/datafusion/pull/16359#issuecomment-2961277605


   > Thank you, this solves the memory overcounting issue across batches. I got 
some suggestions/questions.
   > 
   > This new pool implementation might cause some issue for 
`MemoryReservation` API: If we first count the memory usage of several arrays 
by `grow_with_arrays()` interface, and then use the following functions to 
release (like `reservation.resize(0)`), memory used value will be subtracted, 
but the inner hash table won't be cleared
   > 
   > 
https://github.com/apache/datafusion/blob/f35416e0746556a6d70ba346e19d94b08d70a43a/datafusion/execution/src/memory_pool/mod.rs#L308-L314
   > 
   > 
   > 
https://github.com/apache/datafusion/blob/f35416e0746556a6d70ba346e19d94b08d70a43a/datafusion/execution/src/memory_pool/mod.rs#L345-L351
   > 
   > This causes a potential inconsistent state, maybe we can use another 
semantics for the `MemoryReservation` API:
   > 
   > * Memory for arrays can only be managed with 
`grow_with_arrays()/shrink_with_arrays()`
   > * Other interfaces like `shrink()/grow()` will used for other memory usage
   > 
   > And the implementation will be like
   > 
   > ```
   > #[derive(Debug)]
   > pub struct GreedyMemoryPoolWithTracking {
   >     pool_size: usize,
   >     used_others: AtomicUsize, // managed with non-array APIs
   >     
   >     used_array: AtomicUsize, // ref-counted with buffer addr stored in the 
hash table
   >     references: Mutex<HashMap<usize, usize>>,
   > }
   > ```
   
   Hi
   
   > Thank you, this solves the memory overcounting issue across batches. I got 
some suggestions/questions.
   > 
   > This new pool implementation might cause some issue for 
`MemoryReservation` API: If we first count the memory usage of several arrays 
by `grow_with_arrays()` interface, and then use the following functions to 
release (like `reservation.resize(0)`), memory used value will be subtracted, 
but the inner hash table won't be cleared
   > 
   > 
https://github.com/apache/datafusion/blob/f35416e0746556a6d70ba346e19d94b08d70a43a/datafusion/execution/src/memory_pool/mod.rs#L308-L314
   > 
   > 
   > 
https://github.com/apache/datafusion/blob/f35416e0746556a6d70ba346e19d94b08d70a43a/datafusion/execution/src/memory_pool/mod.rs#L345-L351
   > 
   > This causes a potential inconsistent state, maybe we can use another 
semantics for the `MemoryReservation` API:
   > 
   > * Memory for arrays can only be managed with 
`grow_with_arrays()/shrink_with_arrays()`
   > * Other interfaces like `shrink()/grow()` will used for other memory usage
   > 
   > And the implementation will be like
   > 
   > ```
   > #[derive(Debug)]
   > pub struct GreedyMemoryPoolWithTracking {
   >     pool_size: usize,
   >     used_others: AtomicUsize, // managed with non-array APIs
   >     
   >     used_array: AtomicUsize, // ref-counted with buffer addr stored in the 
hash table
   >     references: Mutex<HashMap<usize, usize>>,
   > }
   > ```
   
   Thanks for the feedback!
   
   Yes, tracking the size seperately makes sense, I'll  change that!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [PoC] Add API for tracking distinct buffers in `MemoryPool` by reference count [datafusion]

Reply via email to