Dandandan commented on PR #16359:
URL: https://github.com/apache/datafusion/pull/16359#issuecomment-2961277605

   > Thank you, this solves the memory overcounting issue across batches. I got 
some suggestions/questions.
   > 
   > This new pool implementation might cause some issue for 
`MemoryReservation` API: If we first count the memory usage of several arrays 
by `grow_with_arrays()` interface, and then use the following functions to 
release (like `reservation.resize(0)`), memory used value will be subtracted, 
but the inner hash table won't be cleared
   > 
   > 
https://github.com/apache/datafusion/blob/f35416e0746556a6d70ba346e19d94b08d70a43a/datafusion/execution/src/memory_pool/mod.rs#L308-L314
   > 
   > 
   > 
https://github.com/apache/datafusion/blob/f35416e0746556a6d70ba346e19d94b08d70a43a/datafusion/execution/src/memory_pool/mod.rs#L345-L351
   > 
   > This causes a potential inconsistent state, maybe we can use another 
semantics for the `MemoryReservation` API:
   > 
   > * Memory for arrays can only be managed with 
`grow_with_arrays()/shrink_with_arrays()`
   > * Other interfaces like `shrink()/grow()` will used for other memory usage
   > 
   > And the implementation will be like
   > 
   > ```
   > #[derive(Debug)]
   > pub struct GreedyMemoryPoolWithTracking {
   >     pool_size: usize,
   >     used_others: AtomicUsize, // managed with non-array APIs
   >     
   >     used_array: AtomicUsize, // ref-counted with buffer addr stored in the 
hash table
   >     references: Mutex<HashMap<usize, usize>>,
   > }
   > ```
   
   Hi
   
   > Thank you, this solves the memory overcounting issue across batches. I got 
some suggestions/questions.
   > 
   > This new pool implementation might cause some issue for 
`MemoryReservation` API: If we first count the memory usage of several arrays 
by `grow_with_arrays()` interface, and then use the following functions to 
release (like `reservation.resize(0)`), memory used value will be subtracted, 
but the inner hash table won't be cleared
   > 
   > 
https://github.com/apache/datafusion/blob/f35416e0746556a6d70ba346e19d94b08d70a43a/datafusion/execution/src/memory_pool/mod.rs#L308-L314
   > 
   > 
   > 
https://github.com/apache/datafusion/blob/f35416e0746556a6d70ba346e19d94b08d70a43a/datafusion/execution/src/memory_pool/mod.rs#L345-L351
   > 
   > This causes a potential inconsistent state, maybe we can use another 
semantics for the `MemoryReservation` API:
   > 
   > * Memory for arrays can only be managed with 
`grow_with_arrays()/shrink_with_arrays()`
   > * Other interfaces like `shrink()/grow()` will used for other memory usage
   > 
   > And the implementation will be like
   > 
   > ```
   > #[derive(Debug)]
   > pub struct GreedyMemoryPoolWithTracking {
   >     pool_size: usize,
   >     used_others: AtomicUsize, // managed with non-array APIs
   >     
   >     used_array: AtomicUsize, // ref-counted with buffer addr stored in the 
hash table
   >     references: Mutex<HashMap<usize, usize>>,
   > }
   > ```
   
   Thanks for the feedback!
   
   Yes, tracking the size seperately makes sense, I'll  change that!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to