Re: [PR] Rich t kid/implement multi dictionary aggr [datafusion]

via GitHub Mon, 22 Jun 2026 11:03:22 -0700


Rich-T-kid commented on PR #22983:
URL: https://github.com/apache/datafusion/pull/22983#issuecomment-4771299066


   # Still iterating on the PR; here are some ideas for speed-ups **(intern path)**
   ## use VecDeque<T> instead of Vec<T>
   
   - This should avoid the **O(n)** shift of the remaining elements that `Vec<T>::drain(..)` performs when draining from the front
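
A minimal sketch of the difference, assuming the buffer is drained from the front (the comment does not say which range is drained). The helper names `take_front_vec` and `take_front_deque` are hypothetical and are not taken from the PR.

```rust
use std::collections::VecDeque;

/// Remove the first `k` elements of a Vec.
/// When the drain is dropped, the surviving tail is moved down to index 0,
/// so the cost grows with the number of elements left behind.
fn take_front_vec(buf: &mut Vec<u32>, k: usize) -> Vec<u32> {
    let k = k.min(buf.len());
    buf.drain(..k).collect()
}

/// Same operation on a VecDeque.
/// A ring buffer can advance its head instead of moving the surviving
/// elements, so removing a prefix should cost O(k) rather than O(n).
fn take_front_deque(buf: &mut VecDeque<u32>, k: usize) -> Vec<u32> {
    let k = k.min(buf.len());
    buf.drain(..k).collect()
}
```

Both helpers return the same elements; only the work done on the elements that stay behind differs. Whether this matters for the intern path would need a benchmark.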
   
   ## adjust the row-tuple cache to be configurable 
   
   - Currently set to 10,000. Once the cache reaches this size, the elements already written remain there but are never read again.
   - Need to strike a balance between keeping a cache of per-batch keys and avoiding ever-growing memory in the worst case.
   - Stop inserting once the sparse-cache cap is reached, but keep looking up.
   - Maybe a heap of the top K elements? K could be set through SessionConfig.
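
The "stop inserting after the cap, but keep looking up" idea could be sketched roughly as below. `CappedCache`, its fields, and the `HashMap<Vec<usize>, usize>` layout are all hypothetical stand-ins for illustration; the PR's actual cache structure is not shown in this comment, and the cap here is a plain constructor argument rather than a real `SessionConfig` option.

```rust
use std::collections::HashMap;

/// Hypothetical row-tuple cache: maps a tuple of dictionary keys
/// to a group index, with a configurable upper bound on entries.
struct CappedCache {
    map: HashMap<Vec<usize>, usize>,
    cap: usize,
}

impl CappedCache {
    fn new(cap: usize) -> Self {
        Self { map: HashMap::new(), cap }
    }

    /// Lookups keep working after the cap has been reached.
    fn get(&self, key: &[usize]) -> Option<usize> {
        self.map.get(key).copied()
    }

    /// Store the entry only while the cache is below its cap.
    /// Existing keys may still be updated. Returns whether it was stored.
    fn insert(&mut self, key: &[usize], group: usize) -> bool {
        if self.map.len() >= self.cap && !self.map.contains_key(key) {
            return false;
        }
        self.map.insert(key.to_vec(), group);
        true
    }
}
```

With this shape memory is bounded by `cap` entries, while hot keys that were seen early still hit the cache. A top-K heap would additionally let newer hot keys displace cold ones, at the cost of tracking hit counts.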
   
   ## avoid the doubled size currently spent on `Option<usize>`
   
   - Options that wrap types without a niche (such as plain integers) need extra space for the discriminant: `Option<usize>` takes up 16 bytes instead of just 8 on 64-bit targets.
   - Maybe use `Option<NonZeroUsize>` 🤔, which is the same size as `usize`
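
A small sketch of the size difference and one way to keep index 0 representable, assuming the stored values are group indices. The helpers `encode`, `decode`, and `sizes` are hypothetical names for illustration; the shift-by-one scheme is one possible encoding, not necessarily what the PR will use.

```rust
use std::mem::size_of;
use std::num::NonZeroUsize;

/// Returns (size of Option<usize>, size of Option<NonZeroUsize>) in bytes.
fn sizes() -> (usize, usize) {
    (size_of::<Option<usize>>(), size_of::<Option<NonZeroUsize>>())
}

/// Store `index + 1` so that index 0 is still representable.
/// Returns None for usize::MAX, which cannot be shifted by one.
fn encode(index: usize) -> Option<NonZeroUsize> {
    index.checked_add(1).and_then(NonZeroUsize::new)
}

/// Recover the original index from an encoded slot.
fn decode(slot: Option<NonZeroUsize>) -> Option<usize> {
    slot.map(|n| n.get() - 1)
}
```

`Option<NonZeroUsize>` uses the all-zero bit pattern for `None`, so it needs no separate discriminant. The trade-off is the extra add and subtract on every read and write.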
   
   ## bitpack key-tuples cache
   
   - Instead of using a slice of usize (u64), bit-pack each key tuple into a single u64 or u128.
   - Need to do more research on this.
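
One possible shape for the bit-packing idea, sketched under the assumption that every key in a tuple fits in a fixed number of bits and that the whole tuple fits in 64 bits. `pack_keys` and `unpack_keys` are hypothetical helpers, and the fixed-width layout is an assumption; the comment leaves the actual scheme open.

```rust
/// Pack dictionary keys into one u64, using `bits` bits per key.
/// Returns None if `bits` is outside 1..=32, if the tuple needs more
/// than 64 bits, or if any key does not fit in its field.
fn pack_keys(keys: &[u32], bits: u32) -> Option<u64> {
    if bits == 0 || bits > 32 {
        return None;
    }
    if (keys.len() as u64).saturating_mul(u64::from(bits)) > 64 {
        return None;
    }
    let mut packed = 0u64;
    for &k in keys {
        if bits < 32 && (k >> bits) != 0 {
            return None; // key is too wide for its field
        }
        packed = (packed << bits) | u64::from(k);
    }
    Some(packed)
}

/// Inverse of `pack_keys` for a tuple of `n` keys.
/// Panics if `bits` is outside 1..=32.
fn unpack_keys(mut packed: u64, n: usize, bits: u32) -> Vec<u32> {
    assert!((1..=32).contains(&bits));
    let mask = (1u64 << bits) - 1;
    let mut out = vec![0u32; n];
    // The last key was packed into the lowest bits, so fill from the back.
    for slot in out.iter_mut().rev() {
        *slot = (packed & mask) as u32;
        packed >>= bits;
    }
    out
}
```

A packed `u64` can be hashed and compared as a single integer, which is the appeal over a slice. Tuples that do not fit would need a fallback path, which this sketch signals by returning `None`.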


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


