Re: [PR] Optimize Dictionary groupings [datafusion]

via GitHub Tue, 21 Apr 2026 20:57:39 -0700


Rich-T-kid commented on PR #21765:
URL: https://github.com/apache/datafusion/pull/21765#issuecomment-4293402012


   **Speed up**: 24
   **Slower**: 6
   **No change**: 36
   The last optimization that comes to mind is updating **normalize_dict_hash()** 
to eliminate per-insert allocations.
   Currently, **normalize_dict_hash()** creates a **HashMap<Vec<u8>, usize>** 
where each unique key's raw bytes are heap allocated via **.to_vec()** before 
insertion. This allocation occurs once per unique value but is unnecessary 
since the underlying Arrow buffer already owns the bytes.
   The plan is to pre-compute a **Vec<Option<Cow<[u8]>>>** of raw byte slices 
for all accessed values upfront, allowing the hashmap to store **&[u8]** 
references instead of owned **Vec<u8>** keys, eliminating the per-insert 
allocations entirely. 
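
   A minimal, standard-library-only sketch of the two shapes being compared. 
Everything here is a hypothetical stand-in: `Values` imitates an Arrow 
variable-width values buffer (one byte buffer plus offsets), and 
`normalize_owned` / `normalize_borrowed` are illustrative names, not the actual 
**normalize_dict_hash()** code or the arrow-rs API. It assumes normalization 
means mapping each key to the first-seen id of its value's bytes.

```rust
use std::borrow::Cow;
use std::collections::HashMap;

/// Hypothetical stand-in for an Arrow values buffer: contiguous bytes + offsets.
struct Values {
    data: Vec<u8>,
    offsets: Vec<usize>,
}

impl Values {
    fn from_strs(items: &[&str]) -> Self {
        let mut data = Vec::new();
        let mut offsets = vec![0];
        for s in items {
            data.extend_from_slice(s.as_bytes());
            offsets.push(data.len());
        }
        Values { data, offsets }
    }

    fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Borrow the raw bytes of value `i` without copying.
    fn value(&self, i: usize) -> &[u8] {
        &self.data[self.offsets[i]..self.offsets[i + 1]]
    }
}

/// Owned-key shape: one `.to_vec()` heap allocation per unique value.
fn normalize_owned(values: &Values, keys: &[Option<usize>]) -> Vec<Option<usize>> {
    let mut seen: HashMap<Vec<u8>, usize> = HashMap::new();
    let mut out = Vec::with_capacity(keys.len());
    for &k in keys {
        out.push(k.map(|i| {
            let bytes = values.value(i);
            if let Some(&id) = seen.get(bytes) {
                id
            } else {
                let id = seen.len();
                seen.insert(bytes.to_vec(), id); // allocation happens here
                id
            }
        }));
    }
    out
}

/// Borrowed-key shape: an upfront pass caches slices, the map stores `&[u8]`.
fn normalize_borrowed(values: &Values, keys: &[Option<usize>]) -> Vec<Option<usize>> {
    // Upfront pass: cache a borrowed slice for every value index actually used.
    let mut cache: Vec<Option<Cow<[u8]>>> = vec![None; values.len()];
    for &i in keys.iter().flatten() {
        if cache[i].is_none() {
            cache[i] = Some(Cow::Borrowed(values.value(i)));
        }
    }

    // The map borrows from the cache, so inserts never copy the key bytes.
    let mut seen: HashMap<&[u8], usize> = HashMap::new();
    let mut out = Vec::with_capacity(keys.len());
    for &k in keys {
        let id = match k {
            Some(i) => {
                let bytes: &[u8] = cache[i].as_deref().expect("cached in upfront pass");
                let next = seen.len();
                Some(*seen.entry(bytes).or_insert(next))
            }
            None => None,
        };
        out.push(id);
    }
    out
}

fn main() {
    // The dictionary holds "a" twice; both keys should normalize to one id.
    let values = Values::from_strs(&["a", "b", "a", "c"]);
    let keys = [Some(0), Some(2), None, Some(1), Some(3), Some(0)];
    println!("{:?}", normalize_owned(&values, &keys));
    println!("{:?}", normalize_borrowed(&values, &keys));
}
```

   Both functions return the same ids; the difference is only where the key 
bytes live, which is the trade-off the benchmark would need to settle.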
   
   What still needs to be measured is whether eliminating the 
**.to_vec()** allocations outweighs the cost of the upfront pass over the 
accessed value indices to build the slice cache. I'll implement both 
approaches, benchmark them, and push the faster version.
   
   That said, I think these results show a generally large 
improvement over the current approach to handling dictionary-encoded 
columns in DataFusion. @alamb 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


