Rich-T-kid commented on PR #21765: URL: https://github.com/apache/datafusion/pull/21765#issuecomment-4293402012
**Speed up**: 24 **Slower**: 6 **No change**: 36

The last optimization that comes to mind is updating **normalize_dict_hash()** to eliminate per-insert allocations. Currently, **normalize_dict_hash()** builds a **HashMap<Vec<u8>, usize>** in which each unique key's raw bytes are heap-allocated via **.to_vec()** before insertion. This allocation happens once per unique value, but it is unnecessary because the underlying Arrow buffer already owns the bytes. The plan is to precompute a **Vec<Option<Cow<[u8]>>>** of raw byte slices for all accessed values up front, so the hashmap can store **&[u8]** references instead of owned **Vec<u8>** keys, eliminating the per-insert allocations entirely.

I still need to benchmark this to see whether eliminating the **.to_vec()** allocations outweighs the cost of the upfront pass over the accessed value indices to build the slice cache. I'll implement both approaches, benchmark them, and push the faster version.

With that said, I think these results show a generally large improvement over the current approach to handling dictionary-encoded columns in DataFusion. @alamb

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
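To make the owned-key versus borrowed-key idea from the comment concrete, here is a minimal sketch. It is not the code in the PR: `Values`, `canonical_ids_owned`, and `canonical_ids_borrowed` are hypothetical names, `Values` is a plain-Rust stand-in for an Arrow values buffer, and the assumption that the function assigns one canonical id per distinct value is mine. Because the stand-in can hand out slices directly, the sketch skips the **Vec<Option<Cow<[u8]>>>** cache and only shows the change of hashmap key type.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an Arrow variable-length values buffer:
/// value `i` occupies `data[offsets[i]..offsets[i + 1]]`.
struct Values {
    data: Vec<u8>,
    offsets: Vec<usize>,
}

impl Values {
    fn from_strs(items: &[&str]) -> Self {
        let mut data = Vec::new();
        let mut offsets = vec![0];
        for s in items {
            data.extend_from_slice(s.as_bytes());
            offsets.push(data.len());
        }
        Values { data, offsets }
    }

    /// Borrow the raw bytes of value `i` without copying.
    fn bytes(&self, i: usize) -> &[u8] {
        &self.data[self.offsets[i]..self.offsets[i + 1]]
    }
}

/// Owned keys: each newly seen value is copied into a fresh `Vec<u8>`.
fn canonical_ids_owned(values: &Values, keys: &[usize]) -> Vec<usize> {
    let mut seen: HashMap<Vec<u8>, usize> = HashMap::new();
    keys.iter()
        .map(|&k| {
            let bytes = values.bytes(k);
            if let Some(&id) = seen.get(bytes) {
                id
            } else {
                let id = seen.len();
                // One heap allocation per unique value.
                seen.insert(bytes.to_vec(), id);
                id
            }
        })
        .collect()
}

/// Borrowed keys: the map stores slices that point into `values`,
/// so inserting a key does not copy the value's bytes.
fn canonical_ids_borrowed(values: &Values, keys: &[usize]) -> Vec<usize> {
    let mut seen: HashMap<&[u8], usize> = HashMap::new();
    keys.iter()
        .map(|&k| {
            let next = seen.len();
            *seen.entry(values.bytes(k)).or_insert(next)
        })
        .collect()
}
```

For values `["a", "b", "a", "c"]` and keys `[0, 1, 2, 3, 1]`, both functions return `[0, 1, 0, 2, 1]`. The borrowed version only compiles because the map cannot outlive `values`, which is the same constraint the real change would place on the Arrow buffer.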
