alamb commented on issue #7064: URL: https://github.com/apache/arrow-datafusion/issues/7064#issuecomment-1663959700
> anything I missed here?

I think that would get most of the benefit.

Another potential optimization would be to use the 'small string optimization', so that the hash table comparison could be done inline most of the time without consulting the actual string values.

So in the hash table, store not only the group_index but also another 12 bytes:

```
0-3: length of the string group key value (u32)
4-7: first four bytes of the string value itself (u32)
8-11: offset into string buffer (u32)
```

That way group key comparisons are faster because:
1. If the first 8 bytes (length plus four-byte prefix) are different, you know the group value is different
2. We can check the actual values in the string builder without an extra level of indirection through the offset buffer

Perhaps we could learn from the `View` implementation @tustvold was working on here https://github.com/apache/arrow-rs/pull/4585/files#diff-694565dedb86d29ae2474ae09d51867a98a534543a45d79fcc3506b2958b73baR26
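To make the 12-byte layout above concrete, here is a minimal sketch in Rust. It is not DataFusion's implementation: `GroupEntry`, `GroupValues`, `prefix_of`, and `find_or_insert` are hypothetical names, the prefix is assumed to be zero-padded for strings shorter than four bytes, and a linear scan stands in for the real hash table probe.

```rust
/// Hypothetical hash table entry: the group index plus the 12 bytes
/// described above (length, four-byte prefix, offset into the buffer).
#[derive(Debug, Clone, Copy)]
struct GroupEntry {
    group_index: usize,
    len: u32,    // bytes 0-3: length of the string group key value
    prefix: u32, // bytes 4-7: first four bytes of the value (zero-padded, an assumption)
    offset: u32, // bytes 8-11: offset into the string buffer
}

/// Pack up to the first four bytes of `value` into a u32, zero-padded.
fn prefix_of(value: &[u8]) -> u32 {
    let mut buf = [0u8; 4];
    let n = value.len().min(4);
    buf[..n].copy_from_slice(&value[..n]);
    u32::from_le_bytes(buf)
}

/// Hypothetical store of group keys: one contiguous byte buffer plus entries.
struct GroupValues {
    buffer: Vec<u8>,
    entries: Vec<GroupEntry>,
}

impl GroupValues {
    fn new() -> Self {
        Self { buffer: Vec::new(), entries: Vec::new() }
    }

    /// Inline check first (length and prefix), buffer access only if needed.
    fn entry_matches(&self, entry: &GroupEntry, probe: &[u8]) -> bool {
        if entry.len as usize != probe.len() || entry.prefix != prefix_of(probe) {
            return false; // the first 8 bytes differ, so the values differ
        }
        if probe.len() <= 4 {
            return true; // length plus zero-padded prefix fully determine the value
        }
        // Go straight to the bytes via the stored offset; no offsets array lookup.
        let start = entry.offset as usize;
        let end = start + entry.len as usize;
        &self.buffer[start..end] == probe
    }

    /// Return the group index for `value`, inserting it if it is new.
    /// A linear scan stands in for hash table probing in this sketch.
    fn find_or_insert(&mut self, value: &[u8]) -> usize {
        for entry in &self.entries {
            if self.entry_matches(entry, value) {
                return entry.group_index;
            }
        }
        let entry = GroupEntry {
            group_index: self.entries.len(),
            len: value.len() as u32,
            prefix: prefix_of(value),
            offset: self.buffer.len() as u32,
        };
        self.buffer.extend_from_slice(value);
        self.entries.push(entry);
        entry.group_index
    }
}
```

With this sketch, calling `find_or_insert` on `b"apple"`, `b"apply"`, and `b"apple"` in turn returns group indices 0, 1, and 0: the second key shares the length and prefix of the first, so only that comparison has to read the string buffer.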
