[GitHub] [arrow-datafusion] alamb commented on issue #7064: Improve aggregate performance by special casing single string group by

via GitHub Thu, 03 Aug 2023 06:11:35 -0700


alamb commented on issue #7064:
URL: 
https://github.com/apache/arrow-datafusion/issues/7064#issuecomment-1663959700


   > anything I missed here?
   
   I think that would get most of the benefit. 
   
   Another potential optimization would be to use the 'small string 
optimization' so that most hash table comparisons can be done inline, 
without consulting the actual string values.
   
   So the hash table would store not only the group_index but also another 
12 bytes:
   
   ```
   0-3: length of the string group key value (u32)
   4-7: first four bytes of the string value itself (u32)
   8-11: offset into string buffer (u32)
   ```
   
   That way group key comparisons are faster because:
   1. If the first 8 bytes (the length plus the first four bytes of the 
value) are different, you know the group value is different
   2. We can check the actual values in the string builder without an extra 
level of indirection through the offset buffer
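
The layout and comparison described above could be sketched roughly as follows. This is only an illustration, not DataFusion code: `InlineKey`, `prefix_of`, `push_value` and `key_matches` are hypothetical names, the prefix is assumed to be zero padded for strings shorter than four bytes, and strings and the buffer are assumed to stay under 4 GiB so the `u32` casts do not truncate.

```rust
/// Hypothetical 12-byte inline key that would be stored in the hash table
/// next to the group_index.
#[repr(C)]
#[derive(Clone, Copy, Debug)]
struct InlineKey {
    /// bytes 0-3: length of the string group key value
    len: u32,
    /// bytes 4-7: first four bytes of the string value (zero padded)
    prefix: u32,
    /// bytes 8-11: offset of the value in the string buffer
    offset: u32,
}

/// Packs up to the first four bytes of `value` into a u32, zero padded.
fn prefix_of(value: &[u8]) -> u32 {
    let mut bytes = [0u8; 4];
    let n = value.len().min(4);
    bytes[..n].copy_from_slice(&value[..n]);
    u32::from_le_bytes(bytes)
}

/// Appends `value` to the string buffer and returns its inline key.
/// Assumes the buffer and the value are both smaller than 4 GiB.
fn push_value(buffer: &mut Vec<u8>, value: &str) -> InlineKey {
    let offset = buffer.len() as u32;
    buffer.extend_from_slice(value.as_bytes());
    InlineKey {
        len: value.len() as u32,
        prefix: prefix_of(value.as_bytes()),
        offset,
    }
}

/// Returns true if the stored key equals `probe`.
fn key_matches(key: &InlineKey, buffer: &[u8], probe: &str) -> bool {
    let probe = probe.as_bytes();
    // Fast path: a difference in the first 8 bytes (length + prefix)
    // proves the values differ without touching the string buffer.
    if key.len != probe.len() as u32 || key.prefix != prefix_of(probe) {
        return false;
    }
    // Strings of at most four bytes are fully described by length + prefix.
    if probe.len() <= 4 {
        return true;
    }
    // Slow path: read the stored bytes directly via the inline offset,
    // with no lookup in a separate offsets buffer.
    let start = key.offset as usize;
    &buffer[start..start + probe.len()] == probe
}
```

In this sketch only keys that share both length and four-byte prefix fall through to the buffer comparison, so most mismatches during probing would be rejected from the 8 inline bytes alone.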
   
   
   Perhaps we could learn from the `View` implementation @tustvold was working 
on here
   
   
https://github.com/apache/arrow-rs/pull/4585/files#diff-694565dedb86d29ae2474ae09d51867a98a534543a45d79fcc3506b2958b73baR26
   
   
   

