[GitHub] [arrow-datafusion] alamb opened a new issue #850: Optimize hash_aggregate when there are no null group keys

GitBox Tue, 10 Aug 2021 14:23:59 -0700


alamb opened a new issue #850:
URL: https://github.com/apache/arrow-datafusion/issues/850



   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   The code in hash_aggregate.rs is general and works for data with and without 
nulls. However, that generality leaves room for optimization. One such 
optimization, suggested by @andygrove and @Dandandan on 
https://github.com/apache/arrow-datafusion/pull/844#discussion_r686066032, 
is to add a specialized code path for input whose group columns contain no 
NULL values, which avoids the cost of a null check on each group key.
   
   While this might sound trivial, the null check is on the hot path (done for 
every single row that is grouped), so removing it may improve performance by a 
measurable amount.
   
   **Describe the solution you'd like**
   1. A new function or parameter in `ScalarValue::eq_array` (e.g. 
`ScalarValue::eq_array_non_null`) that assumes the input has no nulls and does 
not check `Array::is_valid`
   2. A check in hash_aggregate that tests whether the null count in all group 
columns is 0 and, if so, invokes the specialized 
`ScalarValue::eq_array_non_null`
   3. Some sort of performance benchmark results showing that it improves 
grouping performance (there is a list of benchmarks on #808 that might be able 
to inspire you)
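
   The first two steps might look roughly like the sketch below. It is only an 
illustration using the standard library: `Int64Column`, `eq_at`, 
`eq_at_non_null` and `count_matches` are hypothetical stand-ins, not the Arrow 
or DataFusion API, and the real change would live in `ScalarValue` and 
hash_aggregate.rs. The point is that the null count is checked once per batch, 
so the per-row comparison on the fast path never touches the validity mask.

```rust
/// Hypothetical stand-in for an Arrow Int64 array: values plus an optional
/// validity mask (`None` means every slot is valid).
pub struct Int64Column {
    values: Vec<i64>,
    validity: Option<Vec<bool>>,
}

impl Int64Column {
    /// Build a column; null slots store 0 and are marked invalid in the mask.
    pub fn from_options(data: &[Option<i64>]) -> Self {
        let values = data.iter().map(|v| v.unwrap_or(0)).collect();
        let validity = if data.iter().any(|v| v.is_none()) {
            Some(data.iter().map(|v| v.is_some()).collect())
        } else {
            None
        };
        Self { values, validity }
    }

    pub fn len(&self) -> usize {
        self.values.len()
    }

    /// Number of null slots in the column.
    pub fn null_count(&self) -> usize {
        match &self.validity {
            Some(mask) => mask.iter().filter(|valid| !**valid).count(),
            None => 0,
        }
    }

    /// General comparison: consults the validity mask on every call.
    pub fn eq_at(&self, key: Option<i64>, row: usize) -> bool {
        let is_valid = self.validity.as_ref().map_or(true, |mask| mask[row]);
        match key {
            Some(k) => is_valid && self.values[row] == k,
            None => !is_valid,
        }
    }

    /// Specialized comparison: the caller guarantees `null_count() == 0`,
    /// so no validity check is performed.
    pub fn eq_at_non_null(&self, key: i64, row: usize) -> bool {
        self.values[row] == key
    }
}

/// Count the rows equal to `key`, choosing the code path once per column
/// rather than once per row.
pub fn count_matches(col: &Int64Column, key: Option<i64>) -> usize {
    if col.null_count() == 0 {
        match key {
            // Fast path: no nulls in the column, so skip validity checks.
            Some(k) => (0..col.len())
                .filter(|&row| col.eq_at_non_null(k, row))
                .count(),
            // A null key cannot match anything in a null-free column.
            None => 0,
        }
    } else {
        (0..col.len()).filter(|&row| col.eq_at(key, row)).count()
    }
}
```

   With several group columns, the fast path would only apply when every 
column reports a null count of 0. Whether the saved branch is measurable is 
exactly what the benchmark in step 3 would need to show.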
   
   **Describe alternatives you've considered**
   The performance benefit may not be worth the additional code complexity, but 
we won't know until we try
   
   

