[ 
https://issues.apache.org/jira/browse/ARROW-11300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11300:
-----------------------------------
    Labels: pull-request-available  (was: )

> [Rust][DataFusion] Improve hash aggregate performance with large number of 
> groups in 
> -------------------------------------------------------------------------------------
>
>                 Key: ARROW-11300
>                 URL: https://issues.apache.org/jira/browse/ARROW-11300
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust - DataFusion
>            Reporter: Daniël Heres
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2021-01-18-13-00-36-685.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, hash aggregates are performing well when having a small number of 
> output groups, but the results on db-benchmark 
> [https://github.com/h2oai/db-benchmark/pull/182] test on data with a high 
> number of output groups.
> [https://github.com/apache/arrow/pull/9234] improved the situation a bit, but 
> DataFusion is still much slower than even the slowest result when comparing 
> to the published results.
> This seems mostly having to do with the way we use individual key/groups.
> For each new key, we _take_ the indices of the group, resulting in lots of 
> small allocations and cache unfriendliness and other overhead if we have many 
> keys with only a small (just 1-2) number of rows per group in a batch. Also 
> the indices are converted from a Vec to an Array, making the situation worse 
> (accounts for ~22% of the instructions on the master branch!), other 
> profiling results seem to be from related allocations too.
> To make it efficient for tiny groups, we should probably change the hash 
> aggregate algorithm to _take_ based on _all_ indices from the batch in one 
> go, and "slice" into the resulting array for the individual accumulators.
>  
> Here is some profiling info of the db-benchmark questions 1-5 against master:
> !image-2021-01-18-13-00-36-685.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to