[
https://issues.apache.org/jira/browse/ARROW-11112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17332337#comment-17332337
]
Andrew Lamb commented on ARROW-11112:
-------------------------------------
Migrated to GitHub: https://github.com/apache/arrow-datafusion/issues/142
> [Rust][DataFusion] Implement vectorized hashing
> -----------------------------------------------
>
> Key: ARROW-11112
> URL: https://issues.apache.org/jira/browse/ARROW-11112
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust - DataFusion
> Reporter: Daniël Heres
> Priority: Major
>
> Currently, the join and hash aggregate operators build a key individually
> from each row's values. This is far from ideal: it doesn't exploit the
> cache-friendly, vectorized nature of Arrow, and instead copies data into a
> `Vec`, traverses multiple arrays in the inner loop, etc.
> This blog post has a summary of an approach to do this in a vectorized way.
> [https://www.cockroachlabs.com/blog/vectorized-hash-joiner/]
>
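To make the column-at-a-time idea concrete, here is a minimal sketch that uses only the Rust standard library. It is not DataFusion's code and not the approach from the linked post. Plain slices stand in for Arrow arrays, and the names `hash_one`, `combine`, `hash_column` and `hash_rows` are hypothetical, as is the simple multiply-and-add used to combine hashes. The point is the loop structure: each column is traversed once and folded into a per-row hash buffer, so no per-row key is built.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash a single value with the std default hasher (illustrative choice).
fn hash_one<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Combine a value hash into the running row hash.
/// Assumption: a simple multiply-and-add; a real implementation
/// would likely pick a stronger mixing function.
fn combine(acc: u64, h: u64) -> u64 {
    acc.wrapping_mul(31).wrapping_add(h)
}

/// Fold one whole column into the per-row hash buffer.
/// The inner loop touches exactly one column and one output buffer.
fn hash_column<T: Hash>(column: &[T], hashes: &mut [u64]) {
    assert_eq!(column.len(), hashes.len());
    for (acc, value) in hashes.iter_mut().zip(column) {
        *acc = combine(*acc, hash_one(value));
    }
}

/// Hash the rows of a two-column batch, one column at a time.
fn hash_rows(a: &[i64], b: &[&str]) -> Vec<u64> {
    let mut hashes = vec![0u64; a.len()];
    hash_column(a, &mut hashes);
    hash_column(b, &mut hashes);
    hashes
}
```

Calling `hash_rows(&[1, 2, 1], &["x", "y", "x"])` returns one hash per row, and rows 0 and 2 get the same hash because their values are equal in every column.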
> TBD:
> We should decide/find out whether it still makes sense to use the Rust
> `HashMap` (with `()` as key?) or whether to create our own. The benefit of
> using `HashMap` is that it has an existing API, resizes automatically, uses
> SIMD, and also exposes some lower-level bits we can use here.
>
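One way to explore the `HashMap` question with the standard library alone is sketched below. It does not use `()` as key, which would need lower-level map APIs. Instead it swaps in a simpler idea: key the map on the precomputed row hash and plug in a pass-through hasher so the map does not hash the key a second time. `PassThroughHasher`, `RowIndexMap` and `build_index` are hypothetical names for illustration only.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// Hasher that returns the u64 key unchanged, since the key is already a hash.
#[derive(Default)]
struct PassThroughHasher(u64);

impl Hasher for PassThroughHasher {
    fn finish(&self) -> u64 {
        self.0
    }

    /// Fallback for non-u64 input; not expected to be hit in this sketch.
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 = (self.0 << 8) | u64::from(b);
        }
    }

    /// u64 keys arrive here, so the precomputed hash is stored as-is.
    fn write_u64(&mut self, value: u64) {
        self.0 = value;
    }
}

/// Maps a precomputed row hash to the indices of the rows that produced it.
type RowIndexMap = HashMap<u64, Vec<usize>, BuildHasherDefault<PassThroughHasher>>;

/// Build the hash -> row indices map from a buffer of per-row hashes.
fn build_index(hashes: &[u64]) -> RowIndexMap {
    let mut map = RowIndexMap::default();
    for (row, &hash) in hashes.iter().enumerate() {
        map.entry(hash).or_default().push(row);
    }
    map
}
```

Rows that share a hash are only candidates for equality. A join or aggregate built on this would still need to compare the actual column values to rule out collisions.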