[GitHub] [arrow-datafusion] Dandandan opened a new issue #50: Hash join further optimization / vectorization

GitBox Sat, 24 Apr 2021 13:05:10 -0700


Dandandan opened a new issue #50:
URL: https://github.com/apache/arrow-datafusion/issues/50



   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   Further optimize the hash join algorithm
   
   **Describe the solution you'd like**
   There are a couple of optimizations we could implement:
   
   * Vectorize the row-equality check, which currently uses the `equal_rows` 
functions. A vectorized version should be faster, and we could also 
specialize it for batches that contain no nulls. We can probably use the 
`take` and `equals` kernels here.
   * Replace the `HashMap` with a `Vec` (or similar) that has a certain number 
of buckets. I tried this before, but it caused many more collisions than we 
have currently, which resulted in a big (3x) slowdown.
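
To make the first idea concrete, here is a minimal sketch of a column-at-a-time equality check in plain Rust. It is not DataFusion code: the `take` and `eq` helpers below are hypothetical stand-ins for the Arrow kernels, `equal_rows_vectorized` is an illustrative name, and the sketch assumes non-null `i64` key columns only. Given candidate row pairs from the hash lookup, it gathers the key values per column and compares whole columns, instead of comparing across columns one row at a time.

```rust
/// Gather `values[i]` for each index (stand-in for a `take` kernel).
fn take(values: &[i64], indices: &[usize]) -> Vec<i64> {
    indices.iter().map(|&i| values[i]).collect()
}

/// Element-wise equality of two equally long slices (stand-in for an
/// equality kernel). Assumes there are no nulls.
fn eq(left: &[i64], right: &[i64]) -> Vec<bool> {
    left.iter().zip(right).map(|(l, r)| l == r).collect()
}

/// For each candidate pair `(left_idx[k], right_idx[k])`, report whether
/// all key columns match. Works one column at a time.
fn equal_rows_vectorized(
    left_cols: &[Vec<i64>],
    right_cols: &[Vec<i64>],
    left_idx: &[usize],
    right_idx: &[usize],
) -> Vec<bool> {
    assert_eq!(left_cols.len(), right_cols.len());
    assert_eq!(left_idx.len(), right_idx.len());

    // Start with "all pairs match" and AND in each column's result.
    let mut mask = vec![true; left_idx.len()];
    for (l, r) in left_cols.iter().zip(right_cols) {
        let col_eq = eq(&take(l, left_idx), &take(r, right_idx));
        for (m, e) in mask.iter_mut().zip(col_eq) {
            *m &= e;
        }
    }
    mask
}

fn main() {
    // Two key columns on each side of the join.
    let left: Vec<Vec<i64>> = vec![vec![1, 2, 3], vec![10, 20, 30]];
    let right: Vec<Vec<i64>> = vec![vec![3, 2, 1], vec![30, 20, 99]];

    // Candidate pairs (left row, right row): (0, 2), (1, 1), (2, 0).
    let mask = equal_rows_vectorized(&left, &right, &[0, 1, 2], &[2, 1, 0]);
    println!("{:?}", mask); // [false, true, true]
}
```

The inner loops are simple and branch-free over contiguous buffers, which is the shape compilers tend to auto-vectorize. A real implementation would also have to handle null bitmaps and other data types.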
   
   **Additional context**
   
   https://www.cockroachlabs.com/blog/vectorized-hash-joiner/
   https://dare.uva.nl/search?identifier=5ccbb60a-38b8-4eeb-858a-e7735dd37487


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
