Dandandan opened a new issue #50:
URL: https://github.com/apache/arrow-datafusion/issues/50


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   Further optimize the hash join algorithm
   
   **Describe the solution you'd like**
   There are a couple of optimizations we could implement:
   
   * Vectorize the row-equality check, which currently relies on `equal_rows`. 
Comparing many rows at once should speed this up, and we could also add a 
specialized path for batches that contain no nulls. We can probably use the 
`take` and `equals` kernels here.
   * Replace the `HashMap` with a `Vec` (or similar) with a certain number of 
buckets. I tried this before, but because it causes many more collisions than 
we currently have, it results in a big (3x) slowdown.
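
   As a rough illustration of the first idea, here is a minimal sketch that uses plain Rust vectors instead of Arrow arrays. The names `Column`, `take`, `eq`, `equal_rows_vectorized` and `equal_rows_non_null` are hypothetical stand-ins, not the DataFusion or Arrow API. The sketch assumes `i64` keys and treats a null as never matching.

```rust
/// Hypothetical stand-in for one Arrow key column: `None` marks a null.
type Column = Vec<Option<i64>>;

/// Gather the values at `indices`, mirroring what a `take` kernel does.
fn take(column: &[Option<i64>], indices: &[usize]) -> Column {
    indices.iter().map(|&i| column[i]).collect()
}

/// Element-wise equality over two gathered columns.
/// Assumption: a null never matches anything, as in SQL join semantics.
fn eq(left: &[Option<i64>], right: &[Option<i64>]) -> Vec<bool> {
    left.iter()
        .zip(right)
        .map(|(l, r)| matches!((l, r), (Some(a), Some(b)) if a == b))
        .collect()
}

/// Check all candidate pairs at once, one key column at a time.
/// `build_idx[k]` and `probe_idx[k]` form the k-th candidate pair.
fn equal_rows_vectorized(
    build_cols: &[Column],
    probe_cols: &[Column],
    build_idx: &[usize],
    probe_idx: &[usize],
) -> Vec<bool> {
    let mut mask = vec![true; build_idx.len()];
    for (b, p) in build_cols.iter().zip(probe_cols) {
        let col_eq = eq(&take(b, build_idx), &take(p, probe_idx));
        // A pair stays a match only if every key column agrees.
        for (m, e) in mask.iter_mut().zip(col_eq) {
            *m &= e;
        }
    }
    mask
}

/// Specialized path for columns known to contain no nulls:
/// the inner loop has no `Option` checks.
fn equal_rows_non_null(
    build_cols: &[Vec<i64>],
    probe_cols: &[Vec<i64>],
    build_idx: &[usize],
    probe_idx: &[usize],
) -> Vec<bool> {
    let mut mask = vec![true; build_idx.len()];
    for (b, p) in build_cols.iter().zip(probe_cols) {
        for ((m, &bi), &pi) in mask.iter_mut().zip(build_idx).zip(probe_idx) {
            *m &= b[bi] == p[pi];
        }
    }
    mask
}
```

   The point of the column-at-a-time layout is that each inner loop touches a single type and a single pair of buffers, which is the shape that Arrow kernels are built for. A real implementation would need to handle every supported key type, not only `i64`.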
   
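   For the second idea in the list above, a bucket table could look roughly like the sketch below. It is not the experiment described in this issue, only an illustration under assumptions: `BucketTable`, `hash_key`, `build` and `probe` are hypothetical names, keys are single `i64` values, and the bucket count is a power of two chosen up front.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical bucket table: a flat `Vec` of buckets, each holding
/// build-side row indices. The bucket count never changes after `build`.
struct BucketTable {
    buckets: Vec<Vec<usize>>,
}

/// Hash a single key with the standard library's default hasher.
fn hash_key(key: i64) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish()
}

impl BucketTable {
    /// Insert every build-side row into the bucket selected by its key hash.
    fn build(keys: &[i64], num_buckets: usize) -> Self {
        assert!(num_buckets.is_power_of_two());
        let mut buckets = vec![Vec::new(); num_buckets];
        for (row, &key) in keys.iter().enumerate() {
            let slot = (hash_key(key) as usize) & (num_buckets - 1);
            buckets[slot].push(row);
        }
        BucketTable { buckets }
    }

    /// Return the build-side rows whose key equals `key`.
    /// Different keys can share a bucket, so every candidate in the
    /// bucket must be compared against the probe key.
    fn probe(&self, build_keys: &[i64], key: i64) -> Vec<usize> {
        let slot = (hash_key(key) as usize) & (self.buckets.len() - 1);
        self.buckets[slot]
            .iter()
            .copied()
            .filter(|&row| build_keys[row] == key)
            .collect()
    }
}
```

   With few buckets relative to the number of distinct keys, each bucket holds rows for several unrelated keys, and the `filter` step in `probe` does correspondingly more comparisons. That extra work is one plausible source of the collision cost mentioned above.
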
   **Additional context**
   
   https://www.cockroachlabs.com/blog/vectorized-hash-joiner/
   https://dare.uva.nl/search?identifier=5ccbb60a-38b8-4eeb-858a-e7735dd37487


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

