[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #6679: Improve performance/memory usage of HashJoin datastructure (5-15% improvement on selected TPC-H queries)

via GitHub Fri, 16 Jun 2023 12:28:01 -0700


Dandandan commented on code in PR #6679:
URL: https://github.com/apache/arrow-datafusion/pull/6679#discussion_r1232651290



##########
datafusion/core/src/physical_plan/joins/hash_join_utils.rs:
##########
@@ -36,24 +36,47 @@ use crate::physical_plan::joins::utils::{JoinFilter, 
JoinSide};
 use datafusion_common::Result;
 
 // Maps a `u64` hash value based on the build side ["on" values] to a list of 
indices with this key's value.
-//
-// Note that the `u64` keys are not stored in the hashmap (hence the `()` as 
key), but are only used
-// to put the indices in a certain bucket.
 // By allocating a `HashMap` with capacity for *at least* the number of rows 
for entries at the build side,
 // we make sure that we don't have to re-hash the hashmap, which needs access 
to the key (the hash in this case) value.
 // E.g. 1 -> [3, 6, 8] indicates that the column values map to rows 3, 6 and 8 
for hash value 1
 // As the key is a hash value, we need to check possible hash collisions in 
the probe stage
 // During this stage it might be the case that a row is contained the same 
hashmap value,
 // but the values don't match. Those are checked in the [equal_rows] macro
-// TODO: speed up collision check and move away from using a hashbrown HashMap
+// The indices (values) are stored in a separate chained list stored in the 
`Vec<u64>`.
+// The first value (+1) is stored in the hashmap, whereas the next value is 
stored in array at the position value.
+// The chain can be followed until the value "0" has been reached, meaning the 
end of the list.

Review Comment:
   Yeah, makes sense, will add it!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #6679: Improve performance/memory usage of HashJoin datastructure (5-15% improvement on selected TPC-H queries)

Reply via email to