Dandandan commented on code in PR #6679:
URL: https://github.com/apache/arrow-datafusion/pull/6679#discussion_r1232368902
##########
datafusion/core/src/physical_plan/joins/hash_join_utils.rs:
##########
@@ -36,24 +36,47 @@ use crate::physical_plan::joins::utils::{JoinFilter,
JoinSide};
use datafusion_common::Result;
// Maps a `u64` hash value based on the build side ["on" values] to a list of
indices with this key's value.
-//
-// Note that the `u64` keys are not stored in the hashmap (hence the `()` as
key), but are only used
-// to put the indices in a certain bucket.
// By allocating a `HashMap` with capacity for *at least* the number of rows
for entries at the build side,
// we make sure that we don't have to re-hash the hashmap, which needs access
to the key (the hash in this case) value.
// E.g. 1 -> [3, 6, 8] indicates that the column values map to rows 3, 6 and 8
for hash value 1
// As the key is a hash value, we need to check possible hash collisions in
the probe stage
// During this stage it might be the case that a row is contained the same
hashmap value,
// but the values don't match. Those are checked in the [equal_rows] macro
-// TODO: speed up collision check and move away from using a hashbrown HashMap
+// The indices (values) are stored in a separate chained list stored in the
`Vec<u64>`.
+// The first value (+1) is stored in the hashmap, whereas the next value is
stored in array at the position value.
+// The chain can be followed until the value "0" has been reached, meaning the
end of the list.
+// Also see chapter 5.3 of [Balancing vectorized query execution with
bandwidth-optimized
storage](https://dare.uva.nl/search?identifier=5ccbb60a-38b8-4eeb-858a-e7735dd37487)
+// TODO: speed up collision checks
// https://github.com/apache/arrow-datafusion/issues/50
-pub struct JoinHashMap(pub RawTable<(u64, SmallVec<[u64; 1]>)>);
+pub struct JoinHashMap {
+ // Stores hash value to first index
+ pub map: RawTable<(u64, u64)>,
+ // Stores indices in chained list data structure
+ pub next: Vec<u64>,
+}
+
+/// SymmetricJoinHashMap is similar to JoinHashMap, except that it stores the
indices inline, allowing it to mutate
+/// and shrink the indices.
+pub struct SymmetricJoinHashMap(pub RawTable<(u64, SmallVec<[u64; 1]>)>);
Review Comment:
@berkaysynnada not sure if `SmallVec` is optimal. It might be an improvement
to use `Vec` here as the >1 case probably occurs more often here?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]