Dandandan commented on code in PR #6679:
URL: https://github.com/apache/arrow-datafusion/pull/6679#discussion_r1232651290
##########
datafusion/core/src/physical_plan/joins/hash_join_utils.rs:
##########
@@ -36,24 +36,47 @@ use crate::physical_plan::joins::utils::{JoinFilter,
JoinSide};
use datafusion_common::Result;
// Maps a `u64` hash value based on the build side ["on" values] to a list of
indices with this key's value.
-//
-// Note that the `u64` keys are not stored in the hashmap (hence the `()` as
key), but are only used
-// to put the indices in a certain bucket.
// By allocating a `HashMap` with capacity for *at least* the number of rows
for entries at the build side,
// we make sure that we don't have to re-hash the hashmap, which needs access
to the key (the hash in this case) value.
// E.g. 1 -> [3, 6, 8] indicates that the column values map to rows 3, 6 and 8
for hash value 1
// As the key is a hash value, we need to check possible hash collisions in
the probe stage
// During this stage it might be the case that a row is contained the same
hashmap value,
// but the values don't match. Those are checked in the [equal_rows] macro
-// TODO: speed up collision check and move away from using a hashbrown HashMap
+// The indices (values) are stored in a separate chained list stored in the
`Vec<u64>`.
+// The first value (+1) is stored in the hashmap, whereas the next value is
stored in array at the position value.
+// The chain can be followed until the value "0" has been reached, meaning the
end of the list.
Review Comment:
Yeah, makes sense, will add it!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]