adriangb commented on code in PR #18718:
URL: https://github.com/apache/datafusion/pull/18718#discussion_r2535649120
##########
datafusion/common/src/hash_utils.rs:
##########
@@ -329,6 +329,40 @@ where
Ok(())
}
+#[cfg(not(feature = "force_hash_collisions"))]
+fn hash_union_array(
+    array: &UnionArray,
+    random_state: &RandomState,
+    hashes_buffer: &mut [u64],
+) -> Result<()> {
+    use std::collections::HashMap;
+
+    let DataType::Union(union_fields, _mode) = array.data_type() else {
+        unreachable!()
+    };
+
+    let mut child_hashes = HashMap::with_capacity(union_fields.len());
+
+    for (type_id, _field) in union_fields.iter() {
+        let child = array.child(type_id);
+        let mut child_hash_buffer = vec![0; child.len()];
+        create_hashes([child], random_state, &mut child_hash_buffer)?;
+
+        child_hashes.insert(type_id, child_hash_buffer);
+    }
Review Comment:
I think that's reasonable: going from broken to working generally
shouldn't be held up by perf considerations.
But yes, @Jefffrey, I think your insight is correct in general. We were
also talking in https://github.com/apache/datafusion/pull/18449 about adding
APIs that would let the logic in `hash_utils` be applied to values one at a
time (e.g. a specialized `Hasher<T>` sort of thing). Even if that would
perform better, right now there are no APIs to do that.
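
To make the trade-off concrete, here is a rough, std-only sketch of the two
strategies. Everything in it is an assumption for illustration: the
`ValueHasher` trait, `StdValueHasher`, and both functions are hypothetical
names, not DataFusion or arrow APIs, and the union is modelled as a toy
sparse layout (plain `Vec<i64>` children, each as long as the union itself).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical per-value hasher; nothing like this exists in `hash_utils` today.
trait ValueHasher<T: ?Sized> {
    fn hash_one(&self, value: &T) -> u64;
}

/// Toy implementation backed by the standard library's `DefaultHasher`.
struct StdValueHasher;

impl<T: Hash + ?Sized> ValueHasher<T> for StdValueHasher {
    fn hash_one(&self, value: &T) -> u64 {
        let mut hasher = DefaultHasher::new();
        value.hash(&mut hasher);
        hasher.finish()
    }
}

/// Column at a time: hash every slot of every child, then pick the hash of
/// the child selected by each row's type id (roughly the shape of the diff).
fn hash_union_by_column(
    type_ids: &[usize],
    children: &[Vec<i64>],
    hasher: &impl ValueHasher<i64>,
) -> Vec<u64> {
    let child_hashes: Vec<Vec<u64>> = children
        .iter()
        .map(|child| child.iter().map(|v| hasher.hash_one(v)).collect())
        .collect();
    type_ids
        .iter()
        .enumerate()
        .map(|(row, &type_id)| child_hashes[type_id][row])
        .collect()
}

/// Row at a time: hash only the value each row actually selects.
fn hash_union_by_row(
    type_ids: &[usize],
    children: &[Vec<i64>],
    hasher: &impl ValueHasher<i64>,
) -> Vec<u64> {
    type_ids
        .iter()
        .enumerate()
        .map(|(row, &type_id)| hasher.hash_one(&children[type_id][row]))
        .collect()
}
```

Both functions return the same hashes; the row-wise version just skips the
slots that no row selects, which is the part that would need a per-value API.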
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]