adriangb commented on code in PR #18718:
URL: https://github.com/apache/datafusion/pull/18718#discussion_r2535649120


##########
datafusion/common/src/hash_utils.rs:
##########
@@ -329,6 +329,40 @@ where
     Ok(())
 }
 
+#[cfg(not(feature = "force_hash_collisions"))]
+fn hash_union_array(
+    array: &UnionArray,
+    random_state: &RandomState,
+    hashes_buffer: &mut [u64],
+) -> Result<()> {
+    use std::collections::HashMap;
+
+    let DataType::Union(union_fields, _mode) = array.data_type() else {
+        unreachable!()
+    };
+
+    let mut child_hashes = HashMap::with_capacity(union_fields.len());
+
+    for (type_id, _field) in union_fields.iter() {
+        let child = array.child(type_id);
+        let mut child_hash_buffer = vec![0; child.len()];
+        create_hashes([child], random_state, &mut child_hash_buffer)?;
+
+        child_hashes.insert(type_id, child_hash_buffer);
+    }

Review Comment:
   I think that's reasonable - going from broken -> working generally 
shouldn't be held up by perf considerations.
   
   But also yes @Jefffrey, I think your insight is correct in general. We were 
also talking in https://github.com/apache/datafusion/pull/18449 about adding 
APIs that would let callers reuse all of the logic in `hash_utils` while hashing 
values one at a time (e.g. a specialized `Hasher<T>` sort of thing). Even if 
that would perform better, right now there are no APIs to do that.
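
   To make the trade-off concrete, here is a rough, std-only sketch. It is not 
DataFusion code: `ToyUnion`, `hash_one`, `hash_children_then_select` and 
`hash_row_by_row` are hypothetical names, `DefaultHasher` stands in for the real 
`RandomState`-based hashing, and the per-row selection step is an assumption 
because the quoted hunk only shows the child hashing loop. The first function 
hashes every child in full and then picks one hash per row; the second hashes 
only the slot each row actually uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a sparse union column: every child has one slot
/// per row, and `type_ids[row]` says which child holds that row's value.
struct ToyUnion {
    type_ids: Vec<i8>,
    ints: Vec<i64>,    // child with type id 0
    strs: Vec<String>, // child with type id 1
}

/// Hash a single value with std's default hasher (assumption: this replaces
/// the `RandomState`-based hashing used in `hash_utils`).
fn hash_one<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Batch shape: hash each child wholesale, then select per row by type id.
/// Slots that no row selects are hashed anyway.
fn hash_children_then_select(u: &ToyUnion) -> Vec<u64> {
    let mut child_hashes: HashMap<i8, Vec<u64>> = HashMap::with_capacity(2);
    child_hashes.insert(0, u.ints.iter().map(|v| hash_one(v)).collect());
    child_hashes.insert(1, u.strs.iter().map(|v| hash_one(v)).collect());

    u.type_ids
        .iter()
        .copied()
        .enumerate()
        .map(|(row, type_id)| child_hashes[&type_id][row])
        .collect()
}

/// Row-at-a-time shape: hash only the slot that each row's type id selects.
fn hash_row_by_row(u: &ToyUnion) -> Vec<u64> {
    u.type_ids
        .iter()
        .copied()
        .enumerate()
        .map(|(row, type_id)| match type_id {
            0 => hash_one(&u.ints[row]),
            1 => hash_one(&u.strs[row]),
            other => panic!("unknown type id {}", other),
        })
        .collect()
}
```

   Both functions return the same hashes for the same input; the difference is 
only how much work is done on unselected slots, and the second shape needs a 
per-value hashing entry point, which is the kind of API that does not exist in 
`hash_utils` today.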



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

