Jefffrey commented on code in PR #18718:
URL: https://github.com/apache/datafusion/pull/18718#discussion_r2533083658


##########
datafusion/common/src/hash_utils.rs:
##########
@@ -1000,4 +1038,83 @@ mod tests {
 
         assert_eq!(hashes1, hashes2);
     }
+
+    #[test]
+    #[cfg(not(feature = "force_hash_collisions"))]
+    fn create_hashes_for_sparse_union_arrays() {
+        // Create a sparse union array with int and string types
+        // In sparse mode, row i uses child_array[i]
+        // Logical array: [int(5), str("foo"), int(10), int(5)]

Review Comment:
   Perhaps also add a test where a child value is `null`?
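
To make the suggestion concrete, here is a minimal std-only sketch of what such a test might check. It does not use the Arrow or DataFusion APIs: `SparseUnion`, `hash_one`, `hash_rows`, and `sample_union` are hypothetical stand-ins, and treating a null slot as hash `0` is an assumption of the sketch, not a statement about what `create_hashes` does.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a sparse union with an int child and a string child.
/// Every child holds one slot per row; `type_ids[i]` selects the child that row `i` reads.
struct SparseUnion {
    type_ids: Vec<i8>,
    ints: Vec<Option<i32>>,
    strs: Vec<Option<&'static str>>,
}

/// Hash a single value with the standard library's default hasher.
fn hash_one<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Hash each row from its selected child.
/// Assumption of this sketch: a null slot hashes to 0.
fn hash_rows(union: &SparseUnion) -> Vec<u64> {
    union
        .type_ids
        .iter()
        .enumerate()
        .map(|(row, &type_id)| match type_id {
            0 => union.ints[row].map_or(0, |v| hash_one(&v)),
            1 => union.strs[row].map_or(0, |v| hash_one(&v)),
            other => panic!("invalid type_id {other}"),
        })
        .collect()
}

/// Logical array: [int(5), str(null), int(null), int(5), str("foo")]
fn sample_union() -> SparseUnion {
    SparseUnion {
        type_ids: vec![0, 1, 0, 0, 1],
        ints: vec![Some(5), None, None, Some(5), None],
        strs: vec![None, None, None, None, Some("foo")],
    }
}
```

A real test would build a `UnionArray` whose children contain nulls and assert the analogous properties: equal non-null rows hash equally, and null rows are handled consistently regardless of which child they come from.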



##########
datafusion/common/src/hash_utils.rs:
##########
@@ -329,6 +329,40 @@ where
     Ok(())
 }
 
+#[cfg(not(feature = "force_hash_collisions"))]
+fn hash_union_array(
+    array: &UnionArray,
+    random_state: &RandomState,
+    hashes_buffer: &mut [u64],
+) -> Result<()> {
+    use std::collections::HashMap;
+
+    let DataType::Union(union_fields, _mode) = array.data_type() else {
+        unreachable!()
+    };
+
+    let mut child_hashes = HashMap::with_capacity(union_fields.len());
+
+    for (type_id, _field) in union_fields.iter() {
+        let child = array.child(type_id);
+        let mut child_hash_buffer = vec![0; child.len()];
+        create_hashes([child], random_state, &mut child_hash_buffer)?;
+
+        child_hashes.insert(type_id, child_hash_buffer);
+    }

Review Comment:
   I wonder whether this can be done without hashing every row of every 
child array upfront. Or does the benefit of vectorization make the upfront 
cost worth it?
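
As a rough illustration of the trade-off, the std-only sketch below models a sparse union as plain vectors and compares the two strategies. Everything here is hypothetical (`hash_value`, `hash_upfront`, `hash_selected`, and the assumption that type ids are `0..children.len()`); it ignores the per-batch dispatch and SIMD effects that the real `create_hashes` benefits from, so it only shows how much hashing work each strategy does.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash a single value with the standard library's default hasher.
fn hash_value(value: i64) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Upfront strategy: hash every slot of every child, then pick one hash per row.
/// Returns the row hashes and the number of values hashed.
fn hash_upfront(type_ids: &[i8], children: &[Vec<i64>]) -> (Vec<u64>, usize) {
    let mut hashed = 0;
    let child_hashes: Vec<Vec<u64>> = children
        .iter()
        .map(|child| {
            hashed += child.len();
            child.iter().map(|&v| hash_value(v)).collect()
        })
        .collect();
    let hashes = type_ids
        .iter()
        .enumerate()
        .map(|(row, &type_id)| child_hashes[type_id as usize][row])
        .collect();
    (hashes, hashed)
}

/// Selective strategy: bucket row indices by child, then hash only the selected slots.
/// Returns the row hashes and the number of values hashed.
fn hash_selected(type_ids: &[i8], children: &[Vec<i64>]) -> (Vec<u64>, usize) {
    let mut rows_per_child: Vec<Vec<usize>> = vec![Vec::new(); children.len()];
    for (row, &type_id) in type_ids.iter().enumerate() {
        rows_per_child[type_id as usize].push(row);
    }
    let mut hashes = vec![0u64; type_ids.len()];
    let mut hashed = 0;
    for (child, rows) in children.iter().zip(&rows_per_child) {
        for &row in rows {
            hashes[row] = hash_value(child[row]);
            hashed += 1;
        }
    }
    (hashes, hashed)
}
```

For a sparse union with `k` children and `n` rows, the upfront version hashes `k * n` values while the selective one hashes `n`, at the cost of an extra pass to bucket the rows. Whether that wins in practice would need a benchmark.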



##########
datafusion/common/src/hash_utils.rs:
##########
@@ -329,6 +329,40 @@ where
     Ok(())
 }
 
+#[cfg(not(feature = "force_hash_collisions"))]
+fn hash_union_array(
+    array: &UnionArray,
+    random_state: &RandomState,
+    hashes_buffer: &mut [u64],
+) -> Result<()> {
+    use std::collections::HashMap;
+
+    let DataType::Union(union_fields, _mode) = array.data_type() else {
+        unreachable!()
+    };
+
+    let mut child_hashes = HashMap::with_capacity(union_fields.len());
+
+    for (type_id, _field) in union_fields.iter() {
+        let child = array.child(type_id);
+        let mut child_hash_buffer = vec![0; child.len()];
+        create_hashes([child], random_state, &mut child_hash_buffer)?;
+
+        child_hashes.insert(type_id, child_hash_buffer);
+    }
+
+    #[allow(clippy::needless_range_loop)]
+    for i in 0..array.len() {
+        let type_id = array.type_id(i);
+        let child_offset = array.value_offset(i);
+
+        let child_hash = child_hashes.get(&type_id).expect("invalid type_id");
+        hashes_buffer[i] = child_hash[child_offset];

Review Comment:
   I notice other hash functions seem to use `combine_hashes` when writing 
into `hashes_buffer`:
   
   
https://github.com/apache/datafusion/blob/4198c5a74fa8ea8dbe2623145d6e38d5cbfeb28c/datafusion/common/src/hash_utils.rs#L254-L257
   
   Is this something we need to consider here?
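
To spell out the concern, here is a small std-only sketch of the difference between overwriting the buffer and folding into it. The `combine` function is a stand-in written for this example (the project's `combine_hashes` may mix bits differently), and `write_hashes` with its `rehash` flag is a hypothetical helper that only mirrors the general pattern.

```rust
/// Stand-in combiner for this sketch; the real `combine_hashes` may differ.
fn combine(l: u64, r: u64) -> u64 {
    let hash = (17 * 37u64).wrapping_add(l);
    hash.wrapping_mul(37).wrapping_add(r)
}

/// Write one column's per-row hashes into `hashes_buffer`.
/// With `rehash == false` the buffer is overwritten (suitable for the first column).
/// With `rehash == true` each row hash is folded into the value already in the buffer.
fn write_hashes(column_hashes: &[u64], hashes_buffer: &mut [u64], rehash: bool) {
    for (slot, &column_hash) in hashes_buffer.iter_mut().zip(column_hashes) {
        *slot = if rehash {
            combine(column_hash, *slot)
        } else {
            column_hash
        };
    }
}
```

If the union column is not the first column being hashed, a plain assignment like `hashes_buffer[i] = ...` would discard whatever earlier columns contributed, so two rows that differ only in an earlier column would end up with the same hash.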



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

