viirya commented on code in PR #424:
URL: https://github.com/apache/datafusion-comet/pull/424#discussion_r1600836820


##########
core/src/execution/datafusion/spark_hash.rs:
##########
@@ -193,27 +241,67 @@ macro_rules! hash_array_decimal {
 fn create_hashes_dictionary<K: ArrowDictionaryKeyType>(
     array: &ArrayRef,
     hashes_buffer: &mut [u32],
+    multi_col: bool,
 ) -> Result<()> {
     let dict_array = array.as_any().downcast_ref::<DictionaryArray<K>>().unwrap();
+    if multi_col {
+        // unpack the dictionary array as each row may have a different hash input
+        let unpacked = take(dict_array.values().as_ref(), dict_array.keys(), None)?;
+        create_hashes(&[unpacked], hashes_buffer)?;
+    } else {
+        // Hash each dictionary value once, and then use that computed
+        // hash for each key value to avoid a potentially expensive
+        // redundant hashing for large dictionary elements (e.g. strings)
+        let dict_values = Arc::clone(dict_array.values());
+        // same initial seed as Spark
+        let mut dict_hashes = vec![42; dict_values.len()];
+        create_hashes(&[dict_values], &mut dict_hashes)?;

Review Comment:
   The dictionary values are used to pre-compute hashes here. But the order of the unpacked values (the values looked up through the keys) can differ from the order of the dictionary values, so the pre-computed hashes cannot be used positionally.
   
   For example, if the dictionary values are [2, 1, 4, 3, 5] and the dictionary keys are [1, 0, 0, 1, 2, 4, 3, 3], we pre-compute hashes for the values in the order [2, 1, 4, 3, 5], but the unpacked values are actually [1, 2, 2, 1, 4, 5, 3, 3].
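
To make the ordering concern concrete, here is a minimal sketch that uses plain slices instead of Arrow arrays. The names `toy_hash`, `unpack`, and `hash_via_dictionary` are hypothetical and not part of the PR, and `toy_hash` is only a stand-in, not Spark's Murmur3. It shows that hashes pre-computed per dictionary value line up with the rows only when they are looked up through the keys.

```rust
/// Stand-in hash for illustration only (assumption: NOT Spark's Murmur3).
fn toy_hash(value: i32, seed: u32) -> u32 {
    seed.wrapping_mul(31).wrapping_add(value as u32)
}

/// Expand a dictionary into per-row values: row i holds values[keys[i]].
fn unpack(values: &[i32], keys: &[usize]) -> Vec<i32> {
    keys.iter().map(|&k| values[k]).collect()
}

/// Hash each dictionary value once, then pick the hash for each row via its key.
fn hash_via_dictionary(values: &[i32], keys: &[usize], seed: u32) -> Vec<u32> {
    // One hash per dictionary value, in dictionary order.
    let dict_hashes: Vec<u32> = values.iter().map(|&v| toy_hash(v, seed)).collect();
    // Row order is defined by the keys, not by the dictionary order.
    keys.iter().map(|&k| dict_hashes[k]).collect()
}

fn main() {
    let values = [2, 1, 4, 3, 5];
    let keys = [1, 0, 0, 1, 2, 4, 3, 3];

    // The rows as they would appear after unpacking the dictionary.
    let unpacked = unpack(&values, &keys);
    assert_eq!(unpacked, vec![1, 2, 2, 1, 4, 5, 3, 3]);

    // Hashing the unpacked rows directly gives the reference result.
    let per_row: Vec<u32> = unpacked.iter().map(|&v| toy_hash(v, 42)).collect();

    // Looking up pre-computed hashes through the keys matches that reference.
    assert_eq!(hash_via_dictionary(&values, &keys, 42), per_row);
    println!("unpacked rows: {:?}", unpacked);
}
```

The sketch assumes every key is a valid index and ignores nulls, which a real implementation would need to handle.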



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

