viirya commented on code in PR #424:
URL: https://github.com/apache/datafusion-comet/pull/424#discussion_r1600503628


##########
core/src/execution/datafusion/spark_hash.rs:
##########
@@ -238,55 +241,67 @@ macro_rules! hash_array_decimal {
 fn create_hashes_dictionary<K: ArrowDictionaryKeyType>(
     array: &ArrayRef,
     hashes_buffer: &mut [u32],
+    multi_col: bool,
 ) -> Result<()> {
     let dict_array = array.as_any().downcast_ref::<DictionaryArray<K>>().unwrap();
-
-    // Hash each dictionary value once, and then use that computed
-    // hash for each key value to avoid a potentially expensive
-    // redundant hashing for large dictionary elements (e.g. strings)
-    let dict_values = Arc::clone(dict_array.values());
-    let mut dict_hashes = vec![0; dict_values.len()];
-    create_hashes(&[dict_values], &mut dict_hashes)?;
-
-    for (hash, key) in hashes_buffer.iter_mut().zip(dict_array.keys().iter()) {
-        if let Some(key) = key {
-            let idx = key.to_usize().ok_or_else(|| {
-                DataFusionError::Internal(format!(
-                    "Can not convert key value {:?} to usize in dictionary of type {:?}",
-                    key,
-                    dict_array.data_type()
-                ))
-            })?;
-            *hash = dict_hashes[idx]
-        } // no update for Null, consistent with other hashes
+    if multi_col {
+        // unpack the dictionary array as each row may have a different hash input
+        let unpacked = take(dict_array.values().as_ref(), dict_array.keys(), None)?;
+        create_hashes(&[unpacked], hashes_buffer)?;
+    } else {

Review Comment:
   `multi_col` is true for every column after the first. Why do we need to unpack the dictionary array specifically for those columns?
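
   For context, one plausible reading of the code comment in the diff ("each row may have a different hash input"): in a Spark-style multi-column hash, the running hash of the earlier columns is the seed for the next column, so two rows that share a dictionary key can still need different hashes. The sketch below illustrates that with a toy seeded hash. `toy_hash`, `hash_unpacked`, and `hash_per_value` are hypothetical names, and the toy hash is a stand-in, not Murmur3 and not the code in this PR.

```rust
/// Toy seeded hash, a stand-in for a real seeded hash such as Murmur3.
/// Assumption: only the "seed changes the result" property matters here.
fn toy_hash(value: &str, seed: u32) -> u32 {
    let mut h = seed;
    for b in value.bytes() {
        h = h.wrapping_mul(31).wrapping_add(b as u32);
    }
    h
}

/// Hash a dictionary-encoded column row by row. Each row's current entry in
/// `hashes` is used as the seed, which is what chaining across columns needs.
fn hash_unpacked(values: &[&str], keys: &[usize], hashes: &mut [u32]) {
    for (hash, &key) in hashes.iter_mut().zip(keys) {
        *hash = toy_hash(values[key], *hash);
    }
}

/// Hash each dictionary value once with a single shared seed, then copy the
/// result to every row by key. Only valid when all rows share the same seed.
fn hash_per_value(values: &[&str], keys: &[usize], seed: u32, hashes: &mut [u32]) {
    let dict_hashes: Vec<u32> = values.iter().map(|v| toy_hash(v, seed)).collect();
    for (hash, &key) in hashes.iter_mut().zip(keys) {
        *hash = dict_hashes[key];
    }
}
```

   With a single column every row starts from the same seed, so the per-value shortcut and the row-by-row path agree. For a later column the seeds already differ per row, so one cached hash per dictionary value is no longer enough.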



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

