viirya commented on code in PR #424:
URL: https://github.com/apache/datafusion-comet/pull/424#discussion_r1600867066


##########
core/src/execution/datafusion/spark_hash.rs:
##########
@@ -193,27 +241,67 @@ macro_rules! hash_array_decimal {
 fn create_hashes_dictionary<K: ArrowDictionaryKeyType>(
     array: &ArrayRef,
     hashes_buffer: &mut [u32],
+    multi_col: bool,
 ) -> Result<()> {
     let dict_array = array.as_any().downcast_ref::<DictionaryArray<K>>().unwrap();
+    if multi_col {
+        // unpack the dictionary array as each row may have a different hash input
+        let unpacked = take(dict_array.values().as_ref(), dict_array.keys(), None)?;
+        create_hashes(&[unpacked], hashes_buffer)?;
+    } else {
+        // Hash each dictionary value once, and then use that computed
+        // hash for each key value to avoid a potentially expensive
+        // redundant hashing for large dictionary elements (e.g. strings)
+        let dict_values = Arc::clone(dict_array.values());
+        // same initial seed as Spark
+        let mut dict_hashes = vec![42; dict_values.len()];
+        create_hashes(&[dict_values], &mut dict_hashes)?;

Review Comment:
   But when we compute hashes, the next hash is computed from the current value
   and the previous hash value. For dictionary values, however, we compute the
   hashes in "dictionary values" order instead of row value order.
   
   In other words, we compute the hashes of dictionary values like:
   
   ```
   dict_hashes[0] = hash(2, init_seed)
   dict_hashes[1] = hash(1, dict_hashes[0])
   dict_hashes[2] = hash(4, dict_hashes[1]) 
   dict_hashes[3] = hash(3, dict_hashes[2]) 
   dict_hashes[4] = hash(5, dict_hashes[3]) 
   ```
   
   So,
   
   ```
   row_id = 0, dict_key = 1, dict_hashes[1] = hash(1, dict_hashes[0]) but it should be hash(1, init_seed)
   row_id = 1, dict_key = 0, dict_hashes[0] = hash(2, init_seed) but it should be hash(2, hash(1, init_seed))
   row_id = 2, dict_key = 0, dict_hashes[0] = hash(2, init_seed) but it should be hash(2, hash(2, hash(1, init_seed)))
   ...
   ```
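
The mismatch described above can be reproduced with a small standalone sketch. It models the chaining exactly as this comment describes it (each hash seeded with the previous hash) and does not claim to mirror what `create_hashes` does internally. `toy_hash` and `chained_hashes` are hypothetical helpers, and `toy_hash` is a stand-in mixer, not Spark's Murmur3.

```rust
// Hypothetical stand-in for a seeded hash; NOT Spark's Murmur3.
fn toy_hash(value: u32, seed: u32) -> u32 {
    (seed.rotate_left(5) ^ value).wrapping_mul(0x9E37_79B1)
}

// Hypothetical helper: hash `values` in the given order, seeding each
// hash with the previous one, as the comment above describes.
fn chained_hashes(values: &[u32], init_seed: u32) -> Vec<u32> {
    let mut seed = init_seed;
    values
        .iter()
        .map(|&v| {
            seed = toy_hash(v, seed);
            seed
        })
        .collect()
}

fn main() {
    let init_seed = 42;
    // Dictionary values and row keys taken from the example above.
    let dict_values = [2u32, 1, 4, 3, 5];
    let dict_keys = [1usize, 0, 0];

    // Hashes chained in dictionary order, then looked up per row by key.
    let dict_hashes = chained_hashes(&dict_values, init_seed);
    let via_dict: Vec<u32> = dict_keys.iter().map(|&k| dict_hashes[k]).collect();

    // Hashes chained in row order over the unpacked values [1, 2, 2].
    let row_values: Vec<u32> = dict_keys.iter().map(|&k| dict_values[k]).collect();
    let via_rows = chained_hashes(&row_values, init_seed);

    // Under this chaining model the two approaches disagree.
    assert_ne!(via_dict, via_rows);
    println!("via dictionary: {:?}", via_dict);
    println!("via rows:       {:?}", via_rows);
}
```

Under that model, rows 1 and 2 also receive the same hash on the dictionary path because both have `dict_key = 0`, whereas the row-order path gives them different hashes.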
   
   


