advancedxy commented on code in PR #424: URL: https://github.com/apache/datafusion-comet/pull/424#discussion_r1600892577
##########
core/src/execution/datafusion/spark_hash.rs:
##########
@@ -193,27 +241,67 @@ macro_rules! hash_array_decimal {
 fn create_hashes_dictionary<K: ArrowDictionaryKeyType>(
     array: &ArrayRef,
     hashes_buffer: &mut [u32],
+    multi_col: bool,
 ) -> Result<()> {
     let dict_array = array.as_any().downcast_ref::<DictionaryArray<K>>().unwrap();
+    if multi_col {
+        // unpack the dictionary array as each row may have a different hash input
+        let unpacked = take(dict_array.values().as_ref(), dict_array.keys(), None)?;
+        create_hashes(&[unpacked], hashes_buffer)?;
+    } else {
+        // Hash each dictionary value once, and then use that computed
+        // hash for each key value to avoid a potentially expensive
+        // redundant hashing for large dictionary elements (e.g. strings)
+        let dict_values = Arc::clone(dict_array.values());
+        // same initial seed as Spark
+        let mut dict_hashes = vec![42; dict_values.len()];
+        create_hashes(&[dict_values], &mut dict_hashes)?;

Review Comment:
> so
> ```
> row_id = 0, dict_key = 1, dict_hashes[1] = hash(1, dict_hashes[0]) but it should be hash(1, init_seed)
> row_id = 1, dict_key = 0, dict_hashes[0] = hash(2, init_seed) but it should be hash(2, hash(1, init_seed))
> row_id = 2, dict_key = 0 dict_hashes[0] = hash(2, init_seed) but it should be hash(2, hash(2, hash(1, init_seed)))
> ...
> ```

Hmm, I think you may have confused row order with column order, which is indeed quite confusing.

For a function call like `hash(col_a, col_b)`, the actual hash is `murmur3(col_b, murmur3(col_a, 42))`, where 42 is the initial seed and the hash of the previous column is fed in as the seed for the next one. The chaining order is determined by the **input cols** (parameters) and is not related to the _row order_. For rows in the RecordBatch/ArrowArray, each hash is computed from scratch; there is no dependency between values in different rows.
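
To make the distinction concrete, here is a minimal sketch of column-chained, row-independent hashing. It is not the Comet implementation: `toy_hash`, `hash_column`, `hash_rows` and `hash_dictionary_first_col` are hypothetical names, columns are modelled as plain `u32` vectors, and `toy_hash` is a placeholder for murmur3 that does not reproduce Spark's values. Only the seeding structure is meant to match the description above.

```rust
/// Placeholder seeded hash. It stands in for murmur3 and does NOT
/// reproduce Spark's hash values (assumption for illustration only).
fn toy_hash(value: u32, seed: u32) -> u32 {
    (seed ^ value.wrapping_mul(0x9E37_79B9))
        .rotate_left(13)
        .wrapping_mul(5)
        .wrapping_add(0xE654_6B64)
}

/// Folds one column into the per-row buffer. Row `i` is seeded only by
/// `hashes[i]`, never by the hash of another row.
fn hash_column(values: &[u32], hashes: &mut [u32]) {
    for (hash, value) in hashes.iter_mut().zip(values) {
        *hash = toy_hash(*value, *hash);
    }
}

/// Models `hash(col_a, col_b, ...)`: every row starts from seed 42 and
/// the columns are chained left to right.
fn hash_rows(columns: &[Vec<u32>], num_rows: usize) -> Vec<u32> {
    let mut hashes = vec![42u32; num_rows];
    for column in columns {
        hash_column(column, &mut hashes);
    }
    hashes
}

/// Dictionary shortcut for a first column: hash each dictionary value once
/// with the initial seed, then look the result up by key.
fn hash_dictionary_first_col(dict_values: &[u32], keys: &[usize]) -> Vec<u32> {
    let mut dict_hashes = vec![42u32; dict_values.len()];
    hash_column(dict_values, &mut dict_hashes);
    keys.iter().map(|&key| dict_hashes[key]).collect()
}
```

With dictionary values `[2, 1, 4, 3, 5]` and keys `[1, 0, 0]`, `hash_dictionary_first_col` gives the same result as `hash_rows` on the unpacked column `[1, 2, 2]`, because every row of a first column shares the seed 42. For a later column the seed differs per row, so the shortcut no longer applies, which is consistent with the `multi_col` branch in the diff unpacking the dictionary first.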

So the following computation is not how it currently works:

> dict_hashes[0] = hash(2, init_seed)
> dict_hashes[1] = hash(1, dict_hashes[0])
> dict_hashes[2] = hash(4, dict_hashes[1])
> dict_hashes[3] = hash(3, dict_hashes[2])
> dict_hashes[4] = hash(5, dict_hashes[3])

Instead, it currently works as:
```
dict_hashes[0] = hash(2, init_seed)
dict_hashes[1] = hash(1, init_seed)
dict_hashes[2] = hash(4, init_seed)
dict_hashes[3] = hash(3, init_seed)
dict_hashes[4] = hash(5, init_seed)
```

The hashes of the first col are then fed as the seeds for the next cols in the same rows.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org