Dandandan commented on a change in pull request #68: URL: https://github.com/apache/arrow-datafusion/pull/68#discussion_r642554876
########## File path: datafusion/src/physical_plan/hash_aggregate.rs ########## @@ -339,6 +325,36 @@ pin_project! { } } +fn hash_(group_values: &[ArrayRef]) -> Result<MutableBuffer<u64>> { + // compute the hashes + // todo: we should be able to use `MutableBuffer<u64>` to compute the hash and ^ them without + // allocating all the hashes before ^ them + let hashes = group_values + .iter() + .map(|x| { + let a = match x.data_type() { + DataType::Dictionary(_, d) => { + // todo: think about how to perform this more efficiently + // * first hash, then unpack + // * do not unpack at all, and instead figure out a way to leverage dictionary-encoded. + let unpacked = arrow2::compute::cast::cast(x.as_ref(), d)?; + arrow2::compute::hash::hash(unpacked.as_ref()) + } + _ => arrow2::compute::hash::hash(x.as_ref()), + }; + Ok(a?) + }) + .collect::<Result<Vec<_>>>()?; + let hash = MutableBuffer::<u64>::from(hashes[0].values()); + + Ok(hashes.iter().skip(1).fold(hash, |mut acc, x| { + acc.iter_mut() + .zip(x.values().iter()) + .for_each(|(hash, other)| *hash ^= other); Review comment: Simple example is that combining two same hashes always map to `0` - regardless of which value they have and in which order they appear. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org