rluvaton opened a new issue, #6996: URL: https://github.com/apache/arrow-rs/issues/6996
I just read the [Photon](https://cs.stanford.edu/~matei/papers/2022/sigmod_photon.pdf) paper from 2022 and saw its vectorized hash table implementation. I also noticed that someone opened an issue in DataFusion (https://github.com/apache/datafusion/issues/7095) to implement it for group aggregation.

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

I would like to have a `HashSet`/`HashMap` that supports all hash table functionality but takes arrays as input.

Problem: DataFusion has `array_agg` with distinct support. If you look at the implementation, it just keeps adding values to a `HashSet`:

https://github.com/apache/datafusion/blob/6c9355d5be8b6045865fed67cb6d028b2dfc2e06/datafusion/functions-aggregate/src/array_agg.rs#L268-L281

This works, but it can be improved by computing all the hashes first and then probing in a tight loop.

**Describe the solution you'd like**

It would be helpful to have, there and in other places, a `HashSet`/`HashMap` that can:

1. Insert all values from an array
2. Check whether each value in an array exists and return a `BooleanArray` with the result
3. Get all the values that match each key in the array

**Describe alternatives you've considered**

Implement it everywhere that needs a `HashMap`/`HashSet`, or create an external crate.

**Additional context**

The way I see it, there will be a couple of implementations:

1. Primitive/boolean
2. Bytes
3. Generic, which will use `arrow-row`

I'm willing to create a PR for that. I see it as using the hashbrown raw API internally.
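To make the "hash everything first, then probe in a tight loop" idea concrete, here is a minimal sketch that uses only the standard library. It is not the proposed arrow-rs API: `BatchHashSet`, `insert_all`, and `contains_all` are hypothetical names, `&[i64]` stands in for a primitive Arrow array, `Vec<bool>` stands in for a `BooleanArray`, nulls are ignored, and a simple linear-probing table replaces the hashbrown raw API.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

/// Hypothetical batch-oriented set of `i64` values (illustration only).
pub struct BatchHashSet {
    /// Open-addressing table; each occupied slot caches (hash, value).
    /// The length is always a power of two.
    slots: Vec<Option<(u64, i64)>>,
    len: usize,
    state: RandomState,
    /// Scratch buffer of hashes, reused across batches.
    hashes: Vec<u64>,
}

impl BatchHashSet {
    pub fn new() -> Self {
        Self { slots: vec![None; 16], len: 0, state: RandomState::new(), hashes: Vec::new() }
    }

    pub fn len(&self) -> usize {
        self.len
    }

    /// Phase 1: hash the whole batch up front in one simple loop.
    fn hash_batch(&mut self, values: &[i64]) {
        let state = &self.state;
        self.hashes.clear();
        self.hashes.extend(values.iter().map(|v| {
            let mut hasher = state.build_hasher();
            v.hash(&mut hasher);
            hasher.finish()
        }));
    }

    /// Grow once per batch (load factor <= 50%) so the probe loop never resizes.
    fn reserve(&mut self, additional: usize) {
        let needed = (self.len + additional) * 2;
        if needed <= self.slots.len() {
            return;
        }
        let new_cap = needed.next_power_of_two();
        let old = std::mem::replace(&mut self.slots, vec![None; new_cap]);
        let mask = new_cap - 1;
        // Re-place entries using their cached hashes; nothing is rehashed.
        for (hash, value) in old.into_iter().flatten() {
            let mut idx = (hash as usize) & mask;
            while self.slots[idx].is_some() {
                idx = (idx + 1) & mask;
            }
            self.slots[idx] = Some((hash, value));
        }
    }

    /// Insert every value of the batch; duplicates are ignored.
    pub fn insert_all(&mut self, values: &[i64]) {
        self.hash_batch(values);
        self.reserve(values.len());
        let mask = self.slots.len() - 1;
        // Phase 2: tight probe loop over the precomputed hashes.
        for (&hash, &value) in self.hashes.iter().zip(values) {
            let mut idx = (hash as usize) & mask;
            loop {
                let slot = self.slots[idx];
                match slot {
                    None => {
                        self.slots[idx] = Some((hash, value));
                        self.len += 1;
                        break;
                    }
                    Some((h, v)) if h == hash && v == value => break,
                    Some(_) => idx = (idx + 1) & mask,
                }
            }
        }
    }

    /// Return one flag per input value: `true` if it is already in the set.
    pub fn contains_all(&mut self, values: &[i64]) -> Vec<bool> {
        self.hash_batch(values);
        let slots = &self.slots;
        let mask = slots.len() - 1;
        self.hashes
            .iter()
            .zip(values)
            .map(|(&hash, &value)| {
                let mut idx = (hash as usize) & mask;
                loop {
                    match slots[idx] {
                        None => return false,
                        Some((h, v)) if h == hash && v == value => return true,
                        Some(_) => idx = (idx + 1) & mask,
                    }
                }
            })
            .collect()
    }
}
```

Caching the hash next to each value lets a resize move entries without rehashing them, and reserving capacity once per batch keeps the probe loop free of growth checks. A real implementation would presumably also handle nulls and cover the bytes and `arrow-row` cases listed above.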
