In case someone stumbles upon this thread in future - I decided to continue using logical hashing for now.
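
For readers new to the idea, here is a minimal sketch of what "logical hashing" means: hash the values a column holds rather than the bytes of its in-memory or on-disk layout, so the digest does not change when the same data is chunked or sliced differently. This is not the `arrow-digest` algorithm (see the crate's README for that). The names `Fnv1a` and `logical_digest` are hypothetical, the column is modelled as plain `Option<i64>` slices instead of Arrow arrays, and FNV-1a stands in for a cryptographic hash only to keep the example dependency-free.

```rust
/// 64-bit FNV-1a. Used only so the sketch needs no external crates;
/// a real content-addressing scheme would use a cryptographic hash.
struct Fnv1a(u64);

impl Fnv1a {
    fn new() -> Self {
        Fnv1a(0xcbf29ce484222325) // FNV-1a 64-bit offset basis
    }

    fn update(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= b as u64;
            self.0 = self.0.wrapping_mul(0x100000001b3); // FNV 64-bit prime
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

/// Hypothetical helper: digest of a nullable int64 column supplied as
/// any number of chunks. Only the logical values are fed to the hasher,
/// so chunk boundaries do not influence the result.
fn logical_digest(chunks: &[&[Option<i64>]]) -> u64 {
    let mut h = Fnv1a::new();
    h.update(b"int64"); // type tag, so equal bytes of another type differ
    for chunk in chunks {
        for v in chunk.iter() {
            match v {
                // Validity marker followed by a fixed little-endian encoding.
                Some(x) => {
                    h.update(&[1]);
                    h.update(&x.to_le_bytes());
                }
                // Nulls contribute only the marker, never leftover buffer bytes.
                None => h.update(&[0]),
            }
        }
    }
    h.finish()
}

fn main() {
    let column: [Option<i64>; 4] = [Some(1), None, Some(3), Some(4)];

    // Same data, different physical chunking.
    let one_chunk = logical_digest(&[&column[..]]);
    let two_chunks = logical_digest(&[&column[..1], &column[1..]]);

    assert_eq!(one_chunk, two_chunks);
    println!("digest: {:016x}", one_chunk);
}
```

Hashing the raw buffers instead would tie the digest to details such as chunk sizes, slice offsets, and whatever bytes sit under null slots, which is the kind of encoding-related difference a logical hash is meant to hide.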
I've created a new Rust crate, `arrow-digest` [1], that implements stable hashing for Arrow arrays and record batches and tries hard to hide encoding-related differences. The crate's README describes the hashing algorithm, in case someone finds it useful and wants to implement it in another language.

I'll continue to expand the set of supported types as I integrate it into the decentralized data processing tool I'm working on [2]. Any feedback on the algorithm would be much appreciated!

In the long term, I'm not sure logical hashing is the best way forward. A subset of Parquet that makes some efficiency sacrifices just to make the file layout deterministic might be a better choice for content-addressability.

[1] https://github.com/sergiimk/arrow-digest
[2] https://github.com/kamu-data/kamu-cli

Cheers,
Sergii
