Re: Hashing and equivalence of datasets

Sergii Mikhtoniuk Wed, 08 Dec 2021 19:47:05 -0800

In case someone stumbles upon this thread in the future - I decided to
continue using logical hashing for now.


I've created a new Rust crate `arrow-digest` [1] that implements stable
hashing for Arrow arrays and record batches and tries hard to hide
encoding-related differences. The crate's README describes the hashing
algorithm, in case someone finds it useful and wants to implement it in
another language.
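
To illustrate what "hiding encoding-related differences" can mean, here is
a minimal sketch of the general idea. It does not use `arrow-digest` and
does not reproduce the algorithm from its README. The `Int64Chunk` type and
the `logical_digest` function are hypothetical names made up for this
example, and FNV-1a is only a dependency-free stand-in that is not
collision resistant; real content addressing would need a cryptographic
hash. The digest is computed over the logical values only, so chunk
boundaries and whatever bytes sit in null slots do not affect it.

```rust
/// One chunk of a nullable 64-bit integer column: a values buffer plus a
/// validity mask, loosely modeled on Arrow's layout (hypothetical type).
pub struct Int64Chunk {
    pub values: Vec<i64>,
    pub valid: Vec<bool>,
}

// Standard 64-bit FNV-1a parameters.
const FNV_OFFSET: u64 = 0xcbf29ce484222325;
const FNV_PRIME: u64 = 0x100000001b3;

/// Feeds `bytes` into a running FNV-1a state and returns the new state.
fn fnv1a_update(state: u64, bytes: &[u8]) -> u64 {
    bytes
        .iter()
        .fold(state, |h, b| (h ^ u64::from(*b)).wrapping_mul(FNV_PRIME))
}

/// Digest of the logical content of a column stored as any number of chunks.
pub fn logical_digest(chunks: &[Int64Chunk]) -> u64 {
    // Start with a type tag so the column type is part of the digest.
    let mut state = fnv1a_update(FNV_OFFSET, b"int64");
    for chunk in chunks {
        for (value, is_valid) in chunk.values.iter().zip(&chunk.valid) {
            if *is_valid {
                // Valid slot: marker byte, then the value in a fixed byte order.
                state = fnv1a_update(state, &[1]);
                state = fnv1a_update(state, &value.to_le_bytes());
            } else {
                // Null slot: marker byte only; the stored bytes are ignored.
                state = fnv1a_update(state, &[0]);
            }
        }
    }
    state
}
```

With this sketch, the column [1, null, 3] yields the same digest whether it
is stored as one chunk or split into two, and regardless of the value
stored under the null slot, while changing any valid value changes the
digest.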

I'll continue to expand the set of supported types as I integrate it
into the decentralized data processing tool I'm working on [2]. Any
feedback on the algorithm would be much appreciated!

In the long term, I'm not sure logical hashing is the best way forward - a
subset of Parquet that makes some efficiency sacrifices just to make file
layout deterministic might be a better choice for content-addressability.

[1] https://github.com/sergiimk/arrow-digest
[2] https://github.com/kamu-data/kamu-cli

Cheers,
Sergii
