Thank you for your response, Niranda.

Re: padding, what I had in mind was accessing individual blocks of memory
(array chunks) and feeding them directly into SHA-256. For an equality
check it would not matter that I'm hashing column-wise instead of row-wise.
But if there can be gaps between individual values in a chunk, and those
gaps can be filled with uninitialized memory, I'd have no choice but to
traverse and hash one value at a time (see the sketch below). Equivalence
would of course require a different approach.
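
To make the distinction concrete, here is a rough sketch in Python with
pyarrow and hashlib contrasting the per-value fallback with the buffer-level
hashing I had in mind. The helper names and the textual value encoding are
illustrative assumptions, not existing Arrow APIs:

import hashlib

import pyarrow as pa


def hash_array_per_value(arr: pa.Array) -> bytes:
    """Hash one value at a time, avoiding any padding or
    uninitialized bytes that may sit in the underlying buffers."""
    h = hashlib.sha256()
    for scalar in arr:
        if scalar.is_valid:
            # Illustrative only: a real implementation would pick a stable
            # binary encoding per type instead of the textual form.
            h.update(str(scalar.as_py()).encode("utf-8"))
        else:
            h.update(b"\x00<null>")  # keep nulls distinct from empty strings
    return h.digest()


def hash_array_buffers(arr: pa.Array) -> bytes:
    """The faster path I had in mind: feed the raw buffers straight into
    SHA-256. Only safe if the contents of the padding bytes are guaranteed."""
    h = hashlib.sha256()
    for buf in arr.buffers():
        if buf is not None:
            h.update(memoryview(buf))  # may cover padding past the logical length
    return h.digest()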

Using `set difference` would not work in my case, as I want two parties to
be able to check whether their datasets are identical without needing to
transfer or disclose the data. Computing hashes locally and comparing only
those seems like the only way.
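
A rough sketch of what I mean, assuming both sides agree on row order and on
a canonical encoding of values (the `table_digest` helper is hypothetical,
not an existing API); only the fixed-size digest ever crosses the wire:

import hashlib

import pyarrow as pa


def table_digest(table: pa.Table) -> str:
    """Fingerprint computed locally; only the hex digest is shared.
    Assumes an agreed row order and canonical value encoding on both sides."""
    h = hashlib.sha256()
    for name in table.column_names:
        h.update(name.encode("utf-8"))
        for scalar in table.column(name):
            h.update(repr(scalar.as_py()).encode("utf-8"))
    return h.hexdigest()


# Each party runs this on its own copy and exchanges only the digest:
# datasets_identical = table_digest(my_table) == their_hex_digest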

I wonder if there would be interest in generalizing ARROW-8991 to support
multiple hashing algorithms.

I was surprised to see how little discussion of hashing structured data
comes up on the web. Storing datasets in content-addressable systems like
IPFS may in the future make Parquet a poor format choice, as it adds a lot
of "entropy" to the data that is hard to control and can produce a
different binary layout for the same dataset every time it is serialized.

Best,
- Sergii
