Ah! I think I understand your use case now. I don't think it would be that
trivial to feed blocks of memory to SHA-256. Arrow Arrays consist of
multiple buffers (validity, data, offsets), and the number of buffers
depends on the data type. AFAIU there's no guarantee about how these
buffers are laid out in physical memory, so two identical tables A and B
could have different physical memory footprints.
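For illustration, here is a minimal sketch using the Python bindings
(assuming pyarrow is available; the same buffer structure exists in the
other implementations):

import pyarrow as pa

ints = pa.array([1, None, 3])          # int64: [validity, data]
strings = pa.array(["a", None, "bc"])  # string: [validity, offsets, data]

for arr in (ints, strings):
    # buffers() exposes the raw buffers backing the array; entries can
    # be None (e.g. no validity buffer when the array has no nulls)
    sizes = [None if b is None else b.size for b in arr.buffers()]
    print(arr.type, sizes)

Buffer addresses and capacities are allocator-dependent, so two
logically equal arrays need not share a physical layout.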

Having said that, how about comparing SHA-256 digests buffer-wise? (There
could be some issues if you have null values, since the bytes behind null
slots are undefined and may differ between otherwise identical arrays.)
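Something like this hypothetical sketch (pyarrow plus hashlib;
buffer_digests is just a name I made up, not an existing API):

import hashlib
import pyarrow as pa

def buffer_digests(table):
    # Hash every buffer of every column chunk with SHA-256.
    digests = []
    for column in table.columns:
        for chunk in column.chunks:
            for buf in chunk.buffers():
                h = hashlib.sha256()
                if buf is not None:
                    # pyarrow.Buffer supports the buffer protocol, so
                    # it can be fed to update() without copying
                    h.update(buf)
                digests.append(h.hexdigest())
    return digests

a = pa.table({"x": [1, 2, 3]})
b = pa.table({"x": [1, 2, 3]})
print(buffer_digests(a) == buffer_digests(b))  # True here

Note that differing chunk boundaries would also make logically equal
tables compare unequal, so you'd probably want to normalize chunking
first (e.g. Table.combine_chunks()), and null slots may still hold
arbitrary bytes.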


On Fri, Dec 3, 2021 at 7:11 PM Sergii Mikhtoniuk <[email protected]>
wrote:

> Thank you for your response, Niranda.
>
> Re. padding: what I had in mind was accessing individual blocks of
> memory (array chunks) and feeding them directly into SHA-256. For an
> equality check it would not matter that I'm hashing column-wise instead
> of row-wise. But if there can be gaps between individual values in a
> chunk, and if those gaps can be filled with uninitialized memory, I'd
> have no choice but to traverse and hash one value at a time.
> Equivalence would of course require a different approach.
>
> Using `set difference` would not work in my case as I want two parties to
> be able to check if their datasets are identical without needing to
> transfer or disclose data. Computing hashes and only comparing those seems
> like the only way.
>
> I wonder if there would be interest in generalizing ARROW-8991 to support
> multiple hashing algorithms.
>
> I was surprised to see how little hashing of structured data comes up
> on the web. Storing datasets in content-addressable systems like IPFS
> may in the future make Parquet a poor format choice, as it adds a lot
> of "entropy" to the data that is hard to control and can produce a
> different binary layout for the same dataset every time it's
> serialized.
>
> Best,
> - Sergii
>


-- 
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>
