AFAIK hashing in this context needs to be done on a slot-by-slot basis, just like array equality, since any item in a null slot has an undetermined value in the buffer.
E.g. the layout of a primitive array [1, 2, None, 4] is two buffer regions:

* [1, 2, ?, 4] and
* [true, true, false, true] (the validity bitmap)

The question mark can be any number. Hashing needs to skip the "?", which is
achieved by iterating over [(1, true), (2, true), (?, false), (4, true)] and
using a fixed hash for the false case (representing the None); a minimal
sketch of this is appended after the quoted thread below.

Best,
Jorge

On Sat, Dec 4, 2021 at 6:26 AM Weston Pace <[email protected]> wrote:

> One possibility could be to calculate the hash of the logical data
> when serializing and then put the hash in the metadata.
>
> > I'm not even sure this can actually happen ... After all, buffers should
> > only carry primitive types (not some complex structs) and they all seem
> > to be 16/32/64/128 bits long and should produce "gapless" buffers.
>
> Arrow buffers are aligned on 8- or 64-byte boundaries, and there is a
> preference for aligning on 64-byte boundaries. So I think gaps/padding
> are a real possibility.
>
> On Fri, Dec 3, 2021 at 3:05 PM Sergii Mikhtoniuk <[email protected]>
> wrote:
> >
> > Apologies for the confusion; I was using the wrong terminology. When I
> > was talking about "array chunks" I meant Buffers - contiguous slices of
> > memory with nullability, offsets, or value data.
> >
> > If Arrow is not explicit about Buffers having to be memset to zero
> > before use, then whenever the size of the value is not a multiple of its
> > alignment we would have garbage in between, messing up the stability of
> > a buffer-wise hash.
> >
> > I'm not even sure this can actually happen ... After all, buffers should
> > only carry primitive types (not some complex structs) and they all seem
> > to be 16/32/64/128 bits long and should produce "gapless" buffers.
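[Editor's note: the following is a minimal, self-contained Rust sketch of the
slot-wise hashing described above. It is not taken from any Arrow library;
the hash_array function and the bool-per-slot validity slice are assumptions
made purely for illustration of the technique.]

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash a primitive array slot by slot, ignoring whatever bytes happen to sit
// behind null slots. `values` stands in for the Arrow value buffer and
// `validity` for the validity bitmap, expanded here to one bool per slot.
fn hash_array(values: &[i64], validity: &[bool]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for (value, is_valid) in values.iter().zip(validity.iter()) {
        if *is_valid {
            value.hash(&mut hasher);
        } else {
            // Fixed sentinel for None, so the undefined value in the buffer
            // never influences the result.
            u64::MAX.hash(&mut hasher);
        }
    }
    hasher.finish()
}

fn main() {
    // [1, 2, None, 4]: the byte pattern behind the null slot can be anything,
    // yet the hash stays the same because that slot is skipped.
    let a = hash_array(&[1, 2, 999, 4], &[true, true, false, true]);
    let b = hash_array(&[1, 2, -7, 4], &[true, true, false, true]);
    assert_eq!(a, b);
}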
