AFAIK hashing in this context needs to be done on a slot-by-slot basis, just like array equality, since any item in a null slot has an undetermined value in the buffer.
E.g. the layout of a primitive array [1, 2, None, 4] is two buffer regions:

* [1, 2, ?, 4] and
* [true, true, false, true] (the validity bitmap)

The question mark can be any number. Hashing needs to skip the "?", which is
achieved by iterating over [(1, true), (2, true), (?, false), (4, true)] and
using a fixed hash for the false case (representing the None); a minimal
sketch of this is appended after the quoted thread below.

Best,
Jorge

On Sat, Dec 4, 2021 at 6:26 AM Weston Pace <[email protected]> wrote:

> One possibility could be to calculate the hash of the logical data
> when serializing and then put the hash in the metadata.
>
> > I'm not even sure this can actually happen ... After all, buffers should
> > only carry primitive types (not some complex structs) and they all seem
> > to be 16/32/64/128 bits long and should produce "gapless" buffers.
>
> Arrow buffers are aligned on 8- or 64-byte boundaries, and there is a
> preference for aligning on 64-byte boundaries. So I think gaps/padding
> are a real possibility.
>
> On Fri, Dec 3, 2021 at 3:05 PM Sergii Mikhtoniuk <[email protected]>
> wrote:
> >
> > Apologies for the confusion; I was using the wrong terminology. When I
> > was talking about "array chunks" I meant Buffers - contiguous slices of
> > memory with nullability, offsets, or value data.
> >
> > If Arrow is not explicit about Buffers having to be memset to zero
> > before use, then whenever the size of the value is not a multiple of its
> > alignment we would have garbage in between, messing up the stability of
> > a buffer-wise hash.
> >
> > I'm not even sure this can actually happen ... After all, buffers should
> > only carry primitive types (not some complex structs) and they all seem
> > to be 16/32/64/128 bits long and should produce "gapless" buffers.
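[Editor's note: the following is a minimal, self-contained Rust sketch of the
slot-wise hashing described above. It is not taken from any Arrow library;
the hash_array function and the bool-per-slot validity slice are assumptions
made purely for illustration of the technique.]

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash a primitive array slot by slot, ignoring whatever bytes happen to sit
// behind null slots. `values` stands in for the Arrow value buffer and
// `validity` for the validity bitmap, expanded here to one bool per slot.
fn hash_array(values: &[i64], validity: &[bool]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for (value, is_valid) in values.iter().zip(validity.iter()) {
        if *is_valid {
            value.hash(&mut hasher);
        } else {
            // Fixed sentinel for None, so the undefined value in the buffer
            // never influences the result.
            u64::MAX.hash(&mut hasher);
        }
    }
    hasher.finish()
}

fn main() {
    // [1, 2, None, 4]: the byte pattern behind the null slot can be anything,
    // yet the hash stays the same because that slot is skipped.
    let a = hash_array(&[1, 2, 999, 4], &[true, true, false, true]);
    let b = hash_array(&[1, 2, -7, 4], &[true, true, false, true]);
    assert_eq!(a, b);
}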
