Hey Sergii, AFAIK the latest Arrow compute dev work contains row-wise hash value calculation (for hash aggregations, group-by, etc.), but I don't think it is exposed as a Compute API yet. This has been discussed here [1].
I came across a similar requirement in our project [2], where we wanted to provide set operations such as union, intersection, set-difference, etc.

I didn't quite understand what you mean by 'padding between aligned values'. Are you referring to the null values in an array (i.e. whether null slots are zeroed out in the data buffers)? Since the Arrow format is columnar, you have to access the i'th value of each array separately and create a composite hash for each row (see the first sketch below); ATM you cannot get a 'row view' of a table.

As for equivalence, in Cylon we check `len(A.set_difference(B)) == 0`, where the A and B tables can have duplicated rows. set_difference simply creates a hash table from A and queries each row of B against it. The second sketch below shows an order-insensitive check along those lines.
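Here is a minimal pyarrow sketch of what I mean by a composite per-row hash. It goes through `.as_py()`, so buffer padding never enters the picture, and it uses hashlib rather than Python's built-in `hash`, since the latter is salted per process for strings. `row_digest` and the separator byte are just illustrative choices on my part, not an Arrow API:

```python
import hashlib

import pyarrow as pa

def row_digest(table: pa.Table):
    """Yield a stable digest for each row of `table`, independent of
    Arrow buffer layout and chunking."""
    # Collapse chunked columns so each column is one contiguous array.
    table = table.combine_chunks()
    columns = [table.column(i) for i in range(table.num_columns)]
    for i in range(table.num_rows):
        h = hashlib.sha256()
        for col in columns:
            # col[i] is an Arrow scalar; .as_py() converts it to a plain
            # Python value (nulls become None).
            h.update(repr(col[i].as_py()).encode())
            h.update(b"\x1f")  # field separator, so ("ab", "c") != ("a", "bc")
        yield h.hexdigest()

t = pa.table({"a": [1, 2, None], "b": ["x", "y", "z"]})
print(list(row_digest(t)))
```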
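And for your bonus question, comparing the multisets of row digests ignores row order while still accounting for duplicated rows. Again just a sketch, not what Cylon does internally; it assumes both tables have their columns in the same order, and materializing Counters won't scale the way a proper hash-table probe would:

```python
from collections import Counter

def tables_equivalent(a: pa.Table, b: pa.Table) -> bool:
    """True iff `a` and `b` contain the same multiset of rows,
    regardless of row order."""
    return Counter(row_digest(a)) == Counter(row_digest(b))
```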
[1] https://issues.apache.org/jira/browse/ARROW-8991
[2] https://github.com/cylondata/cylon

On Fri, Dec 3, 2021 at 5:38 PM Sergii Mikhtoniuk <[email protected]> wrote:

> Hi,
>
> I'm working on a data processing tool that guarantees reproducibility /
> determinism of operations. It's a frequent task for me to verify that one
> dataset (Table) is equivalent to another.
>
> I didn't find any functions related to computing hash sums in Arrow, but
> wondering if anyone knows existing implementations?
>
> If I were to implement a hashing over chunked arrays myself, does Arrow
> guarantee that any sort of padding between aligned values is zeroed-out, so
> that hashes are perfectly stable?
>
> Bonus question: Has anyone seen hashing algorithms for tabular data that
> can check for equivalence (rather than equality)? i.e. I consider datasets
> equivalent if they contain the same set of records, but not necessarily in
> the same order.
>
> Thank you!
> - Sergii

--
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>