drin commented on PR #39836:
URL: https://github.com/apache/arrow/pull/39836#issuecomment-2207388561
@judahrand, I wanted to verify something with you since I assume you have an
immediate use case in mind for #32504.
The naive implementation of hashing for a "shallow" nested type (e.g.
`list<int8>`) is relatively straightforward and the code available in this PR
should be able to handle that already.
What held up this PR was the logic for "deep" nested types (e.g.
`list<list<int8>>`). To move this forward, I am considering the following
simple scenario and would like your feedback on whether it handles that case
reasonably (this is why it would help to know which use cases you have in mind):
```text
# A "shallow" nested type that we should be able to reproduce (I think)
Reference Test data [length: 3]:
[
[ 1, 1, 2, 3, 4 ],
[ 2, 2, 4, 6, 8 ],
[ 3, 3, 6, 9, 12 ]
]
# A "deep" nested type that we may receive as input
Sample Test data (list<item: list<item: int64>>) [length: 3]:
[
[ [ 1 ], [ 1, 2, 3, 4 ] ],
[ [ 2 ], [ 2, 4, 6, 8 ] ],
[ [ 3 ], [ 3, 6, 9, 12 ] ]
]
```
There is ambiguity in how a "deep" nested type should be hashed (perhaps the
nesting structure should affect the hash of a "row" of values), but what makes
the most sense to me is a value-based hash, where each row of the "deep" sample
hashes the same as the corresponding row of the "shallow" reference above. If a
different approach is preferred, it can be added later as an option, or
provided as a UDF instead of the existing hashing functions.
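To make the value-based interpretation concrete, here is a minimal Python sketch. It is not the PR's implementation: `flatten_row` and `row_hash` are hypothetical helpers, and Python's built-in `hash` over a tuple is only a stand-in for the real hashing kernels. Under this interpretation, each "deep" row should hash the same as the matching "shallow" row.

```python
def flatten_row(row):
    """Recursively collect the leaf values of one (possibly nested) row."""
    leaves = []
    for item in row:
        if isinstance(item, list):
            leaves.extend(flatten_row(item))
        else:
            leaves.append(item)
    return leaves


def row_hash(row):
    """Value-based hash: depends only on the leaf values, not on the nesting."""
    # hash() of a tuple of ints is a stand-in for the real hash kernel.
    return hash(tuple(flatten_row(row)))


# The "shallow" reference data and the "deep" sample data from above.
shallow = [[1, 1, 2, 3, 4], [2, 2, 4, 6, 8], [3, 3, 6, 9, 12]]
deep = [[[1], [1, 2, 3, 4]], [[2], [2, 4, 6, 8]], [[3], [3, 6, 9, 12]]]

shallow_hashes = [row_hash(r) for r in shallow]
deep_hashes = [row_hash(r) for r in deep]
```

One consequence of this choice is that `[[1], [2]]` and `[[1, 2]]` hash identically. A structure-aware alternative would have to mix list boundaries into the hash.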
I have prototyped a naive version of the flattening logic in a separate
repo: [recipe_convert.cpp#L150-L179][src-convertlayout]. For context, this type
of flattening would be used in the [FastHashScalar::Exec][src-keycolumn]
function as a preprocessing step before calling the
[Hashing32::HashMultiColumn][src-hashbatch] function.
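As a rough illustration of what such a flattening step could look like in a variable-size list layout, the sketch below composes the outer offsets with the inner offsets so that each row points directly into the leaf values. It is an assumption-laden model rather than the linked prototype: plain Python lists stand in for Arrow buffers, `flatten_one_level` and `rows` are hypothetical names, and nulls and sliced arrays are ignored.

```python
def flatten_one_level(outer_offsets, inner_offsets):
    """Map each outer offset through the inner offsets.

    The result indexes straight into the leaf values, turning
    list<list<T>> into list<T> without copying any values.
    """
    return [inner_offsets[o] for o in outer_offsets]


def rows(offsets, values):
    """Materialize each row as a Python list (for inspection only)."""
    return [values[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]


# The "deep" sample: 3 rows, each holding 2 inner lists.
outer_offsets = [0, 2, 4, 6]
# 6 inner lists with lengths 1, 4, 1, 4, 1, 4.
inner_offsets = [0, 1, 5, 6, 10, 11, 15]
values = [1, 1, 2, 3, 4, 2, 2, 4, 6, 8, 3, 3, 6, 9, 12]

flat_offsets = flatten_one_level(outer_offsets, inner_offsets)
flat_rows = rows(flat_offsets, values)
```

Because only the offsets change, the flattened rows match the "shallow" reference data while the values buffer is reused as-is. Deeper nesting could be handled by applying the same composition once per level.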
Thanks for any feedback you can provide! Particularly on expected behavior
for nested array types.
<!-- resources -->
[src-convertlayout]:
https://github.com/drin/cookbooks/blob/add-recipe-convertlayout/arrow/convert-layout/recipe_convert.cpp#L150-L179
[src-keycolumn]:
https://github.com/drin/arrow/blob/ARROW-8991-newfn-scalar-hash-fresh/cpp/src/arrow/compute/kernels/scalar_hash.cc#L92-L95
[src-hashbatch]:
https://github.com/drin/arrow/blob/ARROW-8991-newfn-scalar-hash-fresh/cpp/src/arrow/compute/key_hash_internal.h#L48-L49