drin commented on PR #39836:
URL: https://github.com/apache/arrow/pull/39836#issuecomment-2207388561

   @judahrand, I wanted to verify something with you, since I assume you have an 
immediate use case in mind for #32504.
   
   The naive implementation of hashing for a "shallow" nested type (e.g. 
`list<int8>`) is relatively straightforward and the code available in this PR 
should be able to handle that already.
   
   What held up this PR was the logic for "deep" nested types (e.g. 
`list<list<int8>>`). To make progress on that, I am considering the 
following simple scenario, and I would like your feedback on whether it handles 
the problem reasonably (this is why concrete use cases would be helpful):
   ```python
   # A "shallow" nested type that we should be able to reproduce (I think)
   Reference Test data [length: 3]:
   [
     [ 1, 1, 2, 3,  4 ],
     [ 2, 2, 4, 6,  8 ],
     [ 3, 3, 6, 9, 12 ]
   ]
   
   # A "deep" nested type that we may receive as input
   Sample Test data (list<item: list<item: int64>>) [length: 3]:
   [
     [ [ 1 ], [ 1, 2, 3,  4 ] ],
     [ [ 2 ], [ 2, 4, 6,  8 ] ],
     [ [ 3 ], [ 3, 6, 9, 12 ] ]
   ]
   ```
   
   There is ambiguity in how a "deep" nested type should be hashed (perhaps the 
nested structure itself should contribute to the hash of a "row" of values), but 
what makes the most sense to me is a value-based hash: each "deep" row above 
hashes the same as the "shallow" row holding the same values in the same order. 
If a different approach is preferred, it can either be added later as an option 
or be provided as a UDF in place of the existing hashing functions.
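
To make the value-based idea concrete, here is a minimal pure-Python sketch. It is not Arrow code: `leaf_values` and `value_based_hash` are hypothetical helpers, and `zlib.crc32` over little-endian int64 bytes is only a stand-in for whatever hash the kernel would actually use. It shows that, when nesting boundaries are ignored, each "deep" row hashes the same as its "shallow" counterpart.

```python
import struct
import zlib


def leaf_values(row):
    """Yield the leaf values of an arbitrarily nested list, left to right."""
    for item in row:
        if isinstance(item, list):
            yield from leaf_values(item)
        else:
            yield item


def value_based_hash(row):
    """Hash a row from its leaf values only, ignoring nesting boundaries.

    Assumption: crc32 over packed int64 values stands in for the real hash.
    """
    leaves = list(leaf_values(row))
    payload = struct.pack(f"<{len(leaves)}q", *leaves)
    return zlib.crc32(payload)


# The reference ("shallow") and sample ("deep") data from the example above.
shallow = [[1, 1, 2, 3, 4], [2, 2, 4, 6, 8], [3, 3, 6, 9, 12]]
deep = [[[1], [1, 2, 3, 4]], [[2], [2, 4, 6, 8]], [[3], [3, 6, 9, 12]]]

shallow_hashes = [value_based_hash(row) for row in shallow]
deep_hashes = [value_based_hash(row) for row in deep]
assert shallow_hashes == deep_hashes
```

One consequence of this choice is that rows such as `[[1], [1, 2]]` and `[[1, 1], [2]]` hash identically, which is exactly the ambiguity described above.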
   
   I have prototyped a naive version of the flattening logic in a separate 
repo: [recipe_convert.cpp#L150-L179][src-convertlayout]. For context, this type 
of flattening would be used in the [FastHashScalar::Exec][src-keycolumn] 
function as a preprocessing step before calling the 
[Hashing32::HashMultiColumn][src-hashbatch] function.
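
For readers unfamiliar with the layout, one way this kind of flattening can be expressed on Arrow's offset-based list representation is by composing the outer offsets with the inner offsets. The sketch below uses plain Python lists rather than Arrow buffers, assumes no nulls and an unsliced array, and is not necessarily what the linked prototype does; `flatten_one_level` and `rows` are hypothetical names.

```python
def flatten_one_level(outer_offsets, inner_offsets):
    """Compose the offsets of list<list<T>> into the offsets of list<T>.

    Each outer offset indexes into the inner offsets, which in turn index
    into the values, so looking one up through the other skips a level.
    Assumes no nulls and offsets that start at 0.
    """
    return [inner_offsets[o] for o in outer_offsets]


def rows(offsets, values):
    """Materialize list rows from an offsets array and a flat values array."""
    return [values[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]


# The list<list<int64>> sample data from above, written in offset form.
outer_offsets = [0, 2, 4, 6]                # 3 rows, 2 inner lists each
inner_offsets = [0, 1, 5, 6, 10, 11, 15]    # 6 inner lists
values = [1, 1, 2, 3, 4, 2, 2, 4, 6, 8, 3, 3, 6, 9, 12]

flat_offsets = flatten_one_level(outer_offsets, inner_offsets)
flat_rows = rows(flat_offsets, values)
```

Because only the offsets are rewritten, the values are untouched, and the flattened rows match the "shallow" reference data.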
   
   Thanks for any feedback you can provide, particularly on the expected 
behavior for nested array types!
   
   <!-- resources -->
   [src-convertlayout]: 
https://github.com/drin/cookbooks/blob/add-recipe-convertlayout/arrow/convert-layout/recipe_convert.cpp#L150-L179
   [src-keycolumn]: 
https://github.com/drin/arrow/blob/ARROW-8991-newfn-scalar-hash-fresh/cpp/src/arrow/compute/kernels/scalar_hash.cc#L92-L95
   [src-hashbatch]: 
https://github.com/drin/arrow/blob/ARROW-8991-newfn-scalar-hash-fresh/cpp/src/arrow/compute/key_hash_internal.h#L48-L49


