[GitHub] [arrow] rok commented on issue #12553: Support for Compute Functions on Nested Arrays

GitBox Fri, 20 May 2022 05:39:47 -0700


rok commented on issue #12553:
URL: https://github.com/apache/arrow/issues/12553#issuecomment-1132855049

Thanks for the background info, @madhavajay!

> From a tabular perspective the data is essentially a row for each data
subject (of whom we are protecting their privacy), like 1 large n-dim tensor,
2x similar ndim tensors (min and max providing bounds for the DP algorithms)

That sounds like a good fit for TensorArray, as proposed in #8510 for C++ or
as implemented in Python
[here](https://github.com/CODAIT/text-extensions-for-pandas/blob/master/text_extensions_for_pandas/array/tensor.py#L282).

> potentially with a lot of repeated data (so we made a custom datatype
called lazyrepeatarray which removes duplicate dimensions)

Given the duplicated dimensions, would [CSF sparse
tensors](https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/python/pyarrow/tests/test_sparse_tensor.py#L247-L271)
help? Unlike CSR and CSC, CSF supports n-dimensional tensors. The main benefit
would be reduced size for transport and storage, as you'd still need dense
forms for most computation, although some aggregate functions can be applied
to sparse formats too.

> However if there is no ability to do computation with pyarrow on that data
then we just need to take it back out anyway. Currently we're doing things like
aggregate sum operations etc but we are implementing the entire suite of ops
required for DL so we need numpy style flexibility.

NumPy does indeed seem the safest option. However, as David mentioned, there
is ongoing work on Python UDFs; [existing features are tested
here](https://github.com/vibhatha/arrow/blob/master/python/pyarrow/tests/test_udf.py).

