rok commented on issue #12553: URL: https://github.com/apache/arrow/issues/12553#issuecomment-1132855049
Thanks for the background info @madhavajay!

> From a tabular perspective the data is essentially a row for each data subject (of whom we are protecting their privacy), like 1 large n-dim tensor, 2x similar ndim tensors (min and max providing bounds for the DP algorithms)

That sounds like a good fit for TensorArray, as proposed in #8510 for C++ or as implemented in Python [here](https://github.com/CODAIT/text-extensions-for-pandas/blob/master/text_extensions_for_pandas/array/tensor.py#L282).

> potentially with a lot of repeated data (so we made a custom datatype called lazyrepeatarray which removes duplicate dimensions)

Given that you have duplicated dimensions, would [CSF sparse tensors](https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/python/pyarrow/tests/test_sparse_tensor.py#L247-L271) help? Unlike CSR and CSC, CSF supports n-dimensional tensors. The main benefit would be reduced size for transport and storage, since you'd still need dense forms for computation. That said, some aggregate functions can be applied to sparse formats too.

> However if there is no ability to do computation with pyarrow on that data then we just need to take it back out anyway. Currently were doing things like aggregate sum operations etc but we are implementing the entire suite of ops required for DL so we need numpy style flexibility.

NumPy does indeed seem the safest option. However, as David mentioned, there is ongoing work on Python UDFs; [existing features are tested here](https://github.com/vibhatha/arrow/blob/master/python/pyarrow/tests/test_udf.py).
