Re: Support of more manipulation for Record Batch

Wes McKinney Thu, 02 Apr 2020 08:46:52 -0700

hi Chengxin,

Yes, if you look at the JIRA tracker and look for past discussions on
the mailing list, there are plans to develop comprehensive data
manipulation and query processing capabilities in this project for use
in Python, R, and any other language that binds to C++, including
C/GLib and Ruby.

The way that this functionality is exposed in the pyarrow API will
almost certainly be different than pandas, though. Rather than have
objects with long lists of instance methods, we would opt instead for
computational functions that "act" on the data structures, producing
one or more data structures as output, more similar to tools like
dplyr (an R library). Developers are welcome to create pandas-like
convenience layers, of course, should they so choose.

References:

* C++ datasets API project
https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=sharing
* C++ query engine project
https://docs.google.com/document/d/10RoUZmiMQRi_J1FcPeVAUAMJ6d_ZuiEbaM2Y33sNPu4/edit?usp=sharing
* C++ data frame API project
https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit?usp=sharing

Building these things take time, especially considering the scope of
maintenance involved with keeping this project running. If anyone
reading is interested in contributing time or money to this effort I'd
be happy to speak with you offline about it. If you would like to
contribute we would be glad to have you aboard.

Thanks
Wes

On Thu, Apr 2, 2020 at 6:50 AM Chengxin Ma <c...@protonmail.ch.invalid> wrote:
>
> Hi all,
>
> I am working on a distributed sorting program which runs on multiple 
> computation nodes.
>
> In this sorting program, data is represented as pandas DataFrames and key 
> operations are groupby, concat, and sort_values. For shuffling data among the 
> computation nodes, the DataFrames are converted to Arrow Record Batches and 
> communicated via Arrow Flight.
>
> What I’ve noticed is that much time was spent on the conversion between 
> DataFrame and Record Batch.
>
> The [zero-copy 
> feature](https://arrow.apache.org/docs/python/pandas.html#memory-usage-and-zero-copy)
>  unfortunately cannot be applied to my case, since the DataFrames contain 
> strings as well.
>
> I wanted to try replacing DataFrames with Record Batches, so there would be 
> no need of conversion. However, there seems to be no direct way to do groupby 
> and sort_values on Record Batches, according to [the 
> documentation](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html)
>
> Is there a plan to add such methods to the API of Record Batch in the future?
>
> Kind Regards
>
> Chengxin
>
> Sent with [ProtonMail](https://protonmail.com) Secure Email.

Re: Support of more manipulation for Record Batch

Reply via email to