hi Chengxin, Yes, if you look at the JIRA tracker and look for past discussions on the mailing list, there are plans to develop comprehensive data manipulation and query processing capabilities in this project for use in Python, R, and any other language that binds to C++, including C/GLib and Ruby.
The way that this functionality is exposed in the pyarrow API will almost certainly be different than pandas, though. Rather than have objects with long lists of instance methods, we would opt instead for computational functions that "act" on the data structures, producing one or more data structures as output, more similar to tools like dplyr (an R library). Developers are welcome to create pandas-like convenience layers, of course, should they so choose. References: * C++ datasets API project https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=sharing * C++ query engine project https://docs.google.com/document/d/10RoUZmiMQRi_J1FcPeVAUAMJ6d_ZuiEbaM2Y33sNPu4/edit?usp=sharing * C++ data frame API project https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit?usp=sharing Building these things take time, especially considering the scope of maintenance involved with keeping this project running. If anyone reading is interested in contributing time or money to this effort I'd be happy to speak with you offline about it. If you would like to contribute we would be glad to have you aboard. Thanks Wes On Thu, Apr 2, 2020 at 6:50 AM Chengxin Ma <c...@protonmail.ch.invalid> wrote: > > Hi all, > > I am working on a distributed sorting program which runs on multiple > computation nodes. > > In this sorting program, data is represented as pandas DataFrames and key > operations are groupby, concat, and sort_values. For shuffling data among the > computation nodes, the DataFrames are converted to Arrow Record Batches and > communicated via Arrow Flight. > > What I’ve noticed is that much time was spent on the conversion between > DataFrame and Record Batch. > > The [zero-copy > feature](https://arrow.apache.org/docs/python/pandas.html#memory-usage-and-zero-copy) > unfortunately cannot be applied to my case, since the DataFrames contain > strings as well. > > I wanted to try replacing DataFrames with Record Batches, so there would be > no need of conversion. However, there seems to be no direct way to do groupby > and sort_values on Record Batches, according to [the > documentation](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html) > > Is there a plan to add such methods to the API of Record Batch in the future? > > Kind Regards > > Chengxin > > Sent with [ProtonMail](https://protonmail.com) Secure Email.