Gotcha - If there is no penalty from RecordBatch<->StructArray then I am happy with the current approach - thanks!
For Spencer's question, the reason I use StructArray is that the kernel interfaces I am interested in use the Array interface instead of RecordBatch, so StructArray is easier than RecordBatch to interact with kernels.

On Tue, Jun 13, 2023 at 4:15 AM Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:

> I think your original code roundtripping through RecordBatch
> (`pa.RecordBatch.from_pandas(df).to_struct_array()`) is the best
> option at the moment. The RecordBatch<->StructArray part is a cheap
> (zero-copy) conversion, and by using RecordBatch.from_pandas, you can
> rely on all pandas<->arrow conversion logic that is implemented in
> pyarrow (and which keeps the data columnar, in contrast to
> `df.itertuples()` which converts the data into rows of python objects
> as intermediate).
>
> Given that the conversion through RecordBatch works nicely, I am not
> sure it is worth it to add new APIs to directly convert between
> StructArray and pandas DataFrames.
>
> Joris
>
> On Mon, 12 Jun 2023 at 20:32, Spencer Nelson <swnel...@uw.edu> wrote:
> >
> > Here's a one-liner that does it, but I expect it's moderately slower than
> > the RecordBatch version:
> >
> > pa.array(df.itertuples(index=False), type=pa.struct([pa.field(col,
> > pa.from_numpy_dtype(df.dtypes[col])) for col in df.columns]))
> >
> > Most of the complexity is in the 'type'. It's less scary than it looks, and
> > if you can afford multiple lines I think it's almost readable:
> >
> > fields = [pa.field(col, pa.from_numpy_dtype(df.dtypes[col])) for col in df.columns]
> > pa_type = pa.struct(fields)
> > pa.array(df.itertuples(index=False), type=pa_type)
> >
> > But this seems like a classic XY problem. What is the root issue you're
> > trying to solve? Why avoid RecordBatch?
> >
> > On Mon, Jun 12, 2023 at 11:14 AM Li Jin <ice.xell...@gmail.com> wrote:
> >
> > > Gentle bump.
> > >
> > > Not a big deal if I need to use the API above to do so, but bump in case
> > > someone has a better way.
> > >
> > > On Fri, Jun 9, 2023 at 4:34 PM Li Jin <ice.xell...@gmail.com> wrote:
> > >
> > > > Hello,
> > > >
> > > > I am looking for the best ways for converting Pandas DataFrame <-> Struct
> > > > Array.
> > > >
> > > > Currently I have:
> > > >
> > > > pa.RecordBatch.from_pandas(df).to_struct_array()
> > > >
> > > > and
> > > >
> > > > pa.RecordBatch.from_struct_array(s_array).to_pandas()
> > > >
> > > > - I wonder if there is a direct way to go from DataFrame <-> Struct Array
> > > > without going through RecordBatch?
> > > >
> > > > Thanks,
> > > > Li