Gotcha - If there is no penalty from RecordBatch<->StructArray then I am
happy with the current approach - thanks!

For Spencer's question, the reason I use StructArray is that the
kernel interfaces I am interested in use the Array interface instead of
RecordBatch, so StructArray is easier than RecordBatch to interact with
kernels.

On Tue, Jun 13, 2023 at 4:15 AM Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

> I think your original code roundtripping through RecordBatch
> (`pa.RecordBatch.from_pandas(df).to_struct_array()`) is the best
> option at the moment. The RecordBatch<->StructArray part is a cheap
> (zero-copy) conversion, and by using RecordBatch.from_pandas, you can
> rely on all pandas<->arrow conversion logic that is implemented in
> pyarrow (and which keeps the data columnar, in contrast to
> `df.itertuples()` which converts the data into rows of python objects
> as intermediate).
>
> Given that the conversion through RecordBatch works nicely, I am not
> sure it is worth it to add new APIs to directly convert between
> StructArray and pandas DataFrames.
>
> Joris
>
> On Mon, 12 Jun 2023 at 20:32, Spencer Nelson <swnel...@uw.edu> wrote:
> >
> > Here's a one-liner that does it, but I expect it's moderately slower than
> > the RecordBatch version:
> >
> > pa.array(df.itertuples(index=False), type=pa.struct([pa.field(col,
> > pa.from_numpy_dtype(df.dtypes[col])) for col in df.columns]))
> >
> > Most of the complexity is in the 'type'. It's less scary than it looks, and
> > if you can afford multiple lines I think it's almost readable:
> >
> > fields = [pa.field(col, pa.from_numpy_dtype(df.dtypes[col])) for col in df.columns]
> > pa_type = pa.struct(fields)
> > pa.array(df.itertuples(index=False), type=pa_type)
> >
> > But this seems like a classic XY problem. What is the root issue you're
> > trying to solve? Why avoid RecordBatch?
> >
> > On Mon, Jun 12, 2023 at 11:14 AM Li Jin <ice.xell...@gmail.com> wrote:
> >
> > > !-------------------------------------------------------------------|
> > >   This Message Is From an Untrusted Sender
> > >   You have not previously corresponded with this sender.
> > >   See https://itconnect.uw.edu/email-tags for additional
> > >   information.  Please contact the UW-IT Service Center,
> > >   h...@uw.edu 206.221.5000, for assistance.
> > > |-------------------------------------------------------------------!
> > >
> > > Gentle bump.
> > >
> > > > Not a big deal if I need to use the API above to do so, but bump in case
> > > > someone has a better way.
> > >
> > > On Fri, Jun 9, 2023 at 4:34 PM Li Jin <ice.xell...@gmail.com> wrote:
> > >
> > > > Hello,
> > > >
> > > > > I am looking for the best way to convert Pandas DataFrame <-> Struct
> > > > > Array.
> > > >
> > > > Currently I have:
> > > >
> > > > pa.RecordBatch.from_pandas(df).to_struct_array()
> > > >
> > > > and
> > > >
> > > > pa.RecordBatch.from_struct_array(s_array).to_pandas()
> > > >
> > > > > - I wonder if there is a direct way to go from DataFrame <-> Struct Array
> > > > > without going through RecordBatch?
> > > >
> > > > Thanks,
> > > > Li
> > > >
> > >
>
