Re: [Python] Efficient numpy.recarray to pyarrow.StructArray conversion

Hagai Har-Gil Sun, 21 Mar 2021 23:37:14 -0700

Hmm, it seems my mental model was off: what I actually want is an array of 
structs, not a struct of arrays. After re-reading the (Python) docs, I'd argue 
they don't make it clear that a StructArray is laid out as a struct of arrays 
(SoA), and the way the object behaves when indexed reinforced my wrong 
impression. I might put together a docs PR to address this, if you think it's 
worth mentioning.


Thanks,
Hagai.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Sunday, March 21, 2021 3:51 PM, Antoine Pitrou <[email protected]> wrote:

> On Sun, 21 Mar 2021 12:33:09 +0000
> Hagai Har-Gil [email protected] wrote:
>
> > After some more digging I did arrive at something which seems more 
> > efficient than what I had:
> > struct_schema = pa.struct([('field0', pa.int32()), ('field1', pa.int8())])
> > nparray = x = np.array([(0, 10), (1, 20)], dtype=[('field0', '<i4'), 
> > ('field1', '<i1')])
> > struct_array = pa.array(nparray, type=struct_schema)
> > This looks easy, although I'm not sure how much copying is done down below.
>
> The data is definitely copied under the hood, since this is
> converting from an "array of structs" layout (the Numpy array) to a
> "struct of arrays" layout (the Arrow array).
>
> This is a conceptual constraint. I don't think it is possible to
> create a Numpy struct array that would use separate data areas for the
> struct fields.
>
> Regards
>
> Antoine.
>
> > I now have an issue with the Rust implementation, since I'm not sure how 
> > to access or iterate over the rows of the resulting StructArray, which 
> > was trivial in Python.
> > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > On Sunday, March 21, 2021 2:22 PM, Hagai Har-Gil [email protected] 
> > wrote:
> >
> > > After some more digging I did arrive at something which seems more 
> > > efficient than what I had:
> > > struct_schema = pa.struct([('field0', pa.int32()), ('field1', pa.int8())])
> > > nparray = x = np.array([(0, 10), (1, 20)], dtype=[('field0', '<i4'), 
> > > ('field1', '<i1')])
> > > struct_array = pa.array(nparray, type=struct_schema)
> > > This looks easy, although I'm not sure how much copying is done down 
> > > below.
> > > I now have an issue with the Rust implementation, since I'm not sure 
> > > how to access or iterate over the rows of the resulting StructArray.
> > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > On Sunday, March 21, 2021 10:52 AM, Hagai Har-Gil 
> > > [email protected] wrote:
> > >
> > > > Hi,
> > > > I'm trying to efficiently convert incoming numpy.recarrays to 
> > > > pyarrow.StructArray, and I'm unsure how to do so with the least 
> > > > amount of copying.
> > > > My use case involves real time data processing of numpy.recarrays in 
> > > > Rust. I'm happily using the IPC protocol to transfer data to Rust's 
> > > > arrow implementation which will do the heavy lifting. I'll need to 
> > > > iterate on the recarray-turned-StructArray line-by-line, each time 
> > > > yielding all fields of a specific row, so the StructArray format is 
> > > > quite fitting. However, doing the actual conversion in an efficient 
> > > > manner seems harder than expected. The fields (=individual arrays) of a 
> > > > numpy.recarray aren't stored in a contiguous manner, so any 
> > > > numpy.recarray -> pyarrow.Array conversion first has to copy the data 
> > > > to standard pyarrow.Array buffers, and then re-construct the 
> > > > StructArray structure by interleaving the arrays. I was unable to find 
> > > > in the docs or in previous discussions here a better approach for this 
> > > > type of pre-processing step.
> > > > Since I'm using IPC I'll eventually need to have the 
> > > > pyarrow.StructArray wrapped in a pyarrow.RecordBatch if that makes any 
> > > > difference.
> > > > Thanks in advance
