On Mon, 22 Mar 2021 06:36:57 +0000
Hagai Har-Gil <[email protected]> wrote:
> Hmm, it seems that my mental model was off - I'm indeed interested in an
> array of structs and not in a struct of arrays. After re-reading the (Python)
> docs I'd argue that they're not clear that a StructArray is indeed a SoA, and
> the behavior of the object with respect to indexing further strengthens this
> notion I had. I might try to put together a docs PR to address this, if you
> think it's worth mentioning.
I don't think it makes sense to mention it specifically in the Python
docs, since it's a characteristic of the Arrow format and applies to
all implementations:
https://arrow.apache.org/docs/format/Columnar.html#struct-layout

Regards

Antoine.

>
> Thanks,
> Hagai.
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Sunday, March 21, 2021 3:51 PM, Antoine Pitrou <[email protected]> wrote:
>
> > On Sun, 21 Mar 2021 12:33:09 +0000
> > Hagai Har-Gil [email protected] wrote:
> >
> > > After some more digging I did arrive at something which seems more
> > > efficient than what I had:
> > > struct_schema = pa.struct([('field0', pa.int32()), ('field1', pa.int8())])
> > > nparray = x = np.array([(0, 10), (1, 20)], dtype=[('field0', '<i4'),
> > > ('field1', '<i1')])
> > > struct_array = pa.array(nparray, type=struct_schema)
> > > This looks easy, although I'm not sure how much copying is done down
> > > below.
> >
> > The data is definitely copied under the hood, since this is
> > converting from an "array of structs" layout (the Numpy array) to a
> > "struct of arrays" layout (the Arrow array).
> >
> > This is a conceptual constraint. I don't think it is possible to
> > create a Numpy struct array that would use separate data areas for the
> > struct fields.
> >
> > Regards
> >
> > Antoine.
> >
> > > I now have an issue with the Rust implementation since I'm not sure how
> > > do I access or iterate over the rows of the resulting StructArray, which
> > > was trivial in Python.
> > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > On Sunday, March 21, 2021 2:22 PM, Hagai Har-Gil
> > > [email protected] wrote:
> > >
> > > > After some more digging I did arrive at something which seems more
> > > > efficient than what I had:
> > > > struct_schema = pa.struct([('field0', pa.int32()), ('field1',
> > > > pa.int8())])
> > > > nparray = x = np.array([(0, 10), (1, 20)], dtype=[('field0', '<i4'),
> > > > ('field1', '<i1')])
> > > > struct_array = pa.array(nparray, type=struct_schema)
> > > > This looks easy, although I'm not sure how much copying is done down
> > > > below.
> > > > I now have an issue with the Rust implementation since I'm not sure how
> > > > do I access or iterate over the rows of the resulting StructArray.
> > > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > > On Sunday, March 21, 2021 10:52 AM, Hagai Har-Gil
> > > > [email protected] wrote:
> > > >
> > > > > Hi,
> > > > > I'm trying to efficiently convert incoming numpy.recarray's to
> > > > > pyarrow.StructArray and I'm unsure how to do so with the least amount
> > > > > of copying.
> > > > > My use case involves real time data processing of numpy.recarrays in
> > > > > Rust. I'm happily using the IPC protocol to transfer data to Rust's
> > > > > arrow implementation which will do the heavy lifting. I'll need to
> > > > > iterate on the recarray-turned-StructArray line-by-line, each time
> > > > > yielding all fields of a specific row, so the StructArray format is
> > > > > quite fitting. However, doing the actual conversion in an efficient
> > > > > manner seems harder than expected. The fields (=individual arrays) of
> > > > > a numpy.recarray aren't stored in a contiguous manner, so any
> > > > > numpy.recarray -> pyarrow.Array conversion first has to copy the data
> > > > > to standard pyarrow.Array buffers, and then re-construct the
> > > > > StructArray structure by interleaving the arrays. I was unable to
> > > > > find in the docs or in previous discussions here a better approach
> > > > > for this type of pre-processing step.
> > > > > Since I'm using IPC I'll eventually need to have the
> > > > > pyarrow.StructArray wrapped in a pyarrow.RecordBatch if that makes
> > > > > any difference.
> > > > > Thanks in advance
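
As a rough illustration of the layout difference discussed in this thread, the plain-Python sketch below (standard library only, deliberately not using pyarrow or NumPy) packs the two example rows into one interleaved "array of structs" buffer and then copies the values into one array per field, which is the "struct of arrays" shape. The names RECORD, aos, field0, field1 and rows_again are hypothetical and chosen for this example. It only mimics the two layouts to show why a copy is needed; it is not how Arrow performs the conversion internally.

```python
import struct
from array import array

# Packed little-endian record: an int32 'field0' followed by an int8 'field1',
# mirroring the dtype [('field0', '<i4'), ('field1', '<i1')] from the thread.
RECORD = struct.Struct("<ib")

rows = [(0, 10), (1, 20)]

# Array of structs: the fields of each row sit next to each other,
# so the values of a single field are NOT contiguous in memory.
aos = b"".join(RECORD.pack(*row) for row in rows)

# Struct of arrays: one contiguous buffer per field. Building these
# means visiting every record and copying each value into its column.
field0 = array("i")  # C int; assumed to be 4 bytes wide on this platform
field1 = array("b")  # signed char, 1 byte
for f0, f1 in RECORD.iter_unpack(aos):
    field0.append(f0)
    field1.append(f1)

# Row-wise access to the columnar form zips the columns back together.
rows_again = list(zip(field0, field1))
```

Each record occupies 5 bytes in the interleaved buffer, so the int32 values of field0 are separated by the int8 values of field1. A columnar layout needs them back to back, which is why the conversion cannot simply reuse the original memory.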
