After some more digging I did arrive at something which seems more efficient 
than what I had:

import numpy as np
import pyarrow as pa

struct_schema = pa.struct([('field0', pa.int32()), ('field1', pa.int8())])
nparray = np.array([(0, 10), (1, 20)],
                   dtype=[('field0', '<i4'), ('field1', '<i1')])
struct_array = pa.array(nparray, type=struct_schema)

This looks easy, although I'm not sure how much copying happens under the hood.

I now have an issue with the Rust implementation: I'm not sure how to access 
or iterate over the rows of the resulting StructArray, which was trivial in 
Python.
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Sunday, March 21, 2021 2:22 PM, Hagai Har-Gil <[email protected]> 
wrote:

> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Sunday, March 21, 2021 10:52 AM, Hagai Har-Gil 
> <[email protected]> wrote:
>
>> Hi,
>>
>> I'm trying to efficiently convert incoming numpy.recarrays to 
>> pyarrow.StructArray and I'm unsure how to do so with the least amount of 
>> copying.
>>
>> My use case involves real time data processing of numpy.recarrays in Rust. 
>> I'm happily using the IPC protocol to transfer data to Rust's arrow 
>> implementation which will do the heavy lifting. I'll need to iterate over the 
>> recarray-turned-StructArray row by row, each time yielding all fields of a 
>> specific row, so the StructArray format is quite fitting. However, doing the 
>> actual conversion in an efficient manner seems harder than expected. The 
>> fields (=individual arrays) of a numpy.recarray aren't stored in a 
>> contiguous manner, so any numpy.recarray -> pyarrow.Array conversion first 
>> has to copy the data to standard pyarrow.Array buffers, and then 
>> re-construct the StructArray structure by interleaving the arrays. I was 
>> unable to find in the docs or in previous discussions here a better approach 
>> for this type of pre-processing step.
>>
>> Since I'm using IPC I'll eventually need to have the pyarrow.StructArray 
>> wrapped in a pyarrow.RecordBatch if that makes any difference.
>>
>> Thanks in advance.
