Thanks, Wes. This sounds like a good start on the IPC direction.

>> It'd be great to get some benchmark code written so that we are also able to 
>> make technical decisions on the basis of measurable performance 
>> implications. 
Is there any bootstrap setup for benchmarks, here, in Parquet, or elsewhere, 
that we could borrow? Does it mean we will compare two or more approaches, or 
just measure the performance of a code path like the read path you mentioned? 
For the C++ part, are benchmarks in C++ or Python preferred?

>> For example, while the read-path of the above code does not copy any data, it 
>> would be useful to know how fast reassembling the row batch data structure is 
>> and how this scales with the number of columns.
I guess that means the data is columnar, and the read path will reassemble it 
into row batches without any data copy (via pointers), right?
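To check my understanding, here is a rough sketch of the "no data copy" idea; 
the types and names are made up for illustration, not the API from the PR:

#include <cstdint>

// A typed, zero-copy view over a region of the memory-mapped file.
struct Int32Column {
  const int32_t* values;  // points directly into the mapped region
  int64_t length;
};

Int32Column ViewColumn(const uint8_t* mmap_base, int64_t offset,
                       int64_t length) {
  // No memcpy: reassembly is just interpreting an offset from the data
  // header as a pointer into shared memory.
  return Int32Column{reinterpret_cast<const int32_t*>(mmap_base + offset),
                     length};
}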

Bear with me if I am asking something obvious. Thanks!

Regards,
Kai

-----Original Message-----
From: Wes McKinney [mailto:w...@cloudera.com] 
Sent: Saturday, March 19, 2016 2:06 PM
To: dev@arrow.apache.org
Subject: Shared memory "IPC" of Arrow row batches in C++

I've been collaborating with Steven Phillips (who's been working on the Java 
Arrow implementation recently) to show a proof of concept ping-ponging Arrow 
data back and forth between the Java and C++ implementations. We aren't 100% 
there yet, but I got a C++-to-C++ round trip through a memory map working 
today (for primitive types, e.g. integers):

https://github.com/apache/arrow/pull/28

We created a small metadata specification using Flatbuffers IDL; feedback 
would be very welcome here:

https://github.com/apache/arrow/pull/28/files#diff-520b20e87eb508faa3cc7aa9855030d7

This includes:

- Logical schemas
- Data headers: compact descriptions of the row batches associated with a 
particular schema

The idea is that two systems agree up front on "what is the schema", so that 
only the data header (containing memory offsets, sizes, and some other 
important data-dependent metadata) needs to be exchanged for each row batch. 
After working through this in some real code, I'm feeling fairly good that it 
meets the needs of Arrow for the time being, but there may be some unknown 
requirements that it would be good to learn about sooner rather than later.
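
To make the handshake concrete, here is a simplified sketch; the struct names 
are invented for illustration and are not the Flatbuffers types in the PR:

#include <cstdint>
#include <string>
#include <vector>

// Exchanged once, up front.
struct Schema {
  std::vector<std::string> field_names;
};

// An (offset, size) reference into the shared memory region.
struct BufferRef {
  int64_t offset;
  int64_t size;
};

// Sent per row batch: just enough to reassemble the columns.
struct DataHeader {
  int32_t num_rows;
  std::vector<BufferRef> buffers;  // e.g. validity + values per column
};

// Receiver side: resolving a buffer is pointer arithmetic, not copying.
const uint8_t* ResolveBuffer(const uint8_t* mmap_base, const BufferRef& ref) {
  return mmap_base + ref.offset;
}
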
After some design review and iteration we’ll want to document the metadata 
specification as part of the format in more gory detail.

(Note: we are using Flatbuffers for convenience, performance, and development 
simplicity; one feature that is especially nice is its union support, but this 
could be done with other serialization tools, too.)

It'd be great to get some benchmark code written so that we are also able to 
make technical decisions on the basis of measurable performance implications. 
For example, while the read-path of the above code does not copy any data, it 
would be useful to know how fast reassembling the row batch data structure is 
and how this scales with the number of columns.
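
As a strawman, a benchmark for that read path could look something like the 
following (using the Google Benchmark library purely as an example harness; 
we haven't settled on tooling, and the reassembly here is a stand-in for the 
real code):

#include <benchmark/benchmark.h>

#include <cstdint>
#include <vector>

static void BM_AssembleRowBatch(benchmark::State& state) {
  const int64_t num_columns = state.range(0);
  // Pre-built column buffers standing in for memory-mapped Arrow data.
  std::vector<std::vector<int32_t>> columns(
      num_columns, std::vector<int32_t>(1024));
  for (auto _ : state) {
    // "Reassembly" collects pointers to the column data; no values are
    // copied, mirroring the zero-copy read path.
    std::vector<const int32_t*> batch;
    batch.reserve(num_columns);
    for (const auto& col : columns) {
      batch.push_back(col.data());
    }
    benchmark::DoNotOptimize(batch.data());
  }
  state.SetComplexityN(num_columns);
}
BENCHMARK(BM_AssembleRowBatch)->RangeMultiplier(4)->Range(1, 4096)->Complexity();

BENCHMARK_MAIN();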

best regards,
Wes
