Thanks again, Wes, for explaining all of this. It looks good.

Regards,
Kai
-----Original Message-----
From: Wes McKinney [mailto:w...@cloudera.com]
Sent: Tuesday, March 22, 2016 11:11 PM
To: dev@arrow.apache.org
Subject: Re: Shared memory "IPC" of Arrow row batches in C++

hi Kai

On Mon, Mar 21, 2016 at 8:40 AM, Zheng, Kai <kai.zh...@intel.com> wrote:
> Thanks, Wes. This sounds like a good start on the IPC direction.
>
>>> It'd be great to get some benchmark code written so that we are also
>>> able to make technical decisions on the basis of measurable
>>> performance implications.
>
> Is there any bootstrap setup for the benchmarks -- here, in Parquet, or
> elsewhere -- that we can borrow? Does it mean we'll compare two or more
> approaches, or just measure the performance of a code path like the
> read-path you mentioned? For the C++ part, are benchmarks in C++ or
> Python preferred?

Micah is working on ARROW-28 (https://github.com/apache/arrow/pull/29),
which will give us an organized way to create benchmarks.

>>> For example, while the read-path of the above code does not copy any
>>> data, it would be useful to know how fast reassembling the row batch
>>> data structure is and how this scales with the number of columns.
>
> I guess this means the data is columnar, and the read-path will
> reassemble it into row batches without any data copying (by pointers),
> right?

Yes, it's only reassembling C++ objects with memory addresses; no data is
copied. But it will be nice to know how fast this reassembly process is --
I don't know yet whether it's low (< 50) microseconds or something more
than that.

- Wes

> Bear with me if I say something stupid. Thanks!
>
> Regards,
> Kai
>
> -----Original Message-----
> From: Wes McKinney [mailto:w...@cloudera.com]
> Sent: Saturday, March 19, 2016 2:06 PM
> To: dev@arrow.apache.org
> Subject: Shared memory "IPC" of Arrow row batches in C++
>
> I've been collaborating with Steven Phillips (who's been working on the
> Java Arrow impl recently) to show a proof of concept ping-ponging Arrow
> data back and forth between the Java and C++ implementations. We aren't
> 100% there yet, but I got the C++-to-C++ round trip to a memory map
> working today (for primitive types -- e.g. integers):
>
> https://github.com/apache/arrow/pull/28
>
> We created a small metadata specification using Flatbuffers IDL --
> feedback would be much desired here:
>
> https://github.com/apache/arrow/pull/28/files#diff-520b20e87eb508faa3cc7aa9855030d7
>
> This includes:
>
> - Logical schemas
> - Data headers: compact descriptions of row batches associated with a
>   particular schema
>
> The idea is that two systems agree up front on "what is the schema," so
> that only the data header (containing memory offsets and sizes and some
> other important data-dependent metadata) needs to be sent for each
> batch. After working through this in some real code, I'm feeling fairly
> good that it meets the needs of Arrow for the time being, but there may
> be some unknown requirements that it would be good to learn about
> sooner rather than later.
>
> After some design review and iteration we'll want to document the
> metadata specification as part of the format in more gory detail.
>
> (Note: We are using Flatbuffers for convenience, performance, and
> development simplicity -- one feature that is especially nice is its
> union support, but it could be done in other serialization tools, too.)
>
> It'd be great to get some benchmark code written so that we are also
> able to make technical decisions on the basis of measurable performance
> implications. For example, while the read-path of the above code does
> not copy any data, it would be useful to know how fast reassembling the
> row batch data structure is and how this scales with the number of
> columns.
>
> best regards,
> Wes
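[Editor's note: to make the schema/data-header split concrete, a metadata layout along these lines could be expressed in Flatbuffers IDL roughly as below. All table, enum, and field names here are invented for illustration -- the actual specification is the one in the PR linked above.]

```text
// Hypothetical sketch only; see the linked PR for the real metadata spec.
namespace example.ipc;

enum TypeId : byte { INT32, INT64, FLOAT, DOUBLE }

table Field {
  name: string;
  type: TypeId;
  nullable: bool;
}

// The logical schema: agreed on once, up front, by both systems.
table Schema {
  fields: [Field];
}

// Where one buffer's memory lives in the shared region.
table Buffer {
  offset: long;   // byte offset into the memory-mapped region
  length: long;   // size in bytes
}

table FieldNode {
  length: long;       // number of values
  null_count: long;
  buffers: [Buffer];  // e.g. validity bitmap + data buffer
}

// The data header: sent per row batch, interpreted against the Schema.
table RowBatch {
  num_rows: long;
  nodes: [FieldNode];
}
```

Under this split, only the compact `RowBatch` message crosses the channel per batch; the receiver resolves each `Buffer` offset against the shared memory region to reassemble columns without copying.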