Re: [DISCUSS] C-level in-process array protocol

Jed Brown Tue, 01 Oct 2019 13:23:05 -0700

I'd just like to chime in with the use case of in-situ data analysis for
simulations.  This domain tends to be cautious with dependencies and
there is a lot of C and Fortran, but the in-situ analysis tools will
preferably reside in separate processes while sharing data via shared
memory (/dev/shm or MPI_Win_allocate_shared).  An in-memory protocol
that holds raw pointers would be problematic because they are typically
in different virtual address spaces when shared between processes.  I
think this is a potential application for a C interface with lean
dependencies, but it wouldn't be useful if it can't be shared
out-of-process.
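
To make the pointer concern concrete, here is a minimal sketch of a
position-independent layout.  It is not part of the proposal: the
ShmInt32Array struct and both helper functions are names I made up for
illustration.  The idea is that the descriptor stores byte offsets from
the start of the shared region, and each process resolves them against
its own mapping address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor (not from the proposal): locations are byte
 * offsets from the start of the shared region, never raw pointers. */
typedef struct {
    int64_t  length;         /* number of int32 elements */
    uint64_t values_offset;  /* byte offset of the values from the region base */
} ShmInt32Array;

/* Producer: write the header at `base`, followed by the values.
 * Returns 0 on success, -1 if the region is too small. */
static int shm_int32_write(void *base, size_t region_size,
                           const int32_t *src, int64_t length)
{
    const size_t values_offset = sizeof(ShmInt32Array);
    assert(base != NULL && src != NULL);
    if (length < 0 || region_size < values_offset) return -1;
    if ((uint64_t)length > (region_size - values_offset) / sizeof(int32_t))
        return -1;

    ShmInt32Array hdr;
    hdr.length = length;
    hdr.values_offset = values_offset;
    memcpy(base, &hdr, sizeof hdr);
    memcpy((char *)base + values_offset, src,
           (size_t)length * sizeof(int32_t));
    return 0;
}

/* Consumer: resolve the stored offset against the local mapping address,
 * which may differ from the producer's address for the same region. */
static const int32_t *shm_int32_values(const void *base, int64_t *length_out)
{
    assert(base != NULL && length_out != NULL);
    const ShmInt32Array *hdr = (const ShmInt32Array *)base;
    *length_out = hdr->length;
    return (const int32_t *)((const char *)base + hdr->values_offset);
}
```

In real use `base` would come from mmap on a /dev/shm file or from
MPI_Win_shared_query, and it must be suitably aligned for the header.
A struct holding a raw `const int32_t *` instead of values_offset would
only be valid in the process that wrote it.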


Jacques Nadeau <[email protected]> writes:

> I disagree with this statement:
>
> - the IPC format is meant for serialization while the C data protocol is
> meant for in-memory communication, so different concerns apply
>
> If that is how a particular implementation presents it, that is a
> weakness of the implementation, not the format. The primary use case I
> was focused on when working on the initial format was communication within
> the same process. It seems like this is being used as a basis for the
> introduction of new things when the premise is inconsistent with the
> intention of the creation. The specific reason we used flatbuffers in the
> project was to collapse the separation of in-process and out-of-process
> communication. It means the same thing it does with the Arrow data itself:
> that a consumer doesn't have to use a particular library to interact with
> and use the data.
>
> It seems like there are two ideas here:
>
> 1) How do we make it easier for people to use Arrow?
> 2) Should we implement a new in-memory representation of Arrow that is
> language-specific?
>
> I'm entirely in support of number one. If for a particular type of domain,
> people want an easier way to interact with Arrow, let's make a new library
> that helps with that. In each of our current libraries, we do many things
> to make it easier to work with Arrow. None of those require a change to the
> core format or are formalized as a new in-memory standard. The in-memory
> representations of Rust, JavaScript, or Java objects are implementation
> details.
>
> I'm against number two as it creates a fragmentation problem. Arrow is
> about having a single canonical format for memory for both metadata and
> data. Having multiple in-memory formats (especially when some are not
> language independent) is counter to the goals of the project.
>
> Two other, separate comments:
> 1) I don't understand the idea that we need to change the way Arrow
> fundamentally works so that people can avoid using a dependency. If the
> dependency is small, open source and easy to build, people can fork it and
> include directly if they want to. Let's not violate project principles
> because DuckDB has a religious perspective on dependencies. If the problem
> is people have to swallow too large of a pill to do basic things with Arrow
> in C, let's focus on fixing that (to our definition of ease, not someone
> else's). If FlatCC solves some of those things, great. If we need to build a
> baby integration library that is more C centric, great. Neither of those
> things require implementing something at the format level.
>
> 2) It seems like we should discuss the data structure problem separately
> from the reference management concern.
>
>
> On Tue, Oct 1, 2019 at 5:42 AM Wes McKinney <[email protected]> wrote:
>
>> hi Antoine,
>>
>> On Tue, Oct 1, 2019 at 4:29 AM Antoine Pitrou <[email protected]> wrote:
>> >
>> >
>> > Le 01/10/2019 à 00:39, Wes McKinney a écrit :
>> > > A couple things:
>> > >
>> > > * I think a C protocol / FFI for Arrow array/vectors would be better
>> > > to have the same "shape" as an assembled array. Note that the C
>> > > structs here have very nearly the same "shape" as the data structure
>> > > representing a C++ Array object [1]. The disassembly and reassembly
>> > > here is substantially simpler than the IPC protocol. A recursive
>> > > structure in Flatbuffers would make RecordBatch messages much larger,
>> > > so the flattened / disassembled representation we use for serialized
>> > > record batches is the correct one
>> >
>> > I'm not sure I agree:
>> >
>> > - indeed, it's not a coincidence that the ArrowArray struct looks quite
>> > closely like the C++ ArrayData object :-)  We have good experience with
>> > that abstraction and it has proven to work quite well
>> >
>> > - the IPC format is meant for serialization while the C data protocol is
>> > meant for in-memory communication, so different concerns apply
>> >
>> > - the fact that this makes the layout slightly larger doesn't seem
>> > important at all; we're not talking about transferring data over the wire
>> >
>> > There's also another argument for having a recursive struct: it
>> > simplifies how the data type is represented, since we can encode each
>> > child type individually instead of encoding it in the parent's format
>> > string (same applies for metadata and individual flags).
>> >
>>
>> I was saying something different here. I was making an argument about
>> why we use the flattened array-of-structs in the IPC protocol. One
>> reason is that it's a more compact representation. That is not very
>> important here because this protocol is only for *in-process* (for
>> languages that have a C FFI facility) rather than *inter-process*
>> communication.
>>
>> I agree also that the type encoding is simple, here, too, since we
>> aren't having to split the schema and record batch between different
>> serialized messages. There is some potential waste with having to
>> populate the type fields multiple times when communicating a sequence
>> of "chunks" from the same logical dataset.
>>
>> > > * The "formal" C protocol having the "assembled" shape means that many
>> > > minimal Arrow users won't have to implement any separate data
>> > > structures. They can just use the C struct directly or a slightly
>> > > wrapped version thereof with some convenience functions.
>> >
>> > Yes, but the same applies to the current proposal.
>> >
>> > > * I think that requiring building a Flatbuffer for minimal use cases
>> > > (e.g. communicating simple record batches with primitive types) passes
>> > > on implementation burden to minimal users.
>> >
>> > It certainly does.
>> >
>> > > I think the mantra of the C protocol should be the following:
>> > >
>> > > * Users of the protocol have to write little to no code to use it. For
>> > > example, populating an INT32 array should require only a few lines of
>> > > code
>> >
>> > Agreed.  As a sidenote, the spec should have an example of doing this in
>> > raw C.
>> >
>> > Regards
>> >
>> > Antoine.
>>
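
For reference, the "few lines of raw C" goal quoted above might look
roughly like the sketch below.  This is an illustration under stated
assumptions, not the ArrowArray struct from the proposal: ExampleArray,
its field names, and the "i" format code are all made up here.  It only
shows the recursive "assembled" shape, in which each node carries its
own type string and children are separate nodes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical assembled-array struct, for illustration only. */
typedef struct ExampleArray {
    const char *format;              /* type of this node only (made-up code) */
    int64_t length;                  /* number of elements */
    int64_t null_count;              /* number of null elements */
    int64_t n_buffers;               /* entries in `buffers` */
    const void **buffers;            /* [0] validity bitmap or NULL, [1] values */
    int64_t n_children;              /* entries in `children` */
    struct ExampleArray **children;  /* each child describes its own type */
} ExampleArray;

/* Describe an existing int32 buffer without copying it.  The caller owns
 * `values` and the two-element `buffer_slots` array, and both must
 * outlive `out`. */
static void example_int32_init(ExampleArray *out, const void **buffer_slots,
                               const int32_t *values, int64_t length)
{
    assert(out != NULL && buffer_slots != NULL && length >= 0);
    buffer_slots[0] = NULL;    /* no validity bitmap: the array has no nulls */
    buffer_slots[1] = values;  /* contiguous int32 values */

    out->format = "i";
    out->length = length;
    out->null_count = 0;
    out->n_buffers = 2;
    out->buffers = buffer_slots;
    out->n_children = 0;       /* primitive type: no child arrays */
    out->children = NULL;
}
```

A nested type such as a list of int32 would point children[0] at a node
like this one rather than encoding the child type in the parent's format
string.  Reference management (who frees what, and when) is deliberately
left out of the sketch.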
