Re: [DISCUSS] C-level in-process array protocol

Jacques Nadeau Tue, 01 Oct 2019 12:55:52 -0700

I disagree with this statement:

- the IPC format is meant for serialization while the C data protocol is
meants for in-memory communication, so different concerns apply

If that is how the a particular implementation presents it, that is a
weaknesses of the implementation, not the format. The primary use case I
was focused on when working on the initial format was communication within
the same process. It seems like this is being used as a basis for the
introduction of new things when the premise is inconsistent with the
intention of the creation. The specific reason we used flatbuffers in the
project was to collapse the separation of in-process and out-of-process
communication. It means the same thing it does with the Arrow data itself:
that a consumer doesn't have to use a particular library to interact with
and use the data.

It seems like there are two ideas here:

1) How do we make it easier for people to use Arrow?
2) Should we implement a new in memory representation of Arrow that is
language specific.

I'm entirely in support of number one. If for a particular type of domain,
people want an easier way to interact with Arrow, let's make a new library
that helps with that. In easy of our current libraries, we do many things
to make it easier to work with Arrow. None of those require a change to the
core format or are formalized as a new in-memory standard. The in-memory
representation of rust or javascript or java objects are implementation
details.

I'm against number two as it creates a fragmentation problem. Arrow is
about having a single canonical format for memory for both metadata and
data. Having multiple in-memory formats (especially when some are not
language independent) is counter to the goals of the project.

Two other, separate comments:
1) I don't understand the idea that we need to change the way Arrow
fundamentally works so that people can avoid using a dependency. If the
dependency is small, open source and easy to build, people can fork it and
include directly if they want to. Let's not violate project principles
because DuckDB has a religious perspective on dependencies. If the problem
is people have to swallow too large of a pill to do basic things with Arrow
in C, let's focus on fixing that (to our definition of ease, not someone
else's). If FlatCC solves some those things, great. If we need to build a
baby integration library that is more C centric, great. Neither of those
things require implementing something at the format level.

2) It seems like we should discuss the data structure problem separately
from the reference management concern.

On Tue, Oct 1, 2019 at 5:42 AM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Antoine,
>
> On Tue, Oct 1, 2019 at 4:29 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> >
> > Le 01/10/2019 à 00:39, Wes McKinney a écrit :
> > > A couple things:
> > >
> > > * I think a C protocol / FFI for Arrow array/vectors would be better
> > > to have the same "shape" as an assembled array. Note that the C
> > > structs here have very nearly the same "shape" as the data structure
> > > representing a C++ Array object [1]. The disassembly and reassembly
> > > here is substantially simpler than the IPC protocol. A recursive
> > > structure in Flatbuffers would make RecordBatch messages much larger,
> > > so the flattened / disassembled representation we use for serialized
> > > record batches is the correct one
> >
> > I'm not sure I agree:
> >
> > - indeed, it's not a coincidence that the ArrowArray struct looks quite
> > closely like the C++ ArrayData object :-)  We have good experience with
> > that abstraction and it has proven to work quite well
> >
> > - the IPC format is meant for serialization while the C data protocol is
> > meants for in-memory communication, so different concerns apply
> >
> > - the fact that this makes the layout slightly larger doesn't seem
> > important at all; we're not talking about transferring data over the wire
> >
> > There's also another argument for having a recursive struct: it
> > simplifies how the data type is represented, since we can encode each
> > child type individually instead of encoding it in the parent's format
> > string (same applies for metadata and individual flags).
> >
>
> I was saying something different here. I was making an argument about
> why we use the flattened array-of-structs in the IPC protocol. One
> reason is that it's a more compact representation. That is not very
> important here because this protocol is only for *in-process* (for
> languages that have a C FFI facility) rather than *inter-process*
> communication.
>
> I agree also that the type encoding is simple, here, too, since we
> aren't having to split the schema and record batch between different
> serialized messages. There is some potential waste with having to
> populate the type fields multiple times when communicating a sequence
> of "chunks" from the same logical dataset.
>
> > > * The "formal" C protocol having the "assembled" shape means that many
> > > minimal Arrow users won't have to implement any separate data
> > > structures. They can just use the C struct directly or a slightly
> > > wrapped version thereof with some convenience functions.
> >
> > Yes, but the same applies to the current proposal.
> >
> > > * I think that requiring building a Flatbuffer for minimal use cases
> > > (e.g. communicating simple record batches with primitive types) passes
> > > on implementation burden to minimal users.
> >
> > It certainly does.
> >
> > > I think the mantra of the C protocol should be the following:
> > >
> > > * Users of the protocol have to write little to no code to use it. For
> > > example, populating an INT32 array should require only a few lines of
> > > code
> >
> > Agreed.  As a sidenote, the spec should have an example of doing this in
> > raw C.
> >
> > Regards
> >
> > Antoine.
>

Re: [DISCUSS] C-level in-process array protocol

Reply via email to