Re: [DISCUSS] C-level in-process array protocol

2019-10-08 Thread Wes McKinney
On Tue, Oct 8, 2019 at 3:34 PM Wes McKinney wrote: > > hi Jacques, > > On Tue, Oct 8, 2019 at 1:54 PM Jacques Nadeau wrote: > > > > I am removing all my objections to this work. > > > > I wish there was more feedback from additional community members. I > > continue to be concerned about

Re: [DISCUSS] C-level in-process array protocol

2019-10-08 Thread Wes McKinney
hi Jacques, On Tue, Oct 8, 2019 at 1:54 PM Jacques Nadeau wrote: > > I am removing all my objections to this work. > > I wish there was more feedback from additional community members. I continue > to be concerned about fragmentation. I don't agree with the arguments here > that we need to add a

Re: [DISCUSS] C-level in-process array protocol

2019-10-08 Thread Uwe L. Korn
I'm not sure whether flatbuffers is actually an issue in the end, but keeping it out of the C-API definitely simplifies it a bit adoption-wise. I don't think, though, that using protobuf would make a difference here. In general, I really like the C-interface work as sadly C-APIs are still the

Re: [DISCUSS] C-level in-process array protocol

2019-10-08 Thread Jacques Nadeau
I am removing all my objections to this work. I wish there was more feedback from additional community members. I continue to be concerned about fragmentation. I don't agree with the arguments here that we need to add a new API to make it easy for people to *not* use the Arrow codebase. It seems like a

Re: [DISCUSS] C-level in-process array protocol

2019-10-02 Thread Micah Kornfield
Hi Wes, I agree for third-parties "A" (Field data structures) is the most useful. At least in my mind the discussion was for both first and third-parties. I was trying to point out that "A" is less necessary as a first step for first-party integrations and could potentially require more effort

Re: [DISCUSS] C-level in-process array protocol

2019-10-02 Thread Wes McKinney
On Wed, Oct 2, 2019 at 11:05 PM Micah Kornfield wrote: > > I've tried to summarize my understanding of the debate so far and give some > initial thoughts. I think there are two potentially different sets of users > that we are targeting with a stable C API/ABI: ourselves and external > parties. >

Re: [DISCUSS] C-level in-process array protocol

2019-10-02 Thread Micah Kornfield
I've tried to summarize my understanding of the debate so far and give some initial thoughts. I think there are two potentially different sets of users that we are targeting with a stable C API/ABI: ourselves and external parties. 1. Different language implementations within the Arrow project

Re: [DISCUSS] C-level in-process array protocol

2019-10-02 Thread Wes McKinney
On Wed, Oct 2, 2019 at 10:19 PM Wes McKinney wrote: > > On Wed, Oct 2, 2019 at 7:46 PM Jacques Nadeau wrote: > > > > I'd like to hear more opinions from others on this topic. This conversation > > seems mostly dominated by comments from myself, Wes and Antoine. > > > > I think it is reasonable

Re: [DISCUSS] C-level in-process array protocol

2019-10-02 Thread Wes McKinney
On Wed, Oct 2, 2019 at 7:46 PM Jacques Nadeau wrote: > > I'd like to hear more opinions from others on this topic. This conversation > seems mostly dominated by comments from myself, Wes and Antoine. > > I think it is reasonable to argue that keeping any ABI (or header/struct > pattern) as narrow

Re: [DISCUSS] C-level in-process array protocol

2019-10-02 Thread Jacques Nadeau
I'd like to hear more opinions from others on this topic. This conversation seems mostly dominated by comments from myself, Wes and Antoine. I think it is reasonable to argue that keeping any ABI (or header/struct pattern) as narrow as possible would allow us to minimize overlap with the existing

Re: [DISCUSS] C-level in-process array protocol

2019-10-01 Thread Wes McKinney
I had an e-mail editing snafu so you can ignore the bottom "inline" portion since it's just a restatement of what is written more clearly above On Tue, Oct 1, 2019 at 9:32 PM Wes McKinney wrote: > > hi Jacques, > > I think we've veered off course a bit and maybe we could reframe the >

Re: [DISCUSS] C-level in-process array protocol

2019-10-01 Thread Wes McKinney
hi Jacques, I think we've veered off course a bit and maybe we could reframe the discussion. Goals * A "drop-in" header-only C file that projects can use as a programming interface either internally only or to expose in-memory data structures between C functions at call sites. Ideally little to

Re: [DISCUSS] C-level in-process array protocol

2019-10-01 Thread Wes McKinney
On Tue, Oct 1, 2019 at 3:22 PM Jed Brown wrote: > > I'd just like to chime in with the use case of in-situ data analysis for > simulations. This domain tends to be cautious with dependencies and > there is a lot of C and Fortran, but the in-situ analysis tools will > preferably reside in

Re: [DISCUSS] C-level in-process array protocol

2019-10-01 Thread Antoine Pitrou
As currently designed, it's entirely in-process. Shared memory with buffer lifetime handling is taken care of by something like Plasma. Regards Antoine. On 01/10/2019 at 22:22, Jed Brown wrote: > I'd just like to chime in with the use case of in-situ data analysis for > simulations.

Re: [DISCUSS] C-level in-process array protocol

2019-10-01 Thread Jed Brown
I'd just like to chime in with the use case of in-situ data analysis for simulations. This domain tends to be cautious with dependencies and there is a lot of C and Fortran, but the in-situ analysis tools will preferably reside in separate processes while sharing memory via shared memory

Re: [DISCUSS] C-level in-process array protocol

2019-10-01 Thread Jacques Nadeau
I disagree with this statement: - the IPC format is meant for serialization while the C data protocol is meant for in-memory communication, so different concerns apply. If that is how a particular implementation presents it, that is a weakness of the implementation, not the format. The

Re: [DISCUSS] C-level in-process array protocol

2019-10-01 Thread Wes McKinney
hi Antoine, On Tue, Oct 1, 2019 at 4:29 AM Antoine Pitrou wrote: > > > On 01/10/2019 at 00:39, Wes McKinney wrote: > > A couple things: > > > > * I think a C protocol / FFI for Arrow array/vectors would be better > > to have the same "shape" as an assembled array. Note that the C > > structs

Re: [DISCUSS] C-level in-process array protocol

2019-10-01 Thread Antoine Pitrou
On 01/10/2019 at 00:39, Wes McKinney wrote: > A couple things: > > * I think a C protocol / FFI for Arrow array/vectors would be better > to have the same "shape" as an assembled array. Note that the C > structs here have very nearly the same "shape" as the data structure > representing a C++

Re: [DISCUSS] C-level in-process array protocol

2019-09-30 Thread Wes McKinney
A couple things: * I think a C protocol / FFI for Arrow array/vectors would be better to have the same "shape" as an assembled array. Note that the C structs here have very nearly the same "shape" as the data structure representing a C++ Array object [1]. The disassembly and reassembly here is

Re: [DISCUSS] C-level in-process array protocol

2019-09-30 Thread Antoine Pitrou
FlatCC is still a dependency, with generated files etc. Perhaps you want to evaluate FlatCC on a schema-like example and see what the generated code and compile instructions look like? I'll point out again that the format string in my proposal uses an extremely simple mini-format, that should
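To give a feel for how little parsing a one-string type description can require, here is a minimal sketch of a parser for a tiny, hypothetical mini-format. The type codes below ("c", "s", "i", "l" for 8/16/32/64-bit signed integers and "w:N" for fixed-width binary) and every demo_* name are assumptions chosen for illustration; they are not taken from the draft proposal discussed in this thread.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical type tags for this sketch only. */
enum demo_type {
  DEMO_INVALID = 0,
  DEMO_INT8,
  DEMO_INT16,
  DEMO_INT32,
  DEMO_INT64,
  DEMO_FIXED_BINARY
};

struct demo_parsed {
  enum demo_type type;
  int32_t byte_width; /* bytes per value; 0 when the format is invalid */
};

/* Parse an assumed mini-format string into a type tag and a byte width. */
struct demo_parsed demo_parse_format(const char* fmt) {
  struct demo_parsed out = {DEMO_INVALID, 0};
  if (fmt == NULL || fmt[0] == '\0') return out;

  /* Single-character codes: fixed-size signed integers. */
  if (fmt[1] == '\0') {
    switch (fmt[0]) {
      case 'c': out.type = DEMO_INT8;  out.byte_width = 1; break;
      case 's': out.type = DEMO_INT16; out.byte_width = 2; break;
      case 'i': out.type = DEMO_INT32; out.byte_width = 4; break;
      case 'l': out.type = DEMO_INT64; out.byte_width = 8; break;
      default: break; /* unknown code stays DEMO_INVALID */
    }
    return out;
  }

  /* Parameterized code "w:N": fixed-width binary of N bytes. */
  if (fmt[0] == 'w' && fmt[1] == ':' && fmt[2] >= '0' && fmt[2] <= '9') {
    char* end = NULL;
    errno = 0;
    long width = strtol(fmt + 2, &end, 10);
    if (errno == 0 && *end == '\0' && width > 0 && width <= INT32_MAX) {
      out.type = DEMO_FIXED_BINARY;
      out.byte_width = (int32_t)width;
    }
  }
  return out;
}
```

A consumer would call demo_parse_format on the string it receives and reject the array when the result is DEMO_INVALID. The whole parser needs only the C standard library, which is the property the no-mandatory-dependency argument relies on.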

Re: [DISCUSS] C-level in-process array protocol

2019-09-30 Thread Ben Kietzman
FlatCC seems germane: https://github.com/dvidelabs/flatcc It compiles flatbuffer schemas down to (idiomatic?) C Perhaps the schema and batch serialization problems should be solved by storing everything in the flatbuffer format. Then the results of running flatcc plus a few simple helpers can be

Re: [DISCUSS] C-level in-process array protocol

2019-09-29 Thread Antoine Pitrou
One basic design point is to allow exchanging Arrow data with no mandatory dependency (the exception is JSON and base64 if you want to act on metadata - but that's highly optional, and those are extremely widespread formats). I'm afraid that Flatbuffers may be a deterrent: not only it

Re: [DISCUSS] C-level in-process array protocol

2019-09-29 Thread Antoine Pitrou
On 29/09/2019 at 19:59, Jacques Nadeau wrote: > > It seems like you're saying: "flatbuffers is too complex an encoding, let's > create a new encoding". Most of the spec is a plain C-level struct in the native ABI, so it avoids any kind of encoding issue. And, yes, flatbuffers must be dealt

Re: [DISCUSS] C-level in-process array protocol

2019-09-29 Thread Wes McKinney
There are two pieces of serialized data needed to communicate a record batch from one library to another * Serialized schema (i.e. what's in Schema.fbs) * Serialized "data header", i.e. the "RecordBatch" message in Message.fbs You _do_ need to use a Flatbuffers library to fully create these

Re: [DISCUSS] C-level in-process array protocol

2019-09-29 Thread Jacques Nadeau
On Sun, Sep 29, 2019 at 12:59 AM Antoine Pitrou wrote: > > On 29/09/2019 at 06:10, Jacques Nadeau wrote: > > * No dependency on Flatbuffers. > > * No buffer reassembly (data is already exposed in logical Arrow format). > > * Zero-copy by design. > > * Easy to reimplement from scratch. > > > >

Re: [DISCUSS] C-level in-process array protocol

2019-09-29 Thread Antoine Pitrou
On 29/09/2019 at 06:10, Jacques Nadeau wrote: > * No dependency on Flatbuffers. > * No buffer reassembly (data is already exposed in logical Arrow format). > * Zero-copy by design. > * Easy to reimplement from scratch. > > I don't see how the flatbuffer pattern for data headers doesn't

Re: [DISCUSS] C-level in-process array protocol

2019-09-28 Thread Jacques Nadeau
* No dependency on Flatbuffers. * No buffer reassembly (data is already exposed in logical Arrow format). * Zero-copy by design. * Easy to reimplement from scratch. I don't see how the flatbuffer pattern for data headers doesn't accomplish all of these things. At its definition, is a very simple

Re: [DISCUSS] C-level in-process array protocol

2019-09-28 Thread Jacques Nadeau
I'm not clear on why we need to introduce something beyond what flatbuffers already provides. Can someone explain that to me? I'm not really a fan of introducing a second representation of the same data (as I understand it). On Thu, Sep 19, 2019 at 1:15 PM Wes McKinney wrote: > This is helpful,

Re: [DISCUSS] C-level in-process array protocol

2019-09-19 Thread Wes McKinney
This is helpful, I will leave some comments on the proposal when I can, sometime in the next week. I agree that it would likely be opening a can of worms to create a semantic mapping between a generalized type grammar and Arrow's specific logical types defined in Schema.fbs. If we go down this

Re: [DISCUSS] C-level in-process array protocol

2019-09-19 Thread Antoine Pitrou
I've posted a draft specification PR here, this should help orient the discussion a bit: https://github.com/apache/arrow/pull/5442 Regards Antoine. On Wed, 18 Sep 2019 19:52:38 +0200 Antoine Pitrou wrote: > Hello, > > One thing that was discussed in the sync call is the ability to easily

Re: [DISCUSS] C-level in-process array protocol

2019-09-19 Thread Antoine Pitrou
I suppose it could be possible for an Arrow array to describe itself using the ndtypes vocabulary at some point. However, this is non-trivial, both on the producer and consumer side. Moreover, both sides must ensure they use the same ndtypes description. The idea here is to have a C data

Re: [DISCUSS] C-level in-process array protocol

2019-09-19 Thread Travis Oliphant
I know some on this list are familiar, but many may not have seen ndtypes in xnd: https://github.com/xnd-project/ndtypes It generalizes PEP 3118 for cross-language data-structure handling. Either a dependency on the small C-library libndtypes or using the concepts could be done. -Travis On

Re: [DISCUSS] C-level in-process array protocol

2019-09-19 Thread Zhuo Peng
On Thu, Sep 19, 2019 at 10:56 Antoine Pitrou wrote: > > On 19/09/2019 at 19:52, Zhuo Peng wrote: > > > > The problems are only potential and theoretical, and won't bite anyone > > until it occurs though, and it's more likely to happen with pip/wheel > than > > with conda. > > > > But anyways,

Re: [DISCUSS] C-level in-process array protocol

2019-09-19 Thread Antoine Pitrou
On 19/09/2019 at 19:52, Zhuo Peng wrote: > > The problems are only potential and theoretical, and won't bite anyone > until it occurs though, and it's more likely to happen with pip/wheel than > with conda. > > But anyways, this idea is still nice. I could imagine at least in arrow's >

Re: [DISCUSS] C-level in-process array protocol

2019-09-19 Thread Zhuo Peng
On Thu, Sep 19, 2019 at 10:18 AM Antoine Pitrou wrote: > > No, the plan for this proposal is to avoid providing a C API. Each > Arrow implementation could produce and consume the C data protocol, for > example the C++ Array class could add these methods: > > class Array { > // ... > >

Re: [DISCUSS] C-level in-process array protocol

2019-09-19 Thread Antoine Pitrou
On 19/09/2019 at 19:11, Uwe L. Korn wrote: > Hello, > > I like this proposal as it will make interfacing inside a process between > various Arrow supports much easier. I'm a bit critical though of using a > string as the format representation as one needs to parse it correctly. > Couldn't

Re: [DISCUSS] C-level in-process array protocol

2019-09-19 Thread Antoine Pitrou
No, the plan for this proposal is to avoid providing a C API. Each Arrow implementation could produce and consume the C data protocol, for example the C++ Array class could add these methods: class Array { // ... public: // Export array to the C data protocol void Share(ArrowArray*
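The producer/consumer idea described here, where each implementation fills in or reads a plain C struct instead of linking against a shared C API, can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the struct DemoArray, its fields, the "i" format code, and all demo_* functions are hypothetical and deliberately simplified; they are not the ArrowArray layout from the draft PR.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified struct for illustration only. */
struct DemoArray {
  const char* format;   /* assumed type code, here "i" for int32 */
  int64_t length;       /* number of values */
  int64_t null_count;   /* number of null values */
  int64_t n_buffers;    /* number of entries in buffers */
  const void** buffers; /* [0] validity bitmap (may be NULL), [1] values */
  void (*release)(struct DemoArray*); /* set by producer, called by consumer */
  void* private_data;   /* producer-owned state, unused in this sketch */
};

/* Producer-side callback: frees what the producer allocated. */
static void demo_release(struct DemoArray* arr) {
  free((void*)arr->buffers[1]);
  free((void*)arr->buffers);
  arr->buffers = NULL;
  arr->release = NULL; /* marks the struct as released */
}

/* Producer: export the int32 values 0..n-1. Returns 0 on success. */
int demo_export_range(int64_t n, struct DemoArray* out) {
  assert(out != NULL && n >= 0);
  int32_t* values = malloc((size_t)(n > 0 ? n : 1) * sizeof(int32_t));
  const void** buffers = malloc(2 * sizeof(*buffers));
  if (values == NULL || buffers == NULL) {
    free(values);
    free((void*)buffers);
    return -1;
  }
  for (int64_t i = 0; i < n; ++i) values[i] = (int32_t)i;
  buffers[0] = NULL;   /* no validity bitmap: there are no nulls */
  buffers[1] = values; /* value buffer, handed over without copying */
  out->format = "i";
  out->length = n;
  out->null_count = 0;
  out->n_buffers = 2;
  out->buffers = buffers;
  out->release = demo_release;
  out->private_data = NULL;
  return 0;
}

/* Consumer: reads the value buffer in place; needs only the struct layout. */
int64_t demo_sum_int32(const struct DemoArray* arr) {
  const int32_t* values = (const int32_t*)arr->buffers[1];
  int64_t total = 0;
  for (int64_t i = 0; i < arr->length; ++i) total += values[i];
  return total;
}
```

In this sketch a consumer calls demo_export_range, reads the buffers directly, and then calls arr.release(&arr) when done. The two sides share only the struct definition, so neither has to link against the other's library.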

Re: [DISCUSS] C-level in-process array protocol

2019-09-19 Thread Uwe L. Korn
Hello, I like this proposal as it will make interfacing inside a process between various Arrow supports much easier. I'm a bit critical though of using a string as the format representation as one needs to parse it correctly. Couldn't we use the enums we already have and reimplement them as

Re: [DISCUSS] C-level in-process array protocol

2019-09-19 Thread Zhuo Peng
Hi Antoine, I'm also interested in a stable ABI (previously I posted on this mailing list about the ABI issues I had [1]). Does having such an ABI-stable C-struct imply that there will be a set of C-APIs exposed by the Arrow (C++) library (which I think would lead to a solution to all the inherit

Re: [DISCUSS] C-level in-process array protocol

2019-09-19 Thread Antoine Pitrou
On 19/09/2019 at 09:39, Micah Kornfield wrote: > I like the idea of a stable ABI that can be used for in-process > communication. For instance, there was a recent question on > Stack Overflow on how to solve this [1]. > > A couple of thoughts/questions: > * Would

Re: [DISCUSS] C-level in-process array protocol

2019-09-19 Thread Micah Kornfield
I like the idea of a stable ABI that can be used for in-process communication. For instance, there was a recent question on Stack Overflow on how to solve this [1]. A couple of thoughts/questions: * Would ArrowArray also need a self reference for children arrays? * Should

[DISCUSS] C-level in-process array protocol

2019-09-18 Thread Antoine Pitrou
Hello, One thing that was discussed in the sync call is the ability to easily pass arrays at runtime between Arrow implementations or Arrow-supporting libraries in the same process, without bearing the cost of linking to e.g. the C++ Arrow library. (for example: "Duckdb wants to provide an