Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-13 Thread Phillip Cloud
Agreed. I hope that I didn't come off as flippant with respect to
performance.

I was hoping to convey that I think focusing on performance before we have
the semantics and high level design nailed down is not time well spent.

I think the current design doesn't depend on the format,
which is a good thing: we can pick the format that best suits the needs
of the community, and since performance is a big part of arrow,
that likely means picking a format that is also geared towards
performance.


Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-13 Thread Keith Kraus
> Personally, I do not care about the speed of IR processing right now.
> Any non-trivial (and probably trivial too) computation done
> by an IR consumer will dwarf the cost of IR processing. Of course,
> we shouldn't prematurely pessimize either, but there's no reason
> to spend time worrying about IR processing performance in my opinion
(yet).

In other processing engines I've fairly commonly seen situations where
the time to build the compute graph becomes non-negligible, sometimes even
more expensive than the computation itself. I've even seen attempts to
build a graph iteratively while executing, in order to overlap the cost of
building the graph with the compute execution.

There's been a huge amount of effort put into optimizing critical kernel
components like the hash table implementation in order to make Arrow the
most performant analytical library possible. Architecting and designing the
IR implementation without performance in mind from the beginning could
potentially put us into a difficult situation later that we'd have to
invest considerably more effort to work our way out of.


Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-13 Thread Weston Pace
I believe you would need a JSON compatible version of the type system
(including binary values) because you'd need to at least encode
literals.  However, I don't think that creating a human readable
encoding of the Arrow type system is a bad thing in and of itself.  We
have tickets and get questions occasionally asking for a JSON format.
This could at least be a step in that direction.  I don't think you'd
need to add support for arrays/batches/tables.  Note, the C++
implementation has a JSON format that is used for testing purposes
(though I do not believe it is comprehensive).
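
For illustration only, a hypothetical JSON encoding of a binary literal
might look like this (field names invented here, not a proposal; the
base64 wrapper is one way to carry raw bytes, since JSON itself cannot):

  {
    "literal": {
      "type": {"name": "binary"},
      "value": {"base64": "3q2+7w=="}
    }
  }

where "3q2+7w==" encodes the four bytes 0xDE 0xAD 0xBE 0xEF.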

I think we could add two (potentially conflicting) requirements
 * Low barrier to entry for consumers
 * Low barrier to entry for producers

JSON/YAML seem to lower the barrier to entry for producers.  Some
producers may not even be working with Arrow data (e.g. could one go
from SQL-literal -> JSON-literal skipping an intermediate
Arrow-literal step?).  I think we've also dismissed Antoine's earlier
point which I found the most compelling.  Handling flatbuffers adds
one more step that people have to integrate into their build systems.

Flatbuffers on the other hand lowers the barrier to entry for
consumers.  A consumer is likely already going to have flatbuffers
support built in so that they can read/write IPC files.  If we adopt
JSON then the consumer will have to add support for a new file format
(or at least part of one).


Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-13 Thread Jacob Quinn
>
> I just thought of one other requirement: the format needs to support
> arbitrary byte sequences.
>
Can you clarify why this is needed? Is it that custom_metadata maps should
allow byte sequences as values?


Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-13 Thread Phillip Cloud
On Fri, Aug 13, 2021 at 11:43 AM Antoine Pitrou  wrote:

> But the code executed by machines is written by humans.  I think that's
> mostly where the contention resides: is it easy to code, in any given
> language, the routines required to produce or consume the IR?
>

Definitely not for flatbuffers, since flatbuffers is IMO annoying to use
in any language except C++, and it's borderline annoying there too.
Protobuf is similar (less annoying in Rust, but still annoying in Python
and C++ IMO), though I think any binary format is going to be less
human-friendly, by construction.

If we were to use something like JSON or msgpack, can someone sketch out
the interaction between the IR and the rest of arrow's type system?

Would we need a JSON-encoded-arrow-type -> in-memory representation for an
Arrow type in a given language?

I just thought of one other requirement: the format needs to support
arbitrary byte sequences. JSON doesn't support untransformed byte
sequences, though it's not uncommon to base64-encode a byte sequence. IMO
that adds an unnecessary layer of complexity, which is another tradeoff to
consider.


Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-13 Thread Antoine Pitrou



Le 13/08/2021 à 17:35, Phillip Cloud a écrit :
>> I.e. make the ability to read and write by humans be more important than
>> speed of validation.
>
> I think I differ on whether the IR should be easy to read and write by
> humans. IR is going to be predominantly read and written by machines,
> though of course we will need a way to inspect it for debugging.

But the code executed by machines is written by humans.  I think that's
mostly where the contention resides: is it easy to code, in any given
language, the routines required to produce or consume the IR?




Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-13 Thread Phillip Cloud
On Fri, Aug 13, 2021 at 8:03 AM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Hi,
>
> The requirements for the compute IR as I see it are:
> >
> > * Implementations in IR producer and consumer languages.
> > * Strongly typed or the ability to easily validate a payload
> >
>
> What about:
>
> 1. easy to read and write by a large number of programming languages
>

Personally, I do not care about the speed of IR processing right now.
Any non-trivial (and probably trivial too) computation done
by an IR consumer will dwarf the cost of IR processing. Of course,
we shouldn't prematurely pessimize either, but there's no reason
to spend time worrying about IR processing performance in my opinion (yet).


> 2. easy to read and write by humans
>

I think this is where I differ. Would you accept

"easy to transform into something that can be read and written by humans"?

For example, you can turn a flatbuffer blob into its JSON equivalent using
a few command line flags passed to flatc.

That way, the IR can be flatbuffers, but if at any point someone wants to
look at something other than a meaningless blob of bytes, they can.
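
For instance, something along these lines (exact flags depend on the flatc
version; ComputeIR.fbs and plan.bin are placeholder names here):

  flatc --json --raw-binary ComputeIR.fbs -- plan.bin

which emits plan.json in the chosen output directory.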


> 3. fast to validate by a large number of programming languages
>

I guess it depends on what fast means here, as well as the programming
language and implementation of the validator. In my view, this falls under
"let's not worry about performance yet". To that point, I think a
structured format like protobuf or flatbuffers lets us punt on performance
for now. A counter-argument might be "if we're punting on performance,
then why not pick the one that's easiest to debug?" My only answer to that
is reuse of existing flatbuffers types, which requires some work (at some
point) to figure out how to distribute the generated code. With
JSON/TOML/YAML we would have to build those type definitions ourselves.
Maybe it's not a lot of effort, but I guess my inclination is to write
more CI code, rather than library code, if that's an option :)


>
> I.e. make the ability to read and write by humans be more important than
> speed of validation.


I think I differ on whether the IR should be easy to read and write by
humans.
IR is going to be predominantly read and written by machines, though of
course
we will need a way to inspect it for debugging.


>
> In this order, JSON/toml/yaml are preferred because they are supported by
> more languages and are more human readable, even if slower to validate.
>
> -
>
> My understanding is that for an async experience, we need the ability to
> `.await` at any `read_X` call so that if the `read_X` requests more bytes
> than are buffered, the `read_X(...).await` triggers a new (async) request
> to fill the buffer (which puts the future in a Pending state). When a
> library does not offer an async version of `read_X`, any `read_X` can
> force a request to fill the buffer, which then blocks the thread. One way
> around this is to wrap those blocking calls in async (e.g. via
> tokio::spawn_blocking). However, this forces users to use that runtime, or
> to create a new independent thread pool for their own async work. Neither
> is great for low-level libraries.
>
>
I think I'm still missing something here.

You can asynchronously read arbitrary byte sequences from a wide variety
of IO sources and then parse the bytes into the desired format.

I don't follow why that isn't sufficient to take advantage of async.

A library like tonic, for example, doesn't require that prost implement
async APIs (I still don't know what that would mean for an in-memory
format), yet tonic takes full advantage of async. In fact, I think it's
_only_ async.

I could understand the desire for a library to provide something like a
capital-S Stream where the bytes are consumed asynchronously. Is that what
you're after here?


> E.g. thrift does not offer async -> parquet-format-rs does not offer async
> -> parquet does not offer async -> datafusion wraps all parquet "IO-bounded
> and CPU-bounded operations" in spawn_blocking or something equivalent.


> Best,
> Jorge

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-13 Thread Wes McKinney
On Fri, Aug 13, 2021 at 2:03 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Hi,
>
> The requirements for the compute IR as I see it are:
> >
> > * Implementations in IR producer and consumer languages.
> > * Strongly typed or the ability to easily validate a payload
> >
>
> What about:
>
> 1. easy to read and write by a large number of programming languages
> 2. easy to read and write by humans
> 3. fast to validate by a large number of programming languages
>
> I.e. make the ability to read and write by humans be more important than
> speed of validation.
>
> In this order, JSON/toml/yaml are preferred because they are supported by
> more languages and are more human readable, even if slower to validate.
>

I am not sure that using JSON would make the IR “faster to validate”,
because the validation I believe we care more about is that the IR is
consistent with the specification. When you use Flatbuffers, the schema
verifier is built into the library. With JSON, each implementation must
determine for itself whether the data is correctly constructed (there are
of course libraries and frameworks available that help with enforcing JSON
schemas nowadays).

I think it would be fine to have a JSON alternative format for the IR, but
as the canonical/primary representation I believe it would make for a
net-higher implementation burden (to make something really robust, at
least) for IR users.
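
For example, a purely illustrative JSON-Schema fragment for a literal node
(names invented here, not a proposal) might look like:

  {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "required": ["type", "value"],
    "properties": {
      "type": {"enum": ["int64", "float64", "utf8", "binary"]},
      "value": {}
    }
  }

Each implementation then has to wire a validator for such a schema into
its IR consumer, whereas the flatbuffers verifier ships with the
flatbuffers library itself.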


Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-13 Thread Jorge Cardoso Leitão
Hi,

The requirements for the compute IR as I see it are:
>
> * Implementations in IR producer and consumer languages.
> * Strongly typed or the ability to easily validate a payload
>

What about:

1. easy to read and write by a large number of programming languages
2. easy to read and write by humans
3. fast to validate by a large number of programming languages

I.e. make the ability to read and write by humans be more important than
speed of validation.

In this order, JSON/toml/yaml are preferred because they are supported by
more languages and are more human readable, even if slower to validate.

-

My understanding is that for an async experience, we need the ability to
`.await` at any `read_X` call so that if the `read_X` requests more bytes
than are buffered, the `read_X(...).await` triggers a new (async) request
to fill the buffer (which puts the future in a Pending state). When a
library does not offer an async version of `read_X`, any `read_X` can force
a request to fill the buffer, which then blocks the thread. One way around
this is to wrap those blocking calls in async (e.g. via
tokio::spawn_blocking). However, this forces users to use that runtime, or
to create a new independent thread pool for their own async work. Neither
is great for low-level libraries.

E.g. thrift does not offer async -> parquet-format-rs does not offer async
-> parquet does not offer async -> datafusion wraps all parquet "IO-bounded
and CPU-bounded operations" in spawn_blocking or something equivalent.

Best,
Jorge


Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-12 Thread Phillip Cloud
On Thu, Aug 12, 2021 at 1:03 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> I agree with Antoine that we should weigh the pros and cons of flatbuffers
> (or protobuf or thrift for that matter) over a more human-friendly,
> simpler format like json or MsgPack. I also struggle a bit to reason with
> the complexity of using flatbuffers for this.
>

Ultimately I think different representations of the format will emerge if
compute IR is successful,
and people will implement JSON/proto/thrift/etc versions of the IR.

The requirements for the compute IR as I see it are:

* Implementations in IR producer and consumer languages.
* Strongly typed or the ability to easily validate a payload

It seems like Protobuf, Flatbuffers and JSON all meet the criteria here.
Beyond that, there's precedent in the codebase for flatbuffers (which is
just to say that flatbuffers is the devil we know).

Can people list other concrete requirements for the format? A
non-requirement might
be that there be _idiomatic_ implementations for every language arrow
supports, for example.

I think without agreement on requirements we won't ever arrive at consensus.

The compute IR spec itself doesn't really depend on the specific choice of
format, but we
need to get some consensus on the format.


> E.g. there is no async support for thrift, flatbuffers, or protobuf in
> Rust, which means that we can't read either parquet or arrow IPC
> asynchronously at the moment. These problems are usually easier to work
> around in simpler formats.
>

Can you elaborate a bit on the lack of async support here and what it would
mean for
a particular in-memory representation to support async, and why that
prevents reading
a parquet file using async?

Looking at JSON as an example, most libraries in the Rust ecosystem use
serde and serde_json
to serialize and deserialize JSON, and any async concerns occur at the
level of
a client/server library like warp (or some transitive dependency thereof
like Hyper).

Are you referring to something like the functionality implemented in
tokio-serde-json? If so,
I think you could probably build something for these other formats assuming
they have serde
support (flatbuffers notably does _not_, partially because of its incessant
need to own everything),
since tokio_serde is doing most of the work in tokio-serde-json. In any
case, I don't think
it's a requirement for the compute IR that there be a streaming transport
implementation for the
format.
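
To make that concrete, a hypothetical serde-modeled IR node (type and
field names invented for illustration, not a proposal) could be as small
as:

  use serde::{Deserialize, Serialize};

  // An internally tagged enum round-trips cleanly through serde_json.
  #[derive(Serialize, Deserialize, Debug)]
  #[serde(tag = "node")]
  enum IrNode {
      Literal { dtype: String, value: serde_json::Value },
      Call { function: String, args: Vec<IrNode> },
  }

  fn main() -> serde_json::Result<()> {
      // Build add(1, 2) and round-trip it through JSON text.
      let expr = IrNode::Call {
          function: "add".into(),
          args: vec![
              IrNode::Literal { dtype: "int64".into(), value: 1.into() },
              IrNode::Literal { dtype: "int64".into(), value: 2.into() },
          ],
      };
      let text = serde_json::to_string_pretty(&expr)?;
      let back: IrNode = serde_json::from_str(&text)?;
      println!("{}\n{:?}", text, back);
      Ok(())
  }

Any async transport concern then lives in whatever carries the bytes, not
in the format itself.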


>
> Best,
> Jorge
>
>
>
> On Thu, Aug 12, 2021 at 2:43 PM Antoine Pitrou  wrote:
>
> >
> > Le 12/08/2021 à 15:05, Wes McKinney a écrit :
> > > It seems that one adjacent problem here is how to make it simpler for
> > > third parties (especially ones that act as front end interfaces) to
> > > build and serialize/deserialize the IR structures with some kind of
> > > ready-to-go middleware library, written in a language like C++.
> >
> > A C++ library sounds a bit complicated to deal with for Java, Rust, Go,
> > etc. developers.
> >
> > I'm not sure which design decision and set of compromises would make the
> > most sense.  But this is why I'm asking the question "why not JSON?" (+
> > JSON-Schema if you want to ease validation by third parties).
> >
> > (note I have already mentioned MsgPack, but only in the case a binary
> > encoding is really required; it doesn't have any other advantage that I
> > know of over JSON, and it's less ubiquitous)
> >
> > Regards
> >
> > Antoine.
> >
> >
> > > To do that, one would need the equivalent of arrow/type.h and related
> > > Flatbuffers schema serialization code that lives in arrow/ipc. If you
> > > want to be able to completely and accurately serialize Schemas, you
> > > need quite a bit of code now.
> > >
> > > One possible approach (and not go crazy) would be to:
> > >
> > > * Move arrow/types.h and its dependencies into a standalone C++
> > > library that can be vendored into the main apache/arrow C++ library. I
> > > don't know how onerous arrow/types.h's transitive dependencies /
> > > interactions are at this point (there's a lot of stuff going on in
> > > type.cc [1] now)
> > > * Make the namespaces exported by this library configurable, so any
> > > library can vendor the Arrow types / IR builder APIs privately into
> > > their project
> > > * Maintain this "Arrow types and ComputeIR library" as an always
> > > zero-dependency library to facilitate vendoring
> > > * Lightweight bindings in languages we care about (like Python or R or
> > > GLib/Ruby) could be built to the IR builder middleware library
> > >
> > > This seems like what is more at issue compared with rather projects
> > > are copying the Flatbuffers files out of their project from
> > > apache/arrow or apache/arrow-format.
> > >
> > > [1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.cc
> > >
> > > On Thu, Aug 12, 2021 at 2:05 PM Andrew Lamb 
> > wrote:
> > >>
> > >> I support the idea of an independent repo that has the arrow
> flatbuffers
> > >> 

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-12 Thread Jorge Cardoso Leitão
I agree with Antoine that we should weigh the pros and cons of flatbuffers
(or protobuf or thrift for that matter) over a more human-friendly,
simpler format like json or MsgPack. I also struggle a bit to reason with
the complexity of using flatbuffers for this.

E.g. there is no async support for thrift, flatbuffers, or protobuf in
Rust, which means that we can't read either parquet or arrow IPC
asynchronously at the moment. These problems are usually easier to work
around in simpler formats.

Best,
Jorge



Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-12 Thread Antoine Pitrou



Le 12/08/2021 à 15:05, Wes McKinney a écrit :
> It seems that one adjacent problem here is how to make it simpler for
> third parties (especially ones that act as front end interfaces) to
> build and serialize/deserialize the IR structures with some kind of
> ready-to-go middleware library, written in a language like C++.

A C++ library sounds a bit complicated to deal with for Java, Rust, Go,
etc. developers.

I'm not sure which design decision and set of compromises would make the
most sense.  But this is why I'm asking the question "why not JSON?" (+
JSON-Schema if you want to ease validation by third parties).

(note I have already mentioned MsgPack, but only in the case a binary
encoding is really required; it doesn't have any other advantage that I
know of over JSON, and it's less ubiquitous)

Regards

Antoine.

> To do that, one would need the equivalent of arrow/type.h and related
> Flatbuffers schema serialization code that lives in arrow/ipc. If you
> want to be able to completely and accurately serialize Schemas, you
> need quite a bit of code now.
>
> One possible approach (and not go crazy) would be to:
>
> * Move arrow/types.h and its dependencies into a standalone C++
> library that can be vendored into the main apache/arrow C++ library. I
> don't know how onerous arrow/types.h's transitive dependencies /
> interactions are at this point (there's a lot of stuff going on in
> type.cc [1] now)
> * Make the namespaces exported by this library configurable, so any
> library can vendor the Arrow types / IR builder APIs privately into
> their project
> * Maintain this "Arrow types and ComputeIR library" as an always
> zero-dependency library to facilitate vendoring
> * Lightweight bindings in languages we care about (like Python or R or
> GLib/Ruby) could be built to the IR builder middleware library
>
> This seems like what is more at issue compared with whether projects
> are copying the Flatbuffers files out of their project from
> apache/arrow or apache/arrow-format.
>
> [1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.cc
>
> On Thu, Aug 12, 2021 at 2:05 PM Andrew Lamb  wrote:
> >
> > I support the idea of an independent repo that has the arrow flatbuffers
> > format definition files.
> >
> > My rationale is that the Rust implementation has a copy of the `format`
> > directory [1] and potential drift worries me (a bit). Having a single
> > source of truth for the format that is not part of the large mono repo
> > would be a good thing.
> >
> > Andrew
> >
> > [1] https://github.com/apache/arrow-rs/tree/master/format
> >
> > On Wed, Aug 11, 2021 at 2:40 PM Phillip Cloud  wrote:
> >
> > > Hi all,
> > >
> > > I'd like to bring up an idea from a recent thread ([1]) about moving
> > > the `format/` directory out of the primary apache/arrow repository.
> > >
> > > I understand from that thread there are some concerns about using
> > > submodules, and I definitely sympathize with those concerns.
> > >
> > > In talking with David Li (disclaimer: we work together at Voltron
> > > Data), he has a great idea that I think makes everyone happy: an
> > > `apache/arrow-format` repository that is the official mirror for the
> > > flatbuffers IDL, that library authors should use as the source of
> > > truth.
> > >
> > > It doesn't require a submodule, yet it also allows external projects
> > > the ability to access the IDL without having to interact with the
> > > main arrow repository and is backwards compatible to boot.
> > >
> > > In this scenario, repositories that are currently copying in the
> > > flatbuffers IDL can migrate to this repository at their leisure.
> > >
> > > My motivation for this was around sharing data structures for the
> > > compute IR proposal ([2]).
> > >
> > > I can think of at least two ways for IR producers and consumers of
> > > all languages to share the flatbuffers IDL:
> > >
> > > 1. A set of bindings built in some language that other languages can
> > > integrate with, likely C++, that allows library users to build IR
> > > using an API.
> > >
> > > The primary downside to this is that we'd have to deal with building
> > > another library while working out any kinks in the IR design and I'd
> > > rather avoid that in the initial phases of this project.
> > >
> > > The benefit is that IR components don't interact much with
> > > `flatbuffers` or `flatc` directly.
> > >
> > > 2. A single location where the format lives, that doesn't require
> > > depending on a large multi-language repository to access a handful
> > > of files.
> > >
> > > I think the downside to this is that there's a bit of additional
> > > infrastructure to automate copying in `arrow-format`.
> > >
> > > The benefit there is that producers and consumers can immediately
> > > start getting value from compute IR without having to wait for
> > > development of a new API.
> > >
> > > One counter-proposal might be to just put the compute IR IDL in a
> > > separate repo, but that isn't tenable because the compute IR needs
> > > arrow's type information contained in `Schema.fbs`.
> > >
> > > I was hoping to avoid conflating the discussion about bindings vs
> > > direct flatbuffer usage (at least initially just supporting one, I
> > > predict we'll need both ultimately) with the decision about whether
> > > to split out the format directory, but it's a good example of a
> > > choice for which splitting out 

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-12 Thread Wes McKinney
On Thu, Aug 12, 2021 at 3:16 PM Neal Richardson  wrote:
>
> > Maintain this "Arrow types and ComputeIR library" as an always
> zero-dependency library to facilitate vendoring
>
> Would/should this hypothetical zero-dep, vendorable library also include
> the IPC format? Or if you want to interact with IPC in that case, the C
> data interface is the best/only option?

No, to do anything with the IPC format would pull in arrow::Buffer,
arrow::Array, and many other inextricable components which are used
with the IPC read/write implementation.

> Or if you want to interact with IPC in that case, the C data interface is the 
> best/only option?

I'm not clear on what you mean since the C data interface is only for
data interchange at function call sites in-process, and not for
serialization (interprocess).


Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-12 Thread Neal Richardson
> Maintain this "Arrow types and ComputeIR library" as an always
zero-dependency library to facilitate vendoring

Would/should this hypothetical zero-dep, vendorable library also include
the IPC format? Or if you want to interact with IPC in that case, the C
data interface is the best/only option?

On Thu, Aug 12, 2021 at 9:06 AM Wes McKinney  wrote:

> It seems that one adjacent problem here is how to make it simpler for
> third parties (especially ones that act as front end interfaces) to
> build and serialize/deserialize the IR structures with some kind of
> ready-to-go middleware library, written in a language like C++.
>
> To do that, one would need the equivalent of arrow/type.h and related
> Flatbuffers schema serialization code that lives in arrow/ipc. If you
> want to be able to completely and accurately serialize Schemas, you
> need quite a bit of code now.
>
> One possible approach (and not go crazy) would be to:
>
> * Move arrow/types.h and its dependencies into a standalone C++
> library that can be vendored into the main apache/arrow C++ library. I
> don't know how onerous arrow/types.h's transitive dependencies /
> interactions are at this point (there's a lot of stuff going on in
> type.cc [1] now)
> * Make the namespaces exported by this library configurable, so any
> library can vendor the Arrow types / IR builder APIs privately into
> their project
> * Maintain this "Arrow types and ComputeIR library" as an always
> zero-dependency library to facilitate vendoring
> * Lightweight bindings in languages we care about (like Python or R or
> GLib/Ruby) could be built to the IR builder middleware library
>
> This seems like what is more at issue compared with rather projects
> are copying the Flatbuffers files out of their project from
> apache/arrow or apache/arrow-format.
>
> [1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.cc
>
> On Thu, Aug 12, 2021 at 2:05 PM Andrew Lamb  wrote:
> >
> > I support the idea of an independent repo that has the arrow flatbuffers
> > format definition files.
> >
> > My rationale is that the Rust implementation has a copy of the `format`
> > directory [1] and potential drift worries me (a bit). Having a single
> > source of truth for the format that is not part of the large mono repo
> > would be a good thing.
> >
> > Andrew
> >
> > [1] https://github.com/apache/arrow-rs/tree/master/format
> >
> > On Wed, Aug 11, 2021 at 2:40 PM Phillip Cloud  wrote:
> >
> > > Hi all,
> > >
> > > I'd like to bring up an idea from a recent thread ([1]) about moving
> the
> > > `format/` directory out of the primary apache/arrow repository.
> > >
> > > I understand from that thread there are some concerns about using
> > > submodules,
> > > and I definitely sympathize with those concerns.
> > >
> > > In talking with David Li (disclaimer: we work together at Voltron
> Data), he
> > > has
> > > a great idea that I think makes everyone happy: an
> `apache/arrow-format`
> > > repository that is the official mirror for the flatbuffers IDL, that
> > > library
> > > authors should use as the source of truth.
> > >
> > > It doesn't require a submodule, yet it also allows external projects
> the
> > > ability to access the IDL without having to interact with the main
> arrow
> > > repository and is backwards compatible to boot.
> > >
> > > In this scenario, repositories that are currently copying in the
> > > flatbuffers
> > > IDL can migrate to this repository at their leisure.
> > >
> > > My motivation for this was around sharing data structures for the
> compute
> > > IR
> > > proposal ([2]).
> > >
> > > I can think of at least two ways for IR producers and consumers of all
> > > languages to share the flatbuffers IDL:
> > >
> > > 1. A set of bindings built in some language that other languages can
> > > integrate
> > >with, likely C++, that allows library users to build IR using an
> API.
> > >
> > > The primary downside to this is that we'd have to deal with
> > > building another library while working out any kinks in the IR design
> and
> > > I'd
> > > rather avoid that in the initial phases of this project.
> > >
> > > The benefit is that IR components don't interact much with
> `flatbuffers` or
> > > `flatc` directly.
> > >
> > > 2. A single location where the format lives, that doesn't require
> depending
> > > on
> > >a large multi-language repository to access a handful of files.
> > >
> > > I think the downside to this is that there's a bit of additional
> > > infrastructure
> > > to automate copying in `arrow-format`.
> > >
> > > The benefit there is that producers and consumers can immediately start
> > > getting
> > > value from compute IR without having to wait for development of a new
> API.
> > >
> > > One counter-proposal might be to just put the compute IR IDL in a
> separate
> > > repo,
> > > but that isn't tenable because the compute IR needs arrow's type
> > > information
> > > contained in `Schema.fbs`.
> > >
> > > I was hoping to avoid conflating the discussion about bindings vs direct
> > > flatbuffer usage with the decision about whether to split out the format
> > > directory [...]

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-12 Thread Phillip Cloud
On Thu, Aug 12, 2021 at 9:06 AM Wes McKinney  wrote:

> It seems that one adjacent problem here is how to make it simpler for
> third parties (especially ones that act as front end interfaces) to
> build and serialize/deserialize the IR structures with some kind of
> ready-to-go middleware library, written in a language like C++.
>
> To do that, one would need the equivalent of arrow/type.h and related
> Flatbuffers schema serialization code that lives in arrow/ipc. If you
> want to be able to completely and accurately serialize Schemas, you
> need quite a bit of code now.
>
> One possible approach (without going crazy) would be to:
>
> * Move arrow/type.h and its dependencies into a standalone C++
> library that can be vendored into the main apache/arrow C++ library. I
> don't know how onerous arrow/type.h's transitive dependencies /
> interactions are at this point (there's a lot of stuff going on in
> type.cc [1] now)
> * Make the namespaces exported by this library configurable, so any
> library can vendor the Arrow types / IR builder APIs privately into
> their project
> * Maintain this "Arrow types and ComputeIR library" as an always
> zero-dependency library to facilitate vendoring
> * Lightweight bindings in languages we care about (like Python or R or
> GLib/Ruby) could be built to the IR builder middleware library
>
> This seems like what is more at issue, compared with whether projects
> are copying the Flatbuffers files into their project from
> apache/arrow or apache/arrow-format.


I was hoping we could avoid doing something like this until there's a clear
need for it, in the interest of not spending a huge amount of time on
adjacent dependency management work.

My thinking is that the primary effort should go into solidifying the IR
design: keeping it possible for folks to test, but without spending a bunch
of time up front building a middleware library.

I think the use case of simplifying external consumption of the arrow format
might even deserve its own dedicated mailing list thread.


>


> [1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.cc
>
> On Thu, Aug 12, 2021 at 2:05 PM Andrew Lamb  wrote:
> >
> > I support the idea of an independent repo that has the arrow flatbuffers
> > format definition files.
> >
> > My rationale is that the Rust implementation has a copy of the `format`
> > directory [1] and potential drift worries me (a bit). Having a single
> > source of truth for the format that is not part of the large mono repo
> > would be a good thing.
> >
> > Andrew
> >
> > [1] https://github.com/apache/arrow-rs/tree/master/format
> >
> > On Wed, Aug 11, 2021 at 2:40 PM Phillip Cloud  wrote:
> >
> > > Hi all,
> > >
> > > I'd like to bring up an idea from a recent thread ([1]) about moving
> the
> > > `format/` directory out of the primary apache/arrow repository.
> > >
> > > I understand from that thread there are some concerns about using
> > > submodules,
> > > and I definitely sympathize with those concerns.
> > >
> > > In talking with David Li (disclaimer: we work together at Voltron
> > > Data), he shared a great idea that I think makes everyone happy: an
> > > `apache/arrow-format` repository that is the official mirror for the
> > > flatbuffers IDL and that library authors should use as the source of
> > > truth.
> > >
> > > It doesn't require a submodule, yet it also gives external projects the
> > > ability to access the IDL without having to interact with the main arrow
> > > repository, and it is backwards compatible to boot.
> > >
> > > In this scenario, repositories that are currently copying in the
> > > flatbuffers
> > > IDL can migrate to this repository at their leisure.
> > >
> > > My motivation for this was around sharing data structures for the
> compute
> > > IR
> > > proposal ([2]).
> > >
> > > I can think of at least two ways for IR producers and consumers of all
> > > languages to share the flatbuffers IDL:
> > >
> > > 1. A set of bindings built in some language that other languages can
> > > integrate
> > >with, likely C++, that allows library users to build IR using an
> API.
> > >
> > > The primary downside to this is that we'd have to deal with
> > > building another library while working out any kinks in the IR design
> and
> > > I'd
> > > rather avoid that in the initial phases of this project.
> > >
> > > The benefit is that IR components don't interact much with
> `flatbuffers` or
> > > `flatc` directly.
> > >
> > > 2. A single location where the format lives, that doesn't require
> depending
> > > on
> > >a large multi-language repository to access a handful of files.
> > >
> > > I think the downside to this is that there's a bit of additional
> > > infrastructure
> > > to automate copying in `arrow-format`.
> > >
> > > The benefit there is that producers and consumers can immediately start
> > > getting
> > > value from compute IR without having to wait for development of a new
> API.
> > >
> > > One counter-proposal might be to just put the compute IR IDL in a
> > > separate repo [...]

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-12 Thread Wes McKinney
It seems that one adjacent problem here is how to make it simpler for
third parties (especially ones that act as front end interfaces) to
build and serialize/deserialize the IR structures with some kind of
ready-to-go middleware library, written in a language like C++.

To do that, one would need the equivalent of arrow/type.h and related
Flatbuffers schema serialization code that lives in arrow/ipc. If you
want to be able to completely and accurately serialize Schemas, you
need quite a bit of code now.

One possible approach (without going crazy) would be to:

* Move arrow/type.h and its dependencies into a standalone C++
library that can be vendored into the main apache/arrow C++ library. I
don't know how onerous arrow/type.h's transitive dependencies /
interactions are at this point (there's a lot of stuff going on in
type.cc [1] now)
* Make the namespaces exported by this library configurable, so any
library can vendor the Arrow types / IR builder APIs privately into
their project
* Maintain this "Arrow types and ComputeIR library" as an always
zero-dependency library to facilitate vendoring
* Lightweight bindings in languages we care about (like Python or R or
GLib/Ruby) could be built to the IR builder middleware library

This seems like what is more at issue, compared with whether projects
are copying the Flatbuffers files into their project from
apache/arrow or apache/arrow-format.

[1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.cc
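
For illustration, a rough sketch of what thin Python bindings over such an
IR-builder middleware library might look like (every name below is
hypothetical; no such library exists yet):

    # Hypothetical bindings over a vendorable "Arrow types + IR builder"
    # C++ library; nothing here is a real API.
    import arrow_ir  # hypothetical binding module

    schema = arrow_ir.schema([
        ("x", arrow_ir.int64()),
        ("y", arrow_ir.float64()),
    ])
    plan = arrow_ir.project(arrow_ir.source("t", schema), exprs=["x"])
    buf = plan.serialize()  # Flatbuffers bytes any IR consumer can read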

On Thu, Aug 12, 2021 at 2:05 PM Andrew Lamb  wrote:
>
> I support the idea of an independent repo that has the arrow flatbuffers
> format definition files.
>
> My rationale is that the Rust implementation has a copy of the `format`
> directory [1] and potential drift worries me (a bit). Having a single
> source of truth for the format that is not part of the large mono repo
> would be a good thing.
>
> Andrew
>
> [1] https://github.com/apache/arrow-rs/tree/master/format
>
> On Wed, Aug 11, 2021 at 2:40 PM Phillip Cloud  wrote:
>
> > Hi all,
> >
> > I'd like to bring up an idea from a recent thread ([1]) about moving the
> > `format/` directory out of the primary apache/arrow repository.
> >
> > I understand from that thread there are some concerns about using
> > submodules,
> > and I definitely sympathize with those concerns.
> >
> > In talking with David Li (disclaimer: we work together at Voltron Data), he
> > shared a great idea that I think makes everyone happy: an `apache/arrow-format`
> > repository that is the official mirror for the flatbuffers IDL and that
> > library authors should use as the source of truth.
> >
> > It doesn't require a submodule, yet it also gives external projects the
> > ability to access the IDL without having to interact with the main arrow
> > repository, and it is backwards compatible to boot.
> >
> > In this scenario, repositories that are currently copying in the
> > flatbuffers
> > IDL can migrate to this repository at their leisure.
> >
> > My motivation for this was around sharing data structures for the compute
> > IR
> > proposal ([2]).
> >
> > I can think of at least two ways for IR producers and consumers of all
> > languages to share the flatbuffers IDL:
> >
> > 1. A set of bindings built in some language that other languages can
> > integrate
> >with, likely C++, that allows library users to build IR using an API.
> >
> > The primary downside to this is that we'd have to deal with
> > building another library while working out any kinks in the IR design and
> > I'd
> > rather avoid that in the initial phases of this project.
> >
> > The benefit is that IR components don't interact much with `flatbuffers` or
> > `flatc` directly.
> >
> > 2. A single location where the format lives, that doesn't require depending
> > on
> >a large multi-language repository to access a handful of files.
> >
> > I think the downside to this is that there's a bit of additional
> > infrastructure
> > to automate copying in `arrow-format`.
> >
> > The benefit there is that producers and consumers can immediately start
> > getting
> > value from compute IR without having to wait for development of a new API.
> >
> > One counter-proposal might be to just put the compute IR IDL in a separate
> > repo,
> > but that isn't tenable because the compute IR needs arrow's type
> > information
> > contained in `Schema.fbs`.
> >
> > I was hoping to avoid conflating the discussion about bindings vs direct
> > flatbuffer usage (at least initially just supporting one, I predict we'll
> > need
> > both ultimately) with the decision about whether to split out the format
> > directory, but it's a good example of a choice that would be well served by
> > splitting out the format directory.
> >
> > I'll note that this doesn't block anything on the compute IR side, just
> > wanted
> > to surface this in a parallel thread and see what folks think.
> >
> > [1]:
> >
> > https://lists.apache.org/thread.html/rcebfcb4c5d0b7752fcdda6587871c2f94661b8c4e35119f0bcfb883b%40%3Cdev.arrow.apache.org%3E

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-12 Thread Andrew Lamb
I support the idea of an independent repo that has the arrow flatbuffers
format definition files.

My rationale is that the Rust implementation has a copy of the `format`
directory [1] and potential drift worries me (a bit). Having a single
source of truth for the format that is not part of the large mono repo
would be a good thing.

Andrew

[1] https://github.com/apache/arrow-rs/tree/master/format

On Wed, Aug 11, 2021 at 2:40 PM Phillip Cloud  wrote:

> Hi all,
>
> I'd like to bring up an idea from a recent thread ([1]) about moving the
> `format/` directory out of the primary apache/arrow repository.
>
> I understand from that thread there are some concerns about using
> submodules,
> and I definitely sympathize with those concerns.
>
> In talking with David Li (disclaimer: we work together at Voltron Data), he
> shared a great idea that I think makes everyone happy: an `apache/arrow-format`
> repository that is the official mirror for the flatbuffers IDL and that
> library authors should use as the source of truth.
>
> It doesn't require a submodule, yet it also gives external projects the
> ability to access the IDL without having to interact with the main arrow
> repository, and it is backwards compatible to boot.
>
> In this scenario, repositories that are currently copying in the
> flatbuffers
> IDL can migrate to this repository at their leisure.
>
> My motivation for this was around sharing data structures for the compute
> IR
> proposal ([2]).
>
> I can think of at least two ways for IR producers and consumers of all
> languages to share the flatbuffers IDL:
>
> 1. A set of bindings built in some language that other languages can
> integrate
>with, likely C++, that allows library users to build IR using an API.
>
> The primary downside to this is that we'd have to deal with
> building another library while working out any kinks in the IR design and
> I'd
> rather avoid that in the initial phases of this project.
>
> The benefit is that IR components don't interact much with `flatbuffers` or
> `flatc` directly.
>
> 2. A single location where the format lives, that doesn't require depending
> on
>a large multi-language repository to access a handful of files.
>
> I think the downside to this is that there's a bit of additional
> infrastructure
> to automate copying in `arrow-format`.
>
> The benefit there is that producers and consumers can immediately start
> getting
> value from compute IR without having to wait for development of a new API.
>
> One counter-proposal might be to just put the compute IR IDL in a separate
> repo,
> but that isn't tenable because the compute IR needs arrow's type
> information
> contained in `Schema.fbs`.
>
> I was hoping to avoid conflating the discussion about bindings vs direct
> flatbuffer usage (at least initially just supporting one, I predict we'll
> need
> both ultimately) with the decision about whether to split out the format
> directory, but it's a good example of a choice that would be well served by
> splitting out the format directory.
>
> I'll note that this doesn't block anything on the compute IR side, just
> wanted
> to surface this in a parallel thread and see what folks think.
>
> [1]:
>
> https://lists.apache.org/thread.html/rcebfcb4c5d0b7752fcdda6587871c2f94661b8c4e35119f0bcfb883b%40%3Cdev.arrow.apache.org%3E
> [2]:
>
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.ie0ne0gm762l
>


Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Phillip Cloud
On Wed, Aug 11, 2021, 19:05 Weston Pace  wrote:

> >> The benefit is that IR components don't interact much with
> `flatbuffers` or
> >> `flatc` directly.
> >>
> [...]
> >>
> >> One counter-proposal might be to just put the compute IR IDL in a
> separate
> >> repo,
> >> but that isn't tenable because the compute IR needs arrow's type
> information
> >> contained in `Schema.fbs`.
>
> > This argument seems predicated on the hypothesis that the compute IR will
> > use Flatbuffers.  Is it set in stone?
>
> +1 for the original proposal (mirror repo for specs).  I don't think
> we have to figure out the IR format.  It makes sense for all language
> independent specs to be in a single place regardless of format.  If IR
> picked JSON I would still argue the JSON schemas for IR belong in the
> same repository as the Arrow columnar format flatbuffers files.  It
> makes it clear what is spec and what is implementation / toolkit.
> Especially since a mirror repo should be pretty low maintenance.
>

That's a good point. I hadn't considered that point of view, but I think
you're right that specs, regardless of wire format, should remain together.


> On Wed, Aug 11, 2021 at 11:34 AM Antoine Pitrou 
> wrote:
> >
> >
> > Le 11/08/2021 à 23:06, Phillip Cloud a écrit :
> > > On Wed, Aug 11, 2021 at 4:22 PM Antoine Pitrou 
> wrote:
> > >
> > >> Le 11/08/2021 à 22:16, Phillip Cloud a écrit :
> > >>>
> > >>> Yeah, that is a drawback here, though I don't see needing to run
> flatc
> > >> as a
> > >>> major downside given the upside
> > >>> of not having to write additional code to move between formats.
> > >>
> > >> That's only an advantage if you already know how to read the Arrow IPC
> > >> format (and, yes, in this case you already run `flatc`).  Some
> projects
> > >> probably don't care about Arrow IPC (Dask, for example).
> > >
> > >
> > > I don't think it's about the IPC though, at least for the compute IR
> use
> > > case.
> > > Am I missing something there?
> >
> > If you're not handling the Arrow IPC format, then you probably don't
> > have an encoder/decoder for Schema.fbs, so the "upside of not having to
> > write additional code to move between formats" doesn't exist (unless I'm
> > misunderstanding your point?).
> >
> > > I do think a downside of not using something like JSON or msgpack is
> > > that schema validation must be implemented by both the producer and the
> > > consumer.
> > > That means we'd have a number of other consequential decisions to make:
> > >
> > > * Do we provide the validation library?
> > > * If not, do all the languages arrow supports have high quality
> libraries
> > > for validating schemas?
> > > * If so, then we have to implement/maintain/release/bugfix that.
> >
> > This is true.  However, Flatbuffers doesn't validate much on its own,
> > either, because its IDL is not expressive enough.  For example,
> > `Schema.fbs` allows you to declare an INT8 field with children, a LIST
> > field without any children, a non-nullable NULL field...
> >
> > (also, there's JSON Schema: https://json-schema.org/)
> >
> > Regards
> >
> > Antoine.
>


Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Weston Pace
>> The benefit is that IR components don't interact much with `flatbuffers` or
>> `flatc` directly.
>>
[...]
>>
>> One counter-proposal might be to just put the compute IR IDL in a separate
>> repo,
>> but that isn't tenable because the compute IR needs arrow's type information
>> contained in `Schema.fbs`.

> This argument seems predicated on the hypothesis that the compute IR will
> use Flatbuffers.  Is it set in stone?

+1 for the original proposal (mirror repo for specs).  I don't think
we have to figure out the IR format.  It makes sense for all language
independent specs to be in a single place regardless of format.  If IR
picked JSON I would still argue the JSON schemas for IR belong in the
same repository as the Arrow columnar format flatbuffers files.  It
makes it clear what is spec and what is implementation / toolkit.
Especially since a mirror repo should be pretty low maintenance.

On Wed, Aug 11, 2021 at 11:34 AM Antoine Pitrou  wrote:
>
>
> Le 11/08/2021 à 23:06, Phillip Cloud a écrit :
> > On Wed, Aug 11, 2021 at 4:22 PM Antoine Pitrou  wrote:
> >
> >> Le 11/08/2021 à 22:16, Phillip Cloud a écrit :
> >>>
> >>> Yeah, that is a drawback here, though I don't see needing to run flatc
> >> as a
> >>> major downside given the upside
> >>> of not having to write additional code to move between formats.
> >>
> >> That's only an advantage if you already know how to read the Arrow IPC
> >> format (and, yes, in this case you already run `flatc`).  Some projects
> >> probably don't care about Arrow IPC (Dask, for example).
> >
> >
> > I don't think it's about the IPC though, at least for the compute IR use
> > case.
> > Am I missing something there?
>
> If you're not handling the Arrow IPC format, then you probably don't
> have an encoder/decoder for Schema.fbs, so the "upside of not having to
> write additional code to move between formats" doesn't exist (unless I'm
> misunderstanding your point?).
>
> > I do think a downside of not using something like JSON or msgpack is
> > that schema validation must be implemented by both the producer and the
> > consumer.
> > That means we'd have a number of other consequential decisions to make:
> >
> > * Do we provide the validation library?
> > * If not, do all the languages arrow supports have high quality libraries
> > for validating schemas?
> > * If so, then we have to implement/maintain/release/bugfix that.
>
> This is true.  However, Flatbuffers doesn't validate much on its own,
> either, because its IDL is not expressive enough.  For example,
> `Schema.fbs` allows you to declare an INT8 field with children, a LIST
> field without any children, a non-nullable NULL field...
>
> (also, there's JSON Schema: https://json-schema.org/)
>
> Regards
>
> Antoine.


Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Antoine Pitrou



Le 11/08/2021 à 23:06, Phillip Cloud a écrit :
> On Wed, Aug 11, 2021 at 4:22 PM Antoine Pitrou  wrote:
>
>> Le 11/08/2021 à 22:16, Phillip Cloud a écrit :
>>>
>>> Yeah, that is a drawback here, though I don't see needing to run flatc
>>> as a major downside given the upside
>>> of not having to write additional code to move between formats.
>>
>> That's only an advantage if you already know how to read the Arrow IPC
>> format (and, yes, in this case you already run `flatc`).  Some projects
>> probably don't care about Arrow IPC (Dask, for example).
>
> I don't think it's about the IPC though, at least for the compute IR use
> case.
> Am I missing something there?

If you're not handling the Arrow IPC format, then you probably don't
have an encoder/decoder for Schema.fbs, so the "upside of not having to
write additional code to move between formats" doesn't exist (unless I'm
misunderstanding your point?).

> I do think a downside of not using something like JSON or msgpack is
> that schema validation must be implemented by both the producer and the
> consumer.
> That means we'd have a number of other consequential decisions to make:
>
> * Do we provide the validation library?
> * If not, do all the languages arrow supports have high quality libraries
> for validating schemas?
> * If so, then we have to implement/maintain/release/bugfix that.

This is true.  However, Flatbuffers doesn't validate much on its own,
either, because its IDL is not expressive enough.  For example,
`Schema.fbs` allows you to declare an INT8 field with children, a LIST
field without any children, a non-nullable NULL field...

(also, there's JSON Schema: https://json-schema.org/)
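
For a concrete illustration, a minimal sketch (in Python, over a
hypothetical nested-dict decoding of a schema) of the kind of semantic
check that no IDL enforces by itself:

    # Sketch only: semantic checks a consumer still needs, whatever the
    # serialization format. The dict layout here is hypothetical.
    def check_field(field):
        type_name = field["type"]          # e.g. "int8", "list", "null"
        children = field.get("children", [])
        if type_name == "list" and len(children) != 1:
            raise ValueError("a LIST field needs exactly one child")
        if type_name == "int8" and children:
            raise ValueError("an INT8 field cannot have children")
        if type_name == "null" and not field.get("nullable", True):
            raise ValueError("a NULL field must be nullable")
        for child in children:
            check_field(child)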

Regards

Antoine.


Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Phillip Cloud
On Wed, Aug 11, 2021 at 4:21 PM David Li  wrote:

> If the worry is public distribution (i.e. requiring all downstream
> projects to also run flatc in their builds) we could perhaps ship a package
> that just consists of the generated code (though that's definitely more
> packaging burden, and won't help when you're doing development against
> in-progress or unreleased changes).
>
> -David
>

Arrow need not take on yet another packaging burden here: library authors
can run flatc during development and release cycles, and ship the generated
code alongside their library code (whatever that means for the specific
language). End users of, say, ibis never need to think about having flatc
around.
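
A minimal sketch of that workflow (the package layout and the ComputeIR.fbs
file below are hypothetical):

    # Maintainer-side workflow sketch; end users never run flatc:
    #
    #     flatc --python -o mylib/_format Schema.fbs ComputeIR.fbs
    #
    # The generated sources get committed and shipped with the package, so
    # installing the library needs nothing beyond the flatbuffers runtime:
    import flatbuffers  # the only runtime dependency the shipped code needs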


>
> On Wed, Aug 11, 2021, at 16:16, Phillip Cloud wrote:
> > On Wed, Aug 11, 2021 at 4:05 PM Antoine Pitrou 
> wrote:
> >
> > >
> > > Le 11/08/2021 à 22:02, Phillip Cloud a écrit :
> > > > On Wed, Aug 11, 2021 at 3:58 PM Antoine Pitrou 
> > > wrote:
> > > >
> > > >>
> > > >> Le 11/08/2021 à 21:56, Phillip Cloud a écrit :
> > > >>> I can see how that might be a bit circular. Let me start from the
> > > >>> perspective of requirements. We want to be able to reuse Arrow's
> > > >>> types and schema, without having to write additional code to move
> > > >>> back and forth between compute IR and not-compute-IR. I think that
> > > >>> leaves only flatbuffers as an option.
> > > >>
> > > >> If that's the case then agreed (well, you can always embed as a raw
> > > >> bytestring in other formats, but that wouldn't be pretty).
> > > >>
> > > >> I just wonder what the complexity of using Flatbuffers is for e.g.
> > > Python.
> > > >>
> > > >
> > > > IMO the complexity isn't high, but the generated code is definitely
> not
> > > > idiomatic (
> > > > https://google.github.io/flatbuffers/flatbuffers_guide_tutorial.html
> )
> > >
> > > Wow. And you also have to integrate `flatc` in your build chain?
> > >
> >
> > Yeah, that is a drawback here, though I don't see needing to run flatc
> as a
> > major downside given the upside
> > of not having to write additional code to move between formats.
> >
> > Is there something particularly onerous about needing to run a codegen
> step
> > in a build process
> > (other than it being build-step number 1000 in a death by 1000
> build-steps
> > scenario)?
> >
> >
> > >
> > > IMHO that compares poorly to JSON or MsgPack, for example.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> >
>


Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Phillip Cloud
On Wed, Aug 11, 2021 at 4:22 PM Antoine Pitrou  wrote:

>
> Le 11/08/2021 à 22:16, Phillip Cloud a écrit :
> >
> > Yeah, that is a drawback here, though I don't see needing to run flatc
> as a
> > major downside given the upside
> > of not having to write additional code to move between formats.
>
> That's only an advantage if you already know how to read the Arrow IPC
> format (and, yes, in this case you already run `flatc`).  Some projects
> probably don't care about Arrow IPC (Dask, for example).


I don't think it's about the IPC though, at least for the compute IR use
case.
Am I missing something there?

I do think a downside of not using something like JSON or msgpack is
that schema validation must be implemented by both the producer and the
consumer.
That means we'd have a number of other consequential decisions to make:

* Do we provide the validation library?
* If not, do all the languages arrow supports have high quality libraries
for validating schemas?
* If so, then we have to implement/maintain/release/bugfix that.
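
If IR went the JSON route, a minimal sketch of what that per-side validation
could look like with the third-party `jsonschema` package (the schema
fragment below is hypothetical):

    import jsonschema  # third-party package

    FIELD_SCHEMA = {  # hypothetical fragment of an IR JSON Schema
        "type": "object",
        "required": ["name", "type"],
        "properties": {
            "name": {"type": "string"},
            "type": {"type": "string"},
            "nullable": {"type": "boolean"},
        },
    }

    # Both producer and consumer would run something like this, and the
    # schema document itself would have to be written and maintained.
    jsonschema.validate({"name": "x", "type": "int8"}, FIELD_SCHEMA)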

This isn't the case with fb or protos, since their toolchains have already
done that work: anything that decodes successfully is structurally valid by
definition.


>
> > Is there something particularly onerous about needing to run a codegen
> step
> > in a build process
> > (other than it being build-step number 1000 in a death by 1000
> build-steps
> > scenario)?
>
> Most Python packages (except perhaps Numpy, Pandas, PyArrow...) have a
> very simple build configuration.  Adding an external command in the mix
> (that needs a non-standard dependency) isn't trivial.
>

I don't find this too compelling. One language's lack of modern dependency
management tooling and refusal to make it easy to run external tools during
that process doesn't seem like a strong reason to rule out flatbuffers here.

I want to support everyone as best we can, but any choice we make here
will have some tradeoffs. I see not being able to share the exact same
schema and type information as a huge downside relative to the cost
of having to run a binary during a build process.

To be clear, users should _definitely_ not be running flatc; it's only
library authors that should be running it as part of a
development/build/release cycle.


>
> Regards
>
> Antoine.
>


Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Antoine Pitrou



Le 11/08/2021 à 22:20, David Li a écrit :
> If the worry is public distribution (i.e. requiring all downstream projects
> to also run flatc in their builds) we could perhaps ship a package that
> just consists of the generated code (though that's definitely more
> packaging burden, and won't help when you're doing development against
> in-progress or unreleased changes).

Yes, we can do that.  And in this case, we can even probably hide the
Flatbuffers objects behind a more idiomatic API (such as nested dicts in
Python).
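
For example, a minimal sketch of such a wrapper, assuming a flatc-generated
Schema object with accessors in the classic flatbuffers Python style
(Fields(i), FieldsLength(), and so on):

    # Sketch only: wrap flatc-generated accessors in plain dicts. The
    # accessor names assume classic flatbuffers Python codegen.
    def schema_to_dict(schema):
        return {
            "fields": [
                {
                    "name": schema.Fields(i).Name().decode("utf-8"),
                    "nullable": bool(schema.Fields(i).Nullable()),
                }
                for i in range(schema.FieldsLength())
            ]
        }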


Regards

Antoine.


Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Antoine Pitrou



Le 11/08/2021 à 22:16, Phillip Cloud a écrit :
> Yeah, that is a drawback here, though I don't see needing to run flatc as a
> major downside given the upside
> of not having to write additional code to move between formats.

That's only an advantage if you already know how to read the Arrow IPC
format (and, yes, in this case you already run `flatc`).  Some projects
probably don't care about Arrow IPC (Dask, for example).

> Is there something particularly onerous about needing to run a codegen step
> in a build process
> (other than it being build-step number 1000 in a death by 1000 build-steps
> scenario)?

Most Python packages (except perhaps Numpy, Pandas, PyArrow...) have a
very simple build configuration.  Adding an external command in the mix
(that needs a non-standard dependency) isn't trivial.


Regards

Antoine.


Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread David Li
If the worry is public distribution (i.e. requiring all downstream projects to 
also run flatc in their builds) we could perhaps ship a package that just 
consists of the generated code (though that's definitely more packaging burden, 
and won't help when you're doing development against in-progress or unreleased 
changes).

-David

On Wed, Aug 11, 2021, at 16:16, Phillip Cloud wrote:
> On Wed, Aug 11, 2021 at 4:05 PM Antoine Pitrou  wrote:
> 
> >
> > Le 11/08/2021 à 22:02, Phillip Cloud a écrit :
> > > On Wed, Aug 11, 2021 at 3:58 PM Antoine Pitrou 
> > wrote:
> > >
> > >>
> > >> Le 11/08/2021 à 21:56, Phillip Cloud a écrit :
> > >>> I can see how that might be a bit circular. Let me start from the
> > >>> perspective of requirements. We want to be able to reuse Arrow's
> > >>> types and schema, without having to write additional code to move
> > >>> back and forth between compute IR and not-compute-IR. I think that
> > >>> leaves only flatbuffers as an option.
> > >>
> > >> If that's the case then agreed (well, you can always embed as a raw
> > >> bytestring in other formats, but that wouldn't be pretty).
> > >>
> > >> I just wonder what the complexity of using Flatbuffers is for e.g.
> > Python.
> > >>
> > >
> > > IMO the complexity isn't high, but the generated code is definitely not
> > > idiomatic (
> > > https://google.github.io/flatbuffers/flatbuffers_guide_tutorial.html)
> >
> > Wow. And you also have to integrate `flatc` in your build chain?
> >
> 
> Yeah, that is a drawback here, though I don't see needing to run flatc as a
> major downside given the upside
> of not having to write additional code to move between formats.
> 
> Is there something particularly onerous about needing to run a codegen step
> in a build process
> (other than it being build-step number 1000 in a death by 1000 build-steps
> scenario)?
> 
> 
> >
> > IMHO that compares poorly to JSON or MsgPack, for example.
> >
> > Regards
> >
> > Antoine.
> >
> 


Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Phillip Cloud
On Wed, Aug 11, 2021 at 4:05 PM Antoine Pitrou  wrote:

>
> Le 11/08/2021 à 22:02, Phillip Cloud a écrit :
> > On Wed, Aug 11, 2021 at 3:58 PM Antoine Pitrou 
> wrote:
> >
> >>
> >> Le 11/08/2021 à 21:56, Phillip Cloud a écrit :
> >>> I can see how that might be a bit circular. Let me start from the
> >>> perspective of requirements. We want to be able to reuse Arrow's
> >>> types and schema, without having to write additional code to move back
> >>> and forth between compute IR and not-compute-IR. I think that leaves
> >>> only flatbuffers as an option.
> >>
> >> If that's the case then agreed (well, you can always embed as a raw
> >> bytestring in other formats, but that wouldn't be pretty).
> >>
> >> I just wonder what the complexity of using Flatbuffers is for e.g.
> Python.
> >>
> >
> > IMO the complexity isn't high, but the generated code is definitely not
> > idiomatic (
> > https://google.github.io/flatbuffers/flatbuffers_guide_tutorial.html)
>
> Wow. And you also have to integrate `flatc` in your build chain?
>

Yeah, that is a drawback here, though I don't see needing to run flatc as a
major downside given the upside
of not having to write additional code to move between formats.

Is there something particularly onerous about needing to run a codegen step
in a build process
(other than it being build-step number 1000 in a death by 1000 build-steps
scenario)?


>
> IMHO that compares poorly to JSON or MsgPack, for example.
>
> Regards
>
> Antoine.
>


Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Antoine Pitrou



Le 11/08/2021 à 22:02, Phillip Cloud a écrit :
> On Wed, Aug 11, 2021 at 3:58 PM Antoine Pitrou  wrote:
>
>> Le 11/08/2021 à 21:56, Phillip Cloud a écrit :
>>> I can see how that might be a bit circular. Let me start from the
>>> perspective of requirements. We want to be able to reuse Arrow's types
>>> and schema, without having to write additional code to move back and
>>> forth between compute IR and not-compute-IR. I think that leaves only
>>> flatbuffers as an option.
>>
>> If that's the case then agreed (well, you can always embed as a raw
>> bytestring in other formats, but that wouldn't be pretty).
>>
>> I just wonder what the complexity of using Flatbuffers is for e.g. Python.
>
> IMO the complexity isn't high, but the generated code is definitely not
> idiomatic (
> https://google.github.io/flatbuffers/flatbuffers_guide_tutorial.html)

Wow. And you also have to integrate `flatc` in your build chain?

IMHO that compares poorly to JSON or MsgPack, for example.

Regards

Antoine.


Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Phillip Cloud
On Wed, Aug 11, 2021 at 3:58 PM Antoine Pitrou  wrote:

>
> Le 11/08/2021 à 21:56, Phillip Cloud a écrit :
> > I can see how that might be a bit circular. Let me start from the
> > perspective of requirements. We want to be able to reuse Arrow's types
> > and schema, without having to write additional code to move back and
> > forth between compute IR and not-compute-IR. I think that leaves only
> > flatbuffers as an option.
>
> If that's the case then agreed (well, you can always embed as a raw
> bytestring in other formats, but that wouldn't be pretty).
>
> I just wonder what the complexity of using Flatbuffers is for e.g. Python.
>

IMO the complexity isn't high, but the generated code is definitely not
idiomatic (
https://google.github.io/flatbuffers/flatbuffers_guide_tutorial.html)
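
For a taste of that style, a minimal sketch following the tutorial linked
above (the `Plan` table and its fields are hypothetical, not a real Arrow
or compute IR API):

    # Sketch of driving flatc-generated Python code directly; Plan is a
    # hypothetical generated module following the tutorial's conventions.
    import flatbuffers
    import Plan  # hypothetical flatc-generated module

    builder = flatbuffers.Builder(1024)
    name = builder.CreateString("my_plan")  # offsets are built before Start()

    Plan.Start(builder)
    Plan.AddName(builder, name)
    Plan.AddVersion(builder, 1)
    plan = Plan.End(builder)

    builder.Finish(plan)
    buf = builder.Output()  # bytes ready for any consumer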


>
> Regards
>
> Antoine.
>


Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Antoine Pitrou



Le 11/08/2021 à 21:56, Phillip Cloud a écrit :
> I can see how that might be a bit circular. Let me start from the
> perspective of requirements. We want to be able to reuse Arrow's types
> and schema, without having to write additional code to move back and forth
> between compute IR and not-compute-IR. I think that leaves only flatbuffers
> as an option.

If that's the case then agreed (well, you can always embed as a raw
bytestring in other formats, but that wouldn't be pretty).

I just wonder what the complexity of using Flatbuffers is for e.g. Python.

Regards

Antoine.


Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Phillip Cloud
I can see how that might be a bit circular. Let me start from the
perspective of requirements. We want to be able to reuse Arrow's types
and schema, without having to write additional code to move back and forth
between compute IR and not-compute-IR. I think that leaves only flatbuffers
as an option.

On Wed, Aug 11, 2021 at 3:52 PM Phillip Cloud  wrote:

> On Wed, Aug 11, 2021 at 3:51 PM Antoine Pitrou  wrote:
>
>>
>>
>> Le 11/08/2021 à 21:39, Phillip Cloud a écrit :
>> > The benefit is that IR components don't interact much with
>> `flatbuffers` or
>> > `flatc` directly.
>> >
>> [...]
>> >
>> > One counter-proposal might be to just put the compute IR IDL in a
>> separate
>> > repo,
>> > but that isn't tenable because the compute IR needs arrow's type
>> information
>> > contained in `Schema.fbs`.
>>
>> This argument seems predicated on the hypothesis that the compute IR will
>> use Flatbuffers.  Is it set in stone?
>>
>
> It's not set in stone, but so far it's the leading contender due to the
> need to share elements of Schema.fbs.
>
>
>>
>> Regards
>>
>> Antoine.
>>
>


Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Phillip Cloud
On Wed, Aug 11, 2021 at 3:51 PM Antoine Pitrou  wrote:

>
>
> Le 11/08/2021 à 21:39, Phillip Cloud a écrit :
> > The benefit is that IR components don't interact much with `flatbuffers`
> or
> > `flatc` directly.
> >
> [...]
> >
> > One counter-proposal might be to just put the compute IR IDL in a
> separate
> > repo,
> > but that isn't tenable because the compute IR needs arrow's type
> information
> > contained in `Schema.fbs`.
>
> This argument seems predicated on the hypothesis that the compute IR will
> use Flatbuffers.  Is it set in stone?
>

It's not set in stone, but so far it's the leading contender due to the
need to share elements of Schema.fbs.


>
> Regards
>
> Antoine.
>


Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Antoine Pitrou




Le 11/08/2021 à 21:39, Phillip Cloud a écrit :
> The benefit is that IR components don't interact much with `flatbuffers` or
> `flatc` directly.
>
> [...]
>
> One counter-proposal might be to just put the compute IR IDL in a separate
> repo,
> but that isn't tenable because the compute IR needs arrow's type information
> contained in `Schema.fbs`.

This argument seems predicated on the hypothesis that the compute IR will
use Flatbuffers.  Is it set in stone?


Regards

Antoine.


[DISCUSS] Splitting out the Arrow format directory

2021-08-11 Thread Phillip Cloud
Hi all,

I'd like to bring up an idea from a recent thread ([1]) about moving the
`format/` directory out of the primary apache/arrow repository.

I understand from that thread there are some concerns about using
submodules,
and I definitely sympathize with those concerns.

In talking with David Li (disclaimer: we work together at Voltron Data), he
shared a great idea that I think makes everyone happy: an `apache/arrow-format`
repository that is the official mirror for the flatbuffers IDL and that
library authors should use as the source of truth.

It doesn't require a submodule, yet it also gives external projects the
ability to access the IDL without having to interact with the main arrow
repository, and it is backwards compatible to boot.

In this scenario, repositories that are currently copying in the flatbuffers
IDL can migrate to this repository at their leisure.

My motivation for this was around sharing data structures for the compute IR
proposal ([2]).

I can think of at least two ways for IR producers and consumers of all
languages to share the flatbuffers IDL:

1. A set of bindings built in some language that other languages can
integrate
   with, likely C++, that allows library users to build IR using an API.

The primary downside to this is that we'd have to deal with
building another library while working out any kinks in the IR design and
I'd
rather avoid that in the initial phases of this project.

The benefit is that IR components don't interact much with `flatbuffers` or
`flatc` directly.

2. A single location where the format lives, that doesn't require depending
on
   a large multi-language repository to access a handful of files.

I think the downside to this is that there's a bit of additional
infrastructure
to automate copying in `arrow-format`.

The benefit there is that producers and consumers can immediately start
getting
value from compute IR without having to wait for development of a new API.

One counter-proposal might be to just put the compute IR IDL in a separate
repo,
but that isn't tenable because the compute IR needs arrow's type information
contained in `Schema.fbs`.

I was hoping to avoid conflating the discussion about bindings vs direct
flatbuffer usage (at least initially just supporting one, I predict we'll
need
both ultimately) with the decision about whether to split out the format
directory, but it's a good example of a choice that would be well served by
splitting out the format directory.

I'll note that this doesn't block anything on the compute IR side, just
wanted
to surface this in a parallel thread and see what folks think.

[1]:
https://lists.apache.org/thread.html/rcebfcb4c5d0b7752fcdda6587871c2f94661b8c4e35119f0bcfb883b%40%3Cdev.arrow.apache.org%3E
[2]:
https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.ie0ne0gm762l