Re: [DISCUSS] Splitting out the Arrow format directory

Jorge Cardoso Leitão Fri, 13 Aug 2021 05:03:35 -0700

Hi,

> The requirements for the compute IR as I see it are:
>
> * Implementations in IR producer and consumer languages.
> * Strongly typed or the ability to easily validate a payload


What about:

1. easy to read and write by a large number of programming languages
2. easy to read and write by humans
3. fast to validate by a large number of programming languages

I.e. make the ability of humans to read and write the format more
important than speed of validation.

Under this ordering, JSON/TOML/YAML are preferred: they are supported by
more languages and are more human-readable, even if they are not the
fastest to validate.
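
To make the three criteria concrete, here is a minimal sketch in Python
using only the standard library. The payload shape and the field names
(`op`, `args`, `type`) are invented for this illustration and are not part
of any compute IR proposal; the point is only that a human can read and
write the payload and that a small function can validate it.

```python
import json

# Hypothetical IR node for illustration only; the field names below are
# assumptions, not part of any Arrow compute IR proposal.
PAYLOAD = """
{
  "op": "add",
  "args": [
    {"column": "x", "type": "int64"},
    {"literal": 1, "type": "int64"}
  ]
}
"""


def validate(node):
    """Return a list of problems found in a decoded call node (empty if OK)."""
    problems = []
    if not isinstance(node, dict):
        return ["node must be a JSON object"]
    if not isinstance(node.get("op"), str):
        problems.append("'op' must be a string")
    args = node.get("args")
    if not isinstance(args, list):
        problems.append("'args' must be a list")
        return problems
    for i, arg in enumerate(args):
        if not isinstance(arg, dict) or not isinstance(arg.get("type"), str):
            problems.append(f"args[{i}] must be an object with a string 'type'")
    return problems


node = json.loads(PAYLOAD)
problems = validate(node)
print(problems or "payload is valid")
```

A JSON-Schema document could replace the hand-written `validate` if
third-party validation matters more than having zero dependencies.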

-----

My understanding is that for an async experience, we need the ability to
`.await` at any `read_X` call, so that if `read_X` requests more bytes
than are buffered, `read_X(...).await` triggers a new (async) request
to fill the buffer (which puts the future in a Pending state). When a
library does not offer an async version of `read_X`, any `read_X` call can
force a request to fill the buffer, which then blocks the thread. One way
around this is to wrap those blocking calls in async (e.g. via
`tokio::task::spawn_blocking`). However, this forces users to use that
runtime, or to create a new independent thread pool for their own async
work. Neither is great for low-level libraries.
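
A rough sketch of that workaround, written in Python rather than Rust so
that it runs with the standard library alone. `read_block` is a
hypothetical stand-in for a sync-only `read_X`, and `run_in_executor`
plays roughly the role that `spawn_blocking` plays in tokio; none of this
is taken from the parquet or thrift crates.

```python
import asyncio
import io


def read_block(source, n):
    """Blocking read: stands in for a sync-only `read_X` (hypothetical)."""
    return source.read(n)


async def read_block_async(source, n):
    # Hand the blocking call to the default thread pool so the event loop
    # stays free while the read is in progress.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, read_block, source, n)


async def main():
    # An in-memory buffer stands in for a file or socket.
    source = io.BytesIO(b"PAR1" + b"\x00" * 12)
    return await read_block_async(source, 4)


result = asyncio.run(main())
print(result)
```

The coupling is visible even in this small case: the caller needs both an
event loop and a thread pool just to make one blocking read awaitable.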

E.g. thrift does not offer async -> parquet-format-rs does not offer async
-> parquet does not offer async -> datafusion wraps all parquet "IO-bounded
and CPU-bounded operations" in spawn_blocking or something equivalent.

Best,
Jorge


On Thu, Aug 12, 2021 at 10:03 PM Phillip Cloud <cpcl...@gmail.com> wrote:

> On Thu, Aug 12, 2021 at 1:03 PM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
> > I agree with Antoine that we should weigh the pros and cons of
> > flatbuffers (or protobuf or thrift for that matter) over a more
> > human-friendly, simpler format like JSON or MsgPack. I also struggle a
> > bit to reason about the complexity of using flatbuffers for this.
> >
>
> Ultimately I think different representations of the format will emerge if
> compute IR is successful, and people will implement JSON/proto/thrift/etc
> versions of the IR.
>
> The requirements for the compute IR as I see it are:
>
> * Implementations in IR producer and consumer languages.
> * Strongly typed or the ability to easily validate a payload
>
> It seems like Protobuf, Flatbuffers and JSON all meet the criteria here.
> Beyond that, there's precedent in the codebase for flatbuffers (which is
> just to say that flatbuffers is the devil we know).
>
> Can people list other concrete requirements for the format? A
> non-requirement might be that there be _idiomatic_ implementations for
> every language arrow supports, for example.
>
> I think without agreement on requirements we won't ever arrive at
> consensus.
>
> The compute IR spec itself doesn't really depend on the specific choice of
> format, but we need to get some consensus on the format.
>
>
> > E.g. there is no async support for thrift, flatbuffers or protobuf in
> > Rust, which e.g. means that we can read neither parquet nor arrow IPC
> > asynchronously at the moment. These problems are usually easier to work
> > around in simpler formats.
> >
>
> Can you elaborate a bit on the lack of async support here and what it
> would mean for a particular in-memory representation to support async,
> and why that prevents reading a parquet file using async?
>
> Looking at JSON as an example, most libraries in the Rust ecosystem use
> serde and serde_json to serialize and deserialize JSON, and any async
> concerns occur at the level of a client/server library like warp (or some
> transitive dependency thereof like Hyper).
>
> Are you referring to something like the functionality implemented in
> tokio-serde-json? If so, I think you could probably build something for
> these other formats assuming they have serde support (flatbuffers notably
> does _not_, partially because of its incessant need to own everything),
> since tokio_serde is doing most of the work in tokio-serde-json. In any
> case, I don't think it's a requirement for the compute IR that there be a
> streaming transport implementation for the format.
>
>
> >
> > Best,
> > Jorge
> >
> >
> >
> > On Thu, Aug 12, 2021 at 2:43 PM Antoine Pitrou <anto...@python.org>
> > wrote:
> >
> > >
> > > On 12/08/2021 at 15:05, Wes McKinney wrote:
> > > > It seems that one adjacent problem here is how to make it simpler for
> > > > third parties (especially ones that act as front end interfaces) to
> > > > build and serialize/deserialize the IR structures with some kind of
> > > > ready-to-go middleware library, written in a language like C++.
> > >
> > > A C++ library sounds a bit complicated to deal with for Java, Rust, Go,
> > > etc. developers.
> > >
> > > I'm not sure which design decision and set of compromises would make
> > > the most sense.  But this is why I'm asking the question "why not
> > > JSON?" (+ JSON-Schema if you want to ease validation by third parties).
> > >
> > > (note I have already mentioned MsgPack, but only in the case a binary
> > > encoding is really required; it doesn't have any other advantage that I
> > > know of over JSON, and it's less ubiquitous)
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > > To do that, one would need the equivalent of arrow/type.h and related
> > > > Flatbuffers schema serialization code that lives in arrow/ipc. If you
> > > > want to be able to completely and accurately serialize Schemas, you
> > > > need quite a bit of code now.
> > > >
> > > > One possible approach (and not go crazy) would be to:
> > > >
> > > > * Move arrow/type.h and its dependencies into a standalone C++
> > > > library that can be vendored into the main apache/arrow C++ library.
> > > > I don't know how onerous arrow/type.h's transitive dependencies /
> > > > interactions are at this point (there's a lot of stuff going on in
> > > > type.cc [1] now)
> > > > * Make the namespaces exported by this library configurable, so any
> > > > library can vendor the Arrow types / IR builder APIs privately into
> > > > their project
> > > > * Maintain this "Arrow types and ComputeIR library" as an always
> > > > zero-dependency library to facilitate vendoring
> > > > * Lightweight bindings in languages we care about (like Python or R
> > > > or GLib/Ruby) could be built to the IR builder middleware library
> > > >
> > > > This seems to be more at issue than whether projects are copying the
> > > > Flatbuffers files into their project from apache/arrow or
> > > > apache/arrow-format.
> > > >
> > > > [1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.cc
> > > >
> > > > On Thu, Aug 12, 2021 at 2:05 PM Andrew Lamb <al...@influxdata.com>
> > > > wrote:
> > > >>
> > > >> I support the idea of an independent repo that has the arrow
> > > >> flatbuffers format definition files.
> > > >>
> > > >> My rationale is that the Rust implementation has a copy of the
> > > >> `format` directory [1] and potential drift worries me (a bit). Having
> > > >> a single source of truth for the format that is not part of the large
> > > >> mono repo would be a good thing.
> > > >>
> > > >> Andrew
> > > >>
> > > >> [1] https://github.com/apache/arrow-rs/tree/master/format
> > > >>
> > > >> On Wed, Aug 11, 2021 at 2:40 PM Phillip Cloud <cpcl...@gmail.com>
> > > >> wrote:
> > > >>
> > > >>> Hi all,
> > > >>>
> > > >>> I'd like to bring up an idea from a recent thread ([1]) about
> > > >>> moving the `format/` directory out of the primary apache/arrow
> > > >>> repository.
> > > >>>
> > > >>> I understand from that thread there are some concerns about using
> > > >>> submodules, and I definitely sympathize with those concerns.
> > > >>>
> > > >>> In talking with David Li (disclaimer: we work together at Voltron
> > > >>> Data), he has a great idea that I think makes everyone happy: an
> > > >>> `apache/arrow-format` repository that is the official mirror for
> > > >>> the flatbuffers IDL, that library authors should use as the source
> > > >>> of truth.
> > > >>>
> > > >>> It doesn't require a submodule, yet it also allows external
> > > >>> projects the ability to access the IDL without having to interact
> > > >>> with the main arrow repository and is backwards compatible to boot.
> > > >>>
> > > >>> In this scenario, repositories that are currently copying in the
> > > >>> flatbuffers IDL can migrate to this repository at their leisure.
> > > >>>
> > > >>> My motivation for this was around sharing data structures for the
> > > >>> compute IR proposal ([2]).
> > > >>>
> > > >>> I can think of at least two ways for IR producers and consumers of
> > > >>> all languages to share the flatbuffers IDL:
> > > >>>
> > > >>> 1. A set of bindings built in some language that other languages
> > > >>>    can integrate with, likely C++, that allows library users to
> > > >>>    build IR using an API.
> > > >>>
> > > >>> The primary downside to this is that we'd have to deal with
> > > >>> building another library while working out any kinks in the IR
> > > >>> design and I'd rather avoid that in the initial phases of this
> > > >>> project.
> > > >>>
> > > >>> The benefit is that IR components don't interact much with
> > > >>> `flatbuffers` or `flatc` directly.
> > > >>>
> > > >>> 2. A single location where the format lives, that doesn't require
> > > >>>    depending on a large multi-language repository to access a
> > > >>>    handful of files.
> > > >>>
> > > >>> I think the downside to this is that there's a bit of additional
> > > >>> infrastructure to automate copying in `arrow-format`.
> > > >>>
> > > >>> The benefit there is that producers and consumers can immediately
> > > >>> start getting value from compute IR without having to wait for
> > > >>> development of a new API.
> > > >>>
> > > >>> One counter-proposal might be to just put the compute IR IDL in a
> > > >>> separate repo, but that isn't tenable because the compute IR needs
> > > >>> arrow's type information contained in `Schema.fbs`.
> > > >>>
> > > >>> I was hoping to avoid conflating the discussion about bindings vs
> > > >>> direct flatbuffer usage (at least initially just supporting one; I
> > > >>> predict we'll need both ultimately) with the decision about whether
> > > >>> to split out the format directory, but it's a good example of a
> > > >>> choice that would be well served by splitting out the format
> > > >>> directory.
> > > >>>
> > > >>> I'll note that this doesn't block anything on the compute IR side,
> > > >>> just wanted to surface this in a parallel thread and see what folks
> > > >>> think.
> > > >>>
> > > >>> [1]:
> > > >>> https://lists.apache.org/thread.html/rcebfcb4c5d0b7752fcdda6587871c2f94661b8c4e35119f0bcfb883b%40%3Cdev.arrow.apache.org%3E
> > > >>> [2]:
> > > >>> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.ie0ne0gm762l
