Hi,

> The requirements for the compute IR as I see it are:
>
> * Implementations in IR producer and consumer languages.
> * Strongly typed or the ability to easily validate a payload

What about:

1. easy to read and write in a large number of programming languages
2. easy to read and write by humans
3. fast to validate in a large number of programming languages

I.e. make the ability of humans to read and write the format more important than speed of validation.

With this ordering, JSON/TOML/YAML are preferred: they are supported by more languages and are more human-readable, which this ordering ranks above validation speed.

-----

My understanding is that for an async experience, we need the ability to `.await` at any `read_X` call, so that if `read_X` requests more bytes than are buffered, the `read_X(...).await` triggers a new (async) request to fill the buffer (which puts the future in a `Pending` state).

When a library does not offer an async version of `read_X`, any `read_X` call can force a request to fill the buffer, and that request now blocks the thread.

One way around this is to wrap those blocking calls in async (e.g. via `tokio::task::spawn_blocking`). However, this forces users either to use that runtime or to create a new, independent thread pool for their own async work. Neither is great for low-level libraries.

E.g. thrift does not offer async -> parquet-format-rs does not offer async -> parquet does not offer async -> datafusion wraps all parquet "IO-bounded and CPU-bounded operations" in `spawn_blocking` or something equivalent.

Best,
Jorge

On Thu, Aug 12, 2021 at 10:03 PM Phillip Cloud <cpcl...@gmail.com> wrote:

> On Thu, Aug 12, 2021 at 1:03 PM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
>
> > I agree with Antoine that we should weigh the pros and cons of flatbuffers
> > (or protobuf or thrift for that matter) over a more human-friendly,
> > simpler, format like json or MsgPack. I also struggle a bit to reason with
> > the complexity of using flatbuffers for this.
>
> Ultimately I think different representations of the format will emerge if
> compute IR is successful, and people will implement JSON/proto/thrift/etc
> versions of the IR.
>
> The requirements for the compute IR as I see it are:
>
> * Implementations in IR producer and consumer languages.
> * Strongly typed or the ability to easily validate a payload
>
> It seems like Protobuf, Flatbuffers and JSON all meet the criteria here.
> Beyond that, there's precedence in the codebase for flatbuffers (which is
> just to say that flatbuffers is the devil we know).
>
> Can people list other concrete requirements for the format? A
> non-requirement might be that there be _idiomatic_ implementations for
> every language arrow supports, for example.
>
> I think without agreement on requirements we won't ever arrive at
> consensus.
>
> The compute IR spec itself doesn't really depend on the specific choice of
> format, but we need to get some consensus on the format.
>
> > E.g. there is no async support for thrift, flatbuffers nor protobuf in
> > Rust, which e.g. means that we can't read neither parquet nor arrow IPC
> > async atm. These problems are usually easier to work around in simpler
> > formats.
>
> Can you elaborate a bit on the lack of async support here and what it would
> mean for a particular in-memory representation to support async, and why
> that prevents reading a parquet file using async?
>
> Looking at JSON as an example, most libraries in the Rust ecosystem use
> serde and serde_json to serialize and deserialize JSON, and any async
> concerns occur at the level of a client/server library like warp (or some
> transitive dependency thereof like Hyper).
>
> Are you referring to something like the functionality implemented in
> tokio-serde-json? If so, I think you could probably build something for
> these other formats assuming they have serde support (flatbuffers notably
> does _not_, partially because of its incessant need to own everything),
> since tokio_serde is doing most of the work in tokio-serde-json.
> In any case, I don't think it's a requirement for the compute IR that there
> be a streaming transport implementation for the format.
>
> > Best,
> > Jorge
> >
> > On Thu, Aug 12, 2021 at 2:43 PM Antoine Pitrou <anto...@python.org> wrote:
> >
> > > On 12/08/2021 at 15:05, Wes McKinney wrote:
> > > > It seems that one adjacent problem here is how to make it simpler for
> > > > third parties (especially ones that act as front end interfaces) to
> > > > build and serialize/deserialize the IR structures with some kind of
> > > > ready-to-go middleware library, written in a language like C++.
> > >
> > > A C++ library sounds a bit complicated to deal with for Java, Rust, Go,
> > > etc. developers.
> > >
> > > I'm not sure which design decision and set of compromises would make the
> > > most sense. But this is why I'm asking the question "why not JSON?" (+
> > > JSON-Schema if you want to ease validation by third parties).
> > >
> > > (note I have already mentioned MsgPack, but only in the case a binary
> > > encoding is really required; it doesn't have any other advantage that I
> > > know of over JSON, and it's less ubiquitous)
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > > > To do that, one would need the equivalent of arrow/type.h and related
> > > > Flatbuffers schema serialization code that lives in arrow/ipc. If you
> > > > want to be able to completely and accurately serialize Schemas, you
> > > > need quite a bit of code now.
> > > >
> > > > One possible approach (and not go crazy) would be to:
> > > >
> > > > * Move arrow/types.h and its dependencies into a standalone C++
> > > >   library that can be vendored into the main apache/arrow C++ library.
> > > >   I don't know how onerous arrow/types.h's transitive dependencies /
> > > >   interactions are at this point (there's a lot of stuff going on in
> > > >   type.cc [1] now)
> > > > * Make the namespaces exported by this library configurable, so any
> > > >   library can vendor the Arrow types / IR builder APIs privately into
> > > >   their project
> > > > * Maintain this "Arrow types and ComputeIR library" as an always
> > > >   zero-dependency library to facilitate vendoring
> > > > * Lightweight bindings in languages we care about (like Python or R or
> > > >   GLib/Ruby) could be built to the IR builder middleware library
> > > >
> > > > This seems like what is more at issue compared with rather projects
> > > > are copying the Flatbuffers files out of their project from
> > > > apache/arrow or apache/arrow-format.
> > > >
> > > > [1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.cc
> > > >
> > > > On Thu, Aug 12, 2021 at 2:05 PM Andrew Lamb <al...@influxdata.com> wrote:
> > > >>
> > > >> I support the idea of an independent repo that has the arrow flatbuffers
> > > >> format definition files.
> > > >>
> > > >> My rationale is that the Rust implementation has a copy of the `format`
> > > >> directory [1] and potential drift worries me (a bit). Having a single
> > > >> source of truth for the format that is not part of the large mono repo
> > > >> would be a good thing.
> > > >>
> > > >> Andrew
> > > >>
> > > >> [1] https://github.com/apache/arrow-rs/tree/master/format
> > > >>
> > > >> On Wed, Aug 11, 2021 at 2:40 PM Phillip Cloud <cpcl...@gmail.com> wrote:
> > > >>
> > > >>> Hi all,
> > > >>>
> > > >>> I'd like to bring up an idea from a recent thread ([1]) about moving the
> > > >>> `format/` directory out of the primary apache/arrow repository.
> > > >>>
> > > >>> I understand from that thread there are some concerns about using
> > > >>> submodules, and I definitely sympathize with those concerns.
> > > >>>
> > > >>> In talking with David Li (disclaimer: we work together at Voltron Data),
> > > >>> he has a great idea that I think makes everyone happy: an
> > > >>> `apache/arrow-format` repository that is the official mirror for the
> > > >>> flatbuffers IDL, that library authors should use as the source of truth.
> > > >>>
> > > >>> It doesn't require a submodule, yet it also allows external projects the
> > > >>> ability to access the IDL without having to interact with the main arrow
> > > >>> repository and is backwards compatible to boot.
> > > >>>
> > > >>> In this scenario, repositories that are currently copying in the
> > > >>> flatbuffers IDL can migrate to this repository at their leisure.
> > > >>>
> > > >>> My motivation for this was around sharing data structures for the
> > > >>> compute IR proposal ([2]).
> > > >>>
> > > >>> I can think of at least two ways for IR producers and consumers of all
> > > >>> languages to share the flatbuffers IDL:
> > > >>>
> > > >>> 1. A set of bindings built in some language that other languages can
> > > >>>    integrate with, likely C++, that allows library users to build IR
> > > >>>    using an API.
> > > >>>
> > > >>>    The primary downside to this is that we'd have to deal with building
> > > >>>    another library while working out any kinks in the IR design and I'd
> > > >>>    rather avoid that in the initial phases of this project.
> > > >>>
> > > >>>    The benefit is that IR components don't interact much with
> > > >>>    `flatbuffers` or `flatc` directly.
> > > >>>
> > > >>> 2. A single location where the format lives, that doesn't require
> > > >>>    depending on a large multi-language repository to access a handful of
> > > >>>    files.
> > > >>>
> > > >>>    I think the downside to this is that there's a bit of additional
> > > >>>    infrastructure to automate copying in `arrow-format`.
> > > >>>
> > > >>>    The benefit there is that producers and consumers can immediately
> > > >>>    start getting value from compute IR without having to wait for
> > > >>>    development of a new API.
> > > >>>
> > > >>> One counter-proposal might be to just put the compute IR IDL in a
> > > >>> separate repo, but that isn't tenable because the compute IR needs
> > > >>> arrow's type information contained in `Schema.fbs`.
> > > >>>
> > > >>> I was hoping to avoid conflating the discussion about bindings vs direct
> > > >>> flatbuffer usage (at least initially just supporting one, I predict
> > > >>> we'll need both ultimately) with the decision about whether to split out
> > > >>> the format directory, but it's a good example of a choice for which
> > > >>> splitting out the format directory would be well-served.
> > > >>>
> > > >>> I'll note that this doesn't block anything on the compute IR side, just
> > > >>> wanted to surface this in a parallel thread and see what folks think.
> > > >>>
> > > >>> [1]: https://lists.apache.org/thread.html/rcebfcb4c5d0b7752fcdda6587871c2f94661b8c4e35119f0bcfb883b%40%3Cdev.arrow.apache.org%3E
> > > >>> [2]: https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.ie0ne0gm762l
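
P.S. To make the async point at the top of this message concrete, here is a minimal sketch of the "wrap the blocking call" workaround. It is written in Python with `asyncio` rather than Rust, so `loop.run_in_executor` stands in for tokio's `spawn_blocking`; the two are analogous but not identical. `BlockingSource`, `read_exact`, `read_exact_async` and `heartbeat` are hypothetical names invented for this illustration and do not correspond to any API in thrift, parquet or arrow.

```python
import asyncio
import io
import time


class BlockingSource:
    """Hypothetical stand-in for a sync-only reader (not a real thrift/parquet API)."""

    def __init__(self, payload: bytes, delay: float = 0.05):
        self._buf = io.BytesIO(payload)
        self._delay = delay

    def read_exact(self, n: int) -> bytes:
        # Simulates a read that has to wait on IO to refill its buffer;
        # it blocks whichever thread calls it.
        time.sleep(self._delay)
        data = self._buf.read(n)
        if len(data) != n:
            raise EOFError(f"wanted {n} bytes, got {len(data)}")
        return data


async def read_exact_async(source: BlockingSource, n: int) -> bytes:
    # Offload the blocking call to the default thread pool so the event
    # loop stays free to run other tasks while the read waits.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, source.read_exact, n)


async def heartbeat(ticks: list, interval: float = 0.01) -> None:
    # Appends a timestamp on every pass; it only makes progress while
    # the event loop is not blocked.
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def main() -> tuple:
    source = BlockingSource(b"PAR1" + b"\x00" * 4)
    ticks: list = []
    hb = asyncio.create_task(heartbeat(ticks))
    magic = await read_exact_async(source, 4)
    hb.cancel()
    return magic, len(ticks)


magic, tick_count = asyncio.run(main())
print(magic, tick_count)
```

The heartbeat keeps ticking while the read sleeps in a worker thread, whereas calling `source.read_exact(4)` directly inside `main` would stall it. The cost is the one described above: the wrapper ties the library to a particular runtime's thread pool.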