I support the idea of an independent repo that holds the Arrow flatbuffers format definition files.
My rationale is that the Rust implementation has a copy of the `format` directory [1] and potential drift worries me (a bit). Having a single source of truth for the format that is not part of the large mono repo would be a good thing.

Andrew

[1] https://github.com/apache/arrow-rs/tree/master/format

On Wed, Aug 11, 2021 at 2:40 PM Phillip Cloud <cpcl...@gmail.com> wrote:

> Hi all,
>
> I'd like to bring up an idea from a recent thread ([1]) about moving the
> `format/` directory out of the primary apache/arrow repository.
>
> I understand from that thread there are some concerns about using submodules,
> and I definitely sympathize with those concerns.
>
> In talking with David Li (disclaimer: we work together at Voltron Data), he has
> a great idea that I think makes everyone happy: an `apache/arrow-format`
> repository that is the official mirror for the flatbuffers IDL, that library
> authors should use as the source of truth.
>
> It doesn't require a submodule, yet it also allows external projects the
> ability to access the IDL without having to interact with the main arrow
> repository and is backwards compatible to boot.
>
> In this scenario, repositories that are currently copying in the flatbuffers
> IDL can migrate to this repository at their leisure.
>
> My motivation for this was around sharing data structures for the compute IR
> proposal ([2]).
>
> I can think of at least two ways for IR producers and consumers of all
> languages to share the flatbuffers IDL:
>
> 1. A set of bindings built in some language that other languages can integrate
>    with, likely C++, that allows library users to build IR using an API.
>
>    The primary downside to this is that we'd have to deal with building
>    another library while working out any kinks in the IR design and I'd
>    rather avoid that in the initial phases of this project.
>
>    The benefit is that IR components don't interact much with `flatbuffers` or
>    `flatc` directly.
>
> 2. A single location where the format lives, that doesn't require depending on
>    a large multi-language repository to access a handful of files.
>
>    I think the downside to this is that there's a bit of additional
>    infrastructure to automate copying in `arrow-format`.
>
>    The benefit there is that producers and consumers can immediately start
>    getting value from compute IR without having to wait for development of a
>    new API.
>
> One counter-proposal might be to just put the compute IR IDL in a separate
> repo, but that isn't tenable because the compute IR needs arrow's type
> information contained in `Schema.fbs`.
>
> I was hoping to avoid conflating the discussion about bindings vs direct
> flatbuffer usage (at least initially just supporting one, I predict we'll need
> both ultimately) with the decision about whether to split out the format
> directory, but it's a good example of a choice for which splitting out the
> format directory would be well-served.
>
> I'll note that this doesn't block anything on the compute IR side, just wanted
> to surface this in a parallel thread and see what folks think.
>
> [1]: https://lists.apache.org/thread.html/rcebfcb4c5d0b7752fcdda6587871c2f94661b8c4e35119f0bcfb883b%40%3Cdev.arrow.apache.org%3E
> [2]: https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.ie0ne0gm762l