Re: [DISCUSS] Splitting out the Arrow format directory

Andrew Lamb Thu, 12 Aug 2021 05:05:31 -0700

I support the idea of an independent repo that holds the Arrow flatbuffers
format definition files.


My rationale is that the Rust implementation has a copy of the `format`
directory [1] and potential drift worries me (a bit). Having a single
source of truth for the format that is not part of the large monorepo
would be a good thing.
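
As an illustration of how such drift could be detected, here is a minimal
Python sketch that hashes every `.fbs` file in two checkouts of the format
directory and reports the names that differ. The function names and the
directory arguments are hypothetical; this is not tooling that exists in
either repository.

```python
import hashlib
from pathlib import Path


def fbs_digests(format_dir):
    """Map each .fbs file name directly under format_dir to its SHA-256."""
    return {
        path.name: hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(Path(format_dir).glob("*.fbs"))
    }


def format_drift(upstream_dir, copy_dir):
    """Return sorted names of .fbs files that differ or exist on one side only."""
    upstream = fbs_digests(upstream_dir)
    copy = fbs_digests(copy_dir)
    return sorted(
        name
        for name in upstream.keys() | copy.keys()
        if upstream.get(name) != copy.get(name)
    )
```

Called on two local checkouts (the paths being whatever you cloned to), an
empty list means the copies agree byte for byte; any returned name is a file
to inspect by hand.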

Andrew

[1] https://github.com/apache/arrow-rs/tree/master/format

On Wed, Aug 11, 2021 at 2:40 PM Phillip Cloud <cpcl...@gmail.com> wrote:

> Hi all,
>
> I'd like to bring up an idea from a recent thread ([1]) about moving the
> `format/` directory out of the primary apache/arrow repository.
>
> I understand from that thread there are some concerns about using
> submodules, and I definitely sympathize with those concerns.
>
> In talking with David Li (disclaimer: we work together at Voltron Data), he
> has a great idea that I think makes everyone happy: an `apache/arrow-format`
> repository that is the official mirror for the flatbuffers IDL, that library
> authors should use as the source of truth.
>
> It doesn't require a submodule, yet it also allows external projects the
> ability to access the IDL without having to interact with the main arrow
> repository and is backwards compatible to boot.
>
> In this scenario, repositories that are currently copying in the
> flatbuffers IDL can migrate to this repository at their leisure.
>
> My motivation for this was around sharing data structures for the compute
> IR proposal ([2]).
>
> I can think of at least two ways for IR producers and consumers of all
> languages to share the flatbuffers IDL:
>
> 1. A set of bindings built in some language that other languages can
>    integrate with, likely C++, that allows library users to build IR using
>    an API.
>
> The primary downside to this is that we'd have to deal with building
> another library while working out any kinks in the IR design and I'd
> rather avoid that in the initial phases of this project.
>
> The benefit is that IR components don't interact much with `flatbuffers` or
> `flatc` directly.
>
> 2. A single location where the format lives, that doesn't require depending
>    on a large multi-language repository to access a handful of files.
>
> I think the downside to this is that there's a bit of additional
> infrastructure to automate copying in `arrow-format`.
>
> The benefit there is that producers and consumers can immediately start
> getting value from compute IR without having to wait for development of a
> new API.
>
> One counter-proposal might be to just put the compute IR IDL in a separate
> repo, but that isn't tenable because the compute IR needs arrow's type
> information contained in `Schema.fbs`.
>
> I was hoping to avoid conflating the discussion about bindings vs direct
> flatbuffer usage (at least initially just supporting one, I predict we'll
> need both ultimately) with the decision about whether to split out the
> format directory, but it's a good example of a choice that would be well
> served by splitting out the format directory.
>
> I'll note that this doesn't block anything on the compute IR side, just
> wanted to surface this in a parallel thread and see what folks think.
>
> [1]: https://lists.apache.org/thread.html/rcebfcb4c5d0b7752fcdda6587871c2f94661b8c4e35119f0bcfb883b%40%3Cdev.arrow.apache.org%3E
> [2]: https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.ie0ne0gm762l
>
