Re: [DISCUSS] Splitting out the Arrow format directory

Phillip Cloud Wed, 11 Aug 2021 16:16:27 -0700

On Wed, Aug 11, 2021, 19:05 Weston Pace <weston.p...@gmail.com> wrote:


> >> The benefit is that IR components don't interact much with `flatbuffers`
> >> or `flatc` directly.
> >>
> [...]
> >>
> >> One counter-proposal might be to just put the compute IR IDL in a
> >> separate repo, but that isn't tenable because the compute IR needs
> >> arrow's type information contained in `Schema.fbs`.
>
> > This argument seems predicated on the hypothesis that the compute IR will
> > use Flatbuffers.  Is it set in stone?
>
> +1 for the original proposal (mirror repo for specs).  I don't think
> we have to figure out the IR format.  It makes sense for all language
> independent specs to be in a single place regardless of format.  If IR
> picked JSON I would still argue the JSON schemas for IR belong in the
> same repository as the Arrow columnar format flatbuffers files.  It
> makes it clear what is spec and what is implementation / toolkit.
> Especially since a mirror repo should be pretty low maintenance.
>

That's a good point. I hadn't considered that point of view, but I think
you're right that specs, regardless of wire format, should remain together.


> On Wed, Aug 11, 2021 at 11:34 AM Antoine Pitrou <anto...@python.org>
> wrote:
> >
> >
> > On 11/08/2021 at 23:06, Phillip Cloud wrote:
> > > On Wed, Aug 11, 2021 at 4:22 PM Antoine Pitrou <anto...@python.org>
> > > wrote:
> > >
> > >> On 11/08/2021 at 22:16, Phillip Cloud wrote:
> > >>>
> > >>> Yeah, that is a drawback here, though I don't see needing to run
> > >>> flatc as a major downside given the upside of not having to write
> > >>> additional code to move between formats.
> > >>
> > >> That's only an advantage if you already know how to read the Arrow IPC
> > >> format (and, yes, in this case you already run `flatc`).  Some projects
> > >> probably don't care about Arrow IPC (Dask, for example).
> > >
> > >
> > > I don't think it's about the IPC though, at least for the compute IR
> > > use case.
> > > Am I missing something there?
> >
> > If you're not handling the Arrow IPC format, then you probably don't
> > have an encoder/decoder for Schema.fbs, so the "upside of not having to
> > write additional code to move between formats" doesn't exist (unless I'm
> > misunderstanding your point?).
> >
> > > I do think a downside of not using something like JSON or msgpack is
> > > that schema validation must be implemented by both the producer and the
> > > consumer.
> > > That means we'd have a number of other consequential decisions to make:
> > >
> > > * Do we provide the validation library?
> > > * If not, do all the languages Arrow supports have high-quality
> > > libraries for validating schemas?
> > > * If so, then we have to implement/maintain/release/bugfix that.
> >
> > This is true.  However, Flatbuffers doesn't validate much on its own,
> > either, because its IDL is not expressive enough.  For example,
> > `Schema.fbs` allows you to declare an INT8 field with children, a LIST
> > field without any children, a non-nullable NULL field...
> >
> > (also, there's JSON Schema: https://json-schema.org/)
> >
> > Regards
> >
> > Antoine.
>
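
To make Antoine's point about the limits of the IDL concrete, here is a
minimal sketch of the kind of semantic check a producer or consumer would
still have to write by hand, whatever the wire format. The `Field` class and
`validate_field` function are hypothetical, simplified stand-ins for a decoded
`Schema.fbs` field; they are not part of Arrow or Flatbuffers. The three rules
are only the examples mentioned above, not a complete validator.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    # Hypothetical, simplified stand-in for a decoded Schema.fbs Field.
    name: str
    type_name: str  # e.g. "INT8", "LIST", "NULL"
    nullable: bool = True
    children: List["Field"] = field(default_factory=list)


def validate_field(f: Field) -> List[str]:
    """Collect semantic errors that the IDL alone does not rule out."""
    errors = []
    if f.type_name == "INT8" and f.children:
        errors.append(f"{f.name}: INT8 field must not have children")
    if f.type_name == "LIST" and not f.children:
        errors.append(f"{f.name}: LIST field must have at least one child")
    if f.type_name == "NULL" and not f.nullable:
        errors.append(f"{f.name}: NULL field must be nullable")
    # Recurse so that nested fields are checked as well.
    for child in f.children:
        errors.extend(validate_field(child))
    return errors


if __name__ == "__main__":
    bad = Field("x", "INT8", children=[Field("y", "NULL", nullable=False)])
    for message in validate_field(bad):
        print(message)
```

Returning a list of messages rather than raising on the first problem is just
one possible design; it lets a caller report every issue in a schema at once.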
