Re: [DISCUSS] Splitting out the Arrow format directory

Weston Pace Fri, 13 Aug 2021 11:30:43 -0700

I believe you would need a JSON-compatible version of the type system
(including binary values) because you'd need to encode literals at the
very least.  However, I don't think that creating a human-readable
encoding of the Arrow type system is a bad thing in and of itself.  We
have tickets and occasionally get questions asking for a JSON format.
This could at least be a step in that direction.  I don't think you'd
need to add support for arrays/batches/tables.  Note that the C++
implementation has a JSON format that is used for testing purposes
(though I do not believe it is comprehensive).
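
To make the "binary values" part concrete, here is a minimal sketch of
how a literal could round-trip through JSON.  The {"type", "value"}
layout and the helper names encode_literal / decode_literal are
hypothetical, invented for illustration; this is not the C++ testing
format or any agreed-upon encoding.  Base64 is assumed only because
JSON strings cannot carry arbitrary bytes.

```python
import base64
import json


def encode_literal(type_name, value):
    """Encode a scalar literal as a JSON-compatible dict (hypothetical layout)."""
    if type_name == "binary":
        # JSON strings must be valid Unicode, so raw bytes are base64-encoded.
        value = base64.b64encode(bytes(value)).decode("ascii")
    return {"type": type_name, "value": value}


def decode_literal(obj):
    """Invert encode_literal, restoring bytes for the binary type."""
    if obj["type"] == "binary":
        return base64.b64decode(obj["value"])
    return obj["value"]


# Round-trip a byte sequence that is not valid UTF-8.
raw = b"\x00\xff\x10arrow"
text = json.dumps(encode_literal("binary", raw))
roundtrip = decode_literal(json.loads(text))
```

The consumer has to know from the type tag that the string is base64,
which is the extra layer a JSON-based format would need to specify.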


I think we could add two (potentially conflicting) requirements:
 * Low barrier to entry for consumers
 * Low barrier to entry for producers

JSON/YAML seem to lower the barrier to entry for producers.  Some
producers may not even be working with Arrow data (e.g. could one go
from SQL-literal -> JSON-literal, skipping an intermediate
Arrow-literal step?).  I think we've also dismissed Antoine's earlier
point, which I found the most compelling.  Handling flatbuffers adds
one more step that people have to integrate into their build systems.
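
As a rough illustration of the SQL-literal -> JSON-literal path, the
sketch below maps a tiny subset of SQL literals straight to JSON using
only the Python standard library, with no Arrow library involved.  The
function name sql_literal_to_json, the output layout, and the choice of
type tags are all assumptions made for this example, not a proposal.

```python
import base64
import json
import re


def sql_literal_to_json(sql):
    """Map a small subset of SQL literals to a hypothetical JSON literal form."""
    sql = sql.strip()

    # Hex string literal such as X'DEADBEEF' becomes base64-encoded binary.
    m = re.fullmatch(r"[xX]'((?:[0-9a-fA-F]{2})*)'", sql)
    if m:
        payload = base64.b64encode(bytes.fromhex(m.group(1))).decode("ascii")
        return {"type": "binary", "value": payload}

    # Plain integer literal.
    if re.fullmatch(r"[+-]?\d+", sql):
        return {"type": "int64", "value": int(sql)}

    # Quoted string literal; a doubled quote '' is an escaped single quote.
    m = re.fullmatch(r"'((?:[^']|'')*)'", sql)
    if m:
        return {"type": "utf8", "value": m.group(1).replace("''", "'")}

    raise ValueError(f"unsupported literal: {sql!r}")


encoded = json.dumps([sql_literal_to_json(s) for s in ("42", "'it''s'", "X'DEADBEEF'")])
```

A real producer would need many more literal kinds (decimals, dates,
NULL), but none of them obviously require an Arrow dependency.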

Flatbuffers, on the other hand, lowers the barrier to entry for
consumers.  A consumer is likely to already have flatbuffers support
built in so that it can read/write IPC files.  If we adopt JSON, then
the consumer will have to add support for a new file format (or at
least part of one).

On Fri, Aug 13, 2021 at 6:46 AM Jacob Quinn <quinn.jac...@gmail.com> wrote:
>
> >
> > I just thought of one other requirement: the format needs to support
> > arbitrary byte sequences.
> >
> Can you clarify why this is needed? Is it that custom_metadata maps should
> allow byte sequences as values?
>
> On Fri, Aug 13, 2021 at 10:00 AM Phillip Cloud <cpcl...@gmail.com> wrote:
>
> > On Fri, Aug 13, 2021 at 11:43 AM Antoine Pitrou <anto...@python.org>
> > wrote:
> >
> > >
> > > Le 13/08/2021 à 17:35, Phillip Cloud a écrit :
> > > >
> > > >> I.e. make the ability to read and write by humans be more important
> > than
> > > >> speed of validation.
> > > >
> > > > I think I differ on whether the IR should be easy to read and write by
> > > > humans.
> > > > IR is going to be predominantly read and written by machines, though of
> > > > course
> > > > we will need a way to inspect it for debugging.
> > >
> > > But the code executed by machines is written by humans.  I think that's
> > > mostly where the contention resides: is it easy to code, in any given
> > > language, the routines required to produce or consume the IR?
> > >
> >
> > Definitely not for flatbuffers, since flatbuffers is IMO annoying to use in
> > any language except C++,
> > and it's borderline annoying there too. Protobuf is similar (less annoying
> > in Rust,
> > but still annoying in Python and C++ IMO), though I think any binary format
> > is going to be
> > less human-friendly, by construction.
> >
> > If we were to use something like JSON or msgpack, can someone sketch out
> > the interaction
> > between the IR and the rest of arrow's type system?
> >
> > Would we need a JSON-encoded-arrow-type -> in-memory representation for an
> > Arrow type in a given language?
> >
> > I just thought of one other requirement: the format needs to support
> > arbitrary byte sequences. JSON
> > doesn't support untransformed byte sequences, though it's not uncommon to
> > base64-encode a byte sequence.
> > IMO that adds an unnecessary layer of complexity, which is another tradeoff
> > to consider.
> >
