Re: [DISCUSS] Splitting out the Arrow format directory

Phillip Cloud Fri, 13 Aug 2021 14:00:55 -0700

Agreed. I hope that I didn't come off as flippant with respect to
performance.


I was hoping to convey that I think focusing on performance before we have
the semantics and high level design nailed down is not time well spent.

I think the current design doesn't depend on the format,
which is a good thing: we can pick the format that best suits the needs
of the community, and since performance is a big part of Arrow,
that likely means picking a format that is also geared towards
performance.

On Fri, Aug 13, 2021 at 2:57 PM Keith Kraus <keith.j.kr...@gmail.com> wrote:

> > Personally, I do not care about the speed of IR processing right now.
> > Any non-trivial (and probably trivial too) computation done
> > by an IR consumer will dwarf the cost of IR processing. Of course,
> > we shouldn't prematurely pessimize either, but there's no reason
> > to spend time worrying about IR processing performance in my opinion
> > (yet).
>
> In other processing engines I've seen situations somewhat commonly where
> the time to build the compute graph becomes non-negligible and even more
> expensive than doing the computation itself. I've even seen situations
> where attempts were made to iteratively build a graph while executing in
> order to try to overlap the cost of building the graph with the compute
> execution.
>
> There's been a huge amount of effort put into optimizing critical kernel
> components like the hash table implementation in order to make Arrow the
> most performant analytical library possible. Architecting and designing the
> IR implementation without performance in mind from the beginning could
> potentially put us into a difficult situation later that we'd have to
> invest considerably more effort to work our way out of.
>
> On Fri, Aug 13, 2021 at 2:30 PM Weston Pace <weston.p...@gmail.com> wrote:
>
> > I believe you would need a JSON compatible version of the type system
> > (including binary values) because you'd need to at least encode
> > literals.  However, I don't think that creating a human readable
> > encoding of the Arrow type system is a bad thing in and of itself.  We
> > have tickets and get questions occasionally asking for a JSON format.
> > This could at least be a step in that direction.  I don't think you'd
> > need to add support for arrays/batches/tables.  Note, the C++
> > implementation has a JSON format that is used for testing purposes
> > (though I do not believe it is comprehensive).
> >
> > I think we could add two (potentially conflicting) requirements
> >  * Low barrier to entry for consumers
> >  * Low barrier to entry for producers
> >
> > JSON/YAML seem to lower the barrier to entry for producers.  Some
> > producers may not even be working with Arrow data (e.g. could one go
> > from SQL-literal -> JSON-literal skipping an intermediate
> > Arrow-literal step?).  I think we've also dismissed Antoine's earlier
> > point which I found the most compelling.  Handling flatbuffers adds
> > one more step that people have to integrate into their build systems.
> >
> > Flatbuffers on the other hand lowers the barrier to entry for
> > consumers.  A consumer is likely already going to have flatbuffers
> > support built in so that they can read/write IPC files.  If we adopt
> > JSON then the consumer will have to add support for a new file format
> > (or at least part of one).
> >
> > On Fri, Aug 13, 2021 at 6:46 AM Jacob Quinn <quinn.jac...@gmail.com>
> > wrote:
> > >
> > > >
> > > > I just thought of one other requirement: the format needs to support
> > > > arbitrary byte sequences.
> > > >
> > > > Can you clarify why this is needed? Is it that custom_metadata maps
> > > > should allow byte sequences as values?
> > >
> > > > On Fri, Aug 13, 2021 at 10:00 AM Phillip Cloud <cpcl...@gmail.com>
> > > > wrote:
> > >
> > > > On Fri, Aug 13, 2021 at 11:43 AM Antoine Pitrou <anto...@python.org>
> > > > wrote:
> > > >
> > > > >
> > > > > > On 13/08/2021 at 17:35, Phillip Cloud wrote:
> > > > > >
> > > > > > >> I.e. make the ability to read and write by humans be more important
> > > > > > >> than speed of validation.
> > > > > > >
> > > > > > > I think I differ on whether the IR should be easy to read and write
> > > > > > > by humans.
> > > > > > > IR is going to be predominantly read and written by machines, though
> > > > > > > of course we will need a way to inspect it for debugging.
> > > > >
> > > > > > But the code executed by machines is written by humans.  I think that's
> > > > > > mostly where the contention resides: is it easy to code, in any given
> > > > > > language, the routines required to produce or consume the IR?
> > > > >
> > > >
> > > > > Definitely not for flatbuffers, since flatbuffers is IMO annoying to use
> > > > > in any language except C++, and it's borderline annoying there too.
> > > > > Protobuf is similar (less annoying in Rust, but still annoying in Python
> > > > > and C++ IMO), though I think any binary format is going to be less
> > > > > human-friendly, by construction.
> > > >
> > > > > If we were to use something like JSON or msgpack, can someone sketch out
> > > > > the interaction between the IR and the rest of Arrow's type system?
> > > > >
> > > > > Would we need a JSON-encoded-arrow-type -> in-memory representation for
> > > > > an Arrow type in a given language?
> > > >
> > > > > I just thought of one other requirement: the format needs to support
> > > > > arbitrary byte sequences. JSON doesn't support untransformed byte
> > > > > sequences, though it's not uncommon to base64-encode a byte sequence.
> > > > > IMO that adds an unnecessary layer of complexity, which is another
> > > > > tradeoff to consider.
> > > >
> >
>
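
For readers following the byte-sequence point raised earlier in the thread, here is a
minimal sketch of the base64 workaround, using only the Python standard library. The
{"type": "binary", "value": ...} layout and both function names are hypothetical
and chosen purely for illustration; they are not part of any Arrow format or proposal.

```python
import base64
import json


def encode_binary_literal(value: bytes) -> str:
    """Serialize a binary literal to JSON (hypothetical layout, not an Arrow format)."""
    # JSON strings must be Unicode text, so the raw bytes are base64-encoded first.
    payload = base64.b64encode(value).decode("ascii")
    return json.dumps({"type": "binary", "value": payload})


def decode_binary_literal(text: str) -> bytes:
    """Recover the original bytes from the hypothetical JSON layout above."""
    doc = json.loads(text)
    if doc.get("type") != "binary":
        raise ValueError("expected a binary literal")
    # validate=True rejects characters outside the base64 alphabet.
    return base64.b64decode(doc["value"], validate=True)


raw = b"\x00\xff\xfe arrow"
encoded = encode_binary_literal(raw)
assert decode_binary_literal(encoded) == raw
```

The extra encode and decode step on every binary literal is the layer of complexity
mentioned in the thread; a binary format such as flatbuffers can carry the bytes as-is.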
