Agreed. I hope that I didn't come off as flippant with respect to performance.
I was hoping to convey that I think focusing on performance before we have the semantics and high-level design nailed down is not time well spent. I think the current design doesn't depend on the format, which is a good thing: we can pick the format that best suits the needs of the community, and since performance is a big part of Arrow, that likely means picking a format that is also geared towards performance.

On Fri, Aug 13, 2021 at 2:57 PM Keith Kraus <keith.j.kr...@gmail.com> wrote:

> > Personally, I do not care about the speed of IR processing right now.
> > Any non-trivial (and probably trivial too) computation done
> > by an IR consumer will dwarf the cost of IR processing. Of course,
> > we shouldn't prematurely pessimize either, but there's no reason
> > to spend time worrying about IR processing performance in my opinion
> > (yet).
>
> In other processing engines I've seen situations somewhat commonly where
> the time to build the compute graph becomes non-negligible and even more
> expensive than doing the computation itself. I've even seen situations
> where attempts were made to iteratively build a graph while executing in
> order to try to overlap the cost of building the graph with the compute
> execution.
>
> There's been a huge amount of effort put into optimizing critical kernel
> components like the hash table implementation in order to make Arrow the
> most performant analytical library possible. Architecting and designing the
> IR implementation without performance in mind from the beginning could
> potentially put us into a difficult situation later that we'd have to
> invest considerably more effort to work our way out of.
>
> On Fri, Aug 13, 2021 at 2:30 PM Weston Pace <weston.p...@gmail.com> wrote:
>
> > I believe you would need a JSON compatible version of the type system
> > (including binary values) because you'd need to at least encode
> > literals. However, I don't think that creating a human readable
> > encoding of the Arrow type system is a bad thing in and of itself. We
> > have tickets and get questions occasionally asking for a JSON format.
> > This could at least be a step in that direction. I don't think you'd
> > need to add support for arrays/batches/tables. Note, the C++
> > implementation has a JSON format that is used for testing purposes
> > (though I do not believe it is comprehensive).
> >
> > I think we could add two (potentially conflicting) requirements
> > * Low barrier to entry for consumers
> > * Low barrier to entry for producers
> >
> > JSON/YAML seem to lower the barrier to entry for producers. Some
> > producers may not even be working with Arrow data (e.g. could one go
> > from SQL-literal -> JSON-literal skipping an intermediate
> > Arrow-literal step?). I think we've also dismissed Antoine's earlier
> > point which I found the most compelling. Handling flatbuffers adds
> > one more step that people have to integrate into their build systems.
> >
> > Flatbuffers on the other hand lowers the barrier to entry for
> > consumers. A consumer is likely already going to have flatbuffers
> > support built in so that they can read/write IPC files. If we adopt
> > JSON then the consumer will have to add support for a new file format
> > (or at least part of one).
> >
> > On Fri, Aug 13, 2021 at 6:46 AM Jacob Quinn <quinn.jac...@gmail.com> wrote:
> >
> > > > I just thought of one other requirement: the format needs to support
> > > > arbitrary byte sequences.
> > >
> > > Can you clarify why this is needed? Is it that custom_metadata maps
> > > should allow byte sequences as values?
> > >
> > > On Fri, Aug 13, 2021 at 10:00 AM Phillip Cloud <cpcl...@gmail.com> wrote:
> > >
> > > > On Fri, Aug 13, 2021 at 11:43 AM Antoine Pitrou <anto...@python.org> wrote:
> > > >
> > > > > On 13/08/2021 at 17:35, Phillip Cloud wrote:
> > > > >
> > > > > >> I.e. make the ability to read and write by humans be more
> > > > > >> important than speed of validation.
> > > > > >
> > > > > > I think I differ on whether the IR should be easy to read and
> > > > > > write by humans.
> > > > > > IR is going to be predominantly read and written by machines,
> > > > > > though of course
> > > > > > we will need a way to inspect it for debugging.
> > > > >
> > > > > But the code executed by machines is written by humans. I think
> > > > > that's mostly where the contention resides: is it easy to code,
> > > > > in any given language, the routines required to produce or
> > > > > consume the IR?
> > > >
> > > > Definitely not for flatbuffers, since flatbuffers is IMO annoying to
> > > > use in any language except C++,
> > > > and it's borderline annoying there too. Protobuf is similar (less
> > > > annoying in Rust,
> > > > but still annoying in Python and C++ IMO), though I think any binary
> > > > format is going to be
> > > > less human-friendly, by construction.
> > > >
> > > > If we were to use something like JSON or msgpack, can someone sketch
> > > > out the interaction
> > > > between the IR and the rest of arrow's type system?
> > > >
> > > > Would we need a JSON-encoded-arrow-type -> in-memory representation
> > > > for an Arrow type in a given language?
> > > >
> > > > I just thought of one other requirement: the format needs to support
> > > > arbitrary byte sequences. JSON
> > > > doesn't support untransformed byte sequences, though it's not
> > > > uncommon to base64-encode a byte sequence.
> > > > IMO that adds an unnecessary layer of complexity, which is another
> > > > tradeoff to consider.
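
For concreteness, here is a minimal Python sketch of the base64 workaround mentioned at the end of the quoted thread. The literal layout (`{"type": "binary", "value": ...}`) and the helper names are made up for illustration and are not part of any proposed IR; only the standard-library `json` and `base64` calls are real.

```python
import base64
import json


def encode_binary_literal(raw: bytes) -> str:
    """Serialize a bytes literal to JSON text.

    JSON has no byte-string type, so the payload is base64-encoded first.
    The {"type": ..., "value": ...} layout is a hypothetical example.
    """
    payload = {"type": "binary", "value": base64.b64encode(raw).decode("ascii")}
    return json.dumps(payload)


def decode_binary_literal(text: str) -> bytes:
    """Inverse of encode_binary_literal: parse the JSON, then undo base64."""
    payload = json.loads(text)
    if payload.get("type") != "binary":
        raise ValueError("not a binary literal")
    # validate=True rejects characters outside the base64 alphabet.
    return base64.b64decode(payload["value"], validate=True)


raw = b"\x00\xff\x10arrow"
encoded = encode_binary_literal(raw)
roundtripped = decode_binary_literal(encoded)
```

Both the producer and the consumer have to agree on this extra encode/decode step, which is the added layer of complexity the thread refers to. A binary format such as flatbuffers would carry the bytes unchanged.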