For reference, the doc (from eight years ago) I meant to link in my initial message was: https://docs.google.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/edit
On Sat, Jul 11, 2020, 11:24 AM Wes McKinney <wesmck...@gmail.com> wrote: > On Sat, Jul 11, 2020 at 11:55 AM Jacques Nadeau <jacq...@apache.org> > wrote: > > > > On Mon, Jul 6, 2020 at 2:45 PM Wes McKinney <wesmck...@gmail.com> wrote: > > > > > I would also be interested in having a reusable serialized format for > > > filter- and projection-like expressions. I think trying to go so far > > > as full logical query plans suitable for building a SQL engine is > > > perhaps a bit too far but we could start small with the use case from > > > the JNI Datasets PR as a motivating example. We should also consider > > > replacing or deprecating Gandiva's serialized expressions in favor of > > > something more general. > > > > > > > Gandiva's representation was very much focused on being general. It > > attempts to be a generalized way to express what is known in Calcite as > > RowExpressions (or Rex for short). This is separate from relational > > expressions. It was built after having both defined a language > independent > > graph representation for early versions of Drill [1] and then working > > extensively with Calcite and various types of expression > transformations. I > > don't think a second representation should be considered until there is > > very strong proof that the Gandiva approach is somehow architecturally > > limited and/or cannot be modified/improved to solve new needs. > > > > > > > It may be a slight bikeshed issue, but I wouldn't be thrilled about > > > having this be based on Protocol Buffers, because of the runtime > > > requirement (on libprotobuf.so / libprotobuf.a) it introduces into C++ > > > applications. Flatbuffers might be less pleasant developer UX in Java > > > but at least in C++ the fact that Flatbuffers results in zero build- > > > or runtime dependencies is a significant advantage. > > > > > > > There is a canonical JSON representation of all protobuf. Using that > > representation requires zero build and runtime dependencies. > > > > One of the strengths of the Arrow community is the combination of SQL and > > data science expertise. The Gandiva work is used in production within a > SQL > > engine on 1000s of nodes everyday (and within a Java context). > Introducing > > a second row expression approach feels like NIH. Especially when the > first > > use case is trying to pass row-wise expressions between Java and C++ > (which > > is exactly what the Gandiva expression syntax was built to do). > > > > I'm against extending use of flatbuf within Arrow. The language support > is > > too weak. Language support isn't just about having a binding for > different > > languages, it is about having a high-quality binding. > > I think we will need to analyze this more rigorously and solicit more > opinions. I believe the negative impact of Protobuf (to summarize: > dependency hell, since there are other libraries out there that > statically link protobuf symbols) on native code applications > significantly outweighs some of the developer UX issues in Java (which > pose a modest inconvenience when implementing the serialization > layer). We've already seen protobuf-related dependency clashes, and so > if we push libprotobuf into more third party libraries that need to > interface with Arrow it will almost definitely cause some headaches. > It may come down to choosing the lesser of evils. >