Re: language independent representation of filter expressions

Jacques Nadeau Sat, 11 Jul 2020 13:03:14 -0700

For reference, the doc (from eight years ago) I meant to link in my initial
message was:
https://docs.google.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/edit


On Sat, Jul 11, 2020, 11:24 AM Wes McKinney <wesmck...@gmail.com> wrote:

> On Sat, Jul 11, 2020 at 11:55 AM Jacques Nadeau <jacq...@apache.org>
> wrote:
> >
> > On Mon, Jul 6, 2020 at 2:45 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > I would also be interested in having a reusable serialized format for
> > > filter- and projection-like expressions. I think trying to go so far
> > > as full logical query plans suitable for building a SQL engine is
> > > perhaps a bit too far but we could start small with the use case from
> > > the JNI Datasets PR as a motivating example. We should also consider
> > > replacing or deprecating Gandiva's serialized expressions in favor of
> > > something more general.
> > >
> >
> > Gandiva's representation was very much focused on being general. It
> > attempts to be a generalized way to express what is known in Calcite as
> > RowExpressions (or Rex for short). This is separate from relational
> > expressions. It was built after having both defined a language
> independent
> > graph representation for early versions of Drill [1] and then working
> > extensively with Calcite and various types of expression
> transformations. I
> > don't think a second representation should be considered until there is
> > very strong proof that the Gandiva approach is somehow architecturally
> > limited and/or cannot be modified/improved to solve new needs.
> >
> >
> > > It may be a slight bikeshed issue, but I wouldn't be thrilled about
> > > having this be based on Protocol Buffers, because of the runtime
> > > requirement (on libprotobuf.so / libprotobuf.a) it introduces into C++
> > > applications. Flatbuffers might be less pleasant developer UX in Java
> > > but at least in C++ the fact that Flatbuffers results in zero build-
> > > or runtime dependencies is a significant advantage.
> > >
> >
> > There is a canonical JSON representation of all protobuf. Using that
> > representation requires zero build and runtime dependencies.
> >
> > One of the strengths of the Arrow community is the combination of SQL and
> > data science expertise. The Gandiva work is used in production within a
> SQL
> > engine on 1000s of nodes everyday (and within a Java context).
> Introducing
> > a second row expression approach feels like NIH. Especially when the
> first
> > use case is trying to pass row-wise expressions between Java and C++
> (which
> > is exactly what the Gandiva expression syntax was built to do).
> >
> > I'm against extending use of flatbuf within Arrow. The language support
> is
> > too weak. Language support isn't just about having a binding for
> different
> > languages, it is about having a high-quality binding.
>
> I think we will need to analyze this more rigorously and solicit more
> opinions. I believe the negative impact of Protobuf (to summarize:
> dependency hell, since there are other libraries out there that
> statically link protobuf symbols) on native code applications
> significantly outweighs some of the developer UX issues in Java (which
> pose a modest inconvenience when implementing the serialization
> layer). We've already seen protobuf-related dependency clashes, and so
> if we push libprotobuf into more third party libraries that need to
> interface with Arrow it will almost definitely cause some headaches.
> It may come down to choosing the lesser of evils.
>

Re: language independent representation of filter expressions

Reply via email to