I would also be interested in having a reusable serialized format for
filter- and projection-like expressions. I think trying to go so far
as full logical query plans suitable for building a SQL engine is
perhaps a bit too far but we could start small with the use case from
the JNI Datasets PR as a motivating example. We should also consider
replacing or deprecating Gandiva's serialized expressions in favor of
something more general.

It may be a slight bikeshed issue, but I wouldn't be thrilled about
having this be based on Protocol Buffers, because of the runtime
requirement (on libprotobuf.so / libprotobuf.a) it introduces into C++
applications. Flatbuffers might be less pleasant developer UX in Java
but at least in C++ the fact that Flatbuffers results in zero build-
or runtime dependencies is a significant advantage.

On Mon, Jul 6, 2020 at 4:12 PM Andy Grove <andygrov...@gmail.com> wrote:
>
> This is something that I am also interested in.
>
> My current approach in my personal project that uses Arrow is to use
> protobuf to represent expressions (as well as logical and physical query
> plans). I used the Gandiva protobuf definition as a starting point.
>
> Protobuf works for going between different languages in the same process as
> well as for passing query plans over the network. I'm passing these
> protobuf definitions over the Flight protocol.
>
> I only have support for a few simple expressions so far, but here is my
> protobuf file for reference:
>
> https://github.com/ballista-compute/ballista/blob/main/proto/ballista.proto
>
> Andy.
>
> On Mon, Jul 6, 2020 at 1:50 PM Steve Kim <chairm...@gmail.com> wrote:
>
> > I have been following the discussion on a pull request (
> > https://github.com/apache/arrow/pull/7030) by Hongze Zhang to use the
> > high-level dataset API via JNI.
> >
> > An obstacle that was encountered in this PR is that there is not a good way
> > to pass a filter expression via JNI. Expressions have a defined
> > serialization in the C++ implementation, but this serialization includes
> > enums and types that are only defined in C++ and are not accessible in
> > other languages.
> >
> > I agree with Micah Kornfield's comment (
> > https://github.com/apache/arrow/pull/7030#discussion_r425563920) that
> > there
> > ought to be one representation that we reuse across languages. If we had
> > this cross-language functionality, then we could do the following:
> >
> >    1. build an arbitrary filter expression in Java
> >    2. serialize the expression to bytes to be passed via JNI
> >    3. deserialize from bytes to a native filter expression in the C++
> >    implementation
> >
> > Has there already been discussion about what a cross-language
> > representation of filter expressions (and possibly other parts of the
> > Dataset API) might look like? I see that we use Flatbuffers in other parts
> > of Arrow.
> >
> > What would need to change in the C++ implementation to make use of such a
> > representation?
> >
> > Steve
> >

Reply via email to