I would also be interested in having a reusable serialized format for filter- and projection-like expressions. I think trying to go so far as full logical query plans suitable for building a SQL engine is perhaps a bit too far but we could start small with the use case from the JNI Datasets PR as a motivating example. We should also consider replacing or deprecating Gandiva's serialized expressions in favor of something more general.
It may be a slight bikeshed issue, but I wouldn't be thrilled about having this be based on Protocol Buffers, because of the runtime requirement (on libprotobuf.so / libprotobuf.a) it introduces into C++ applications. Flatbuffers might be less pleasant developer UX in Java but at least in C++ the fact that Flatbuffers results in zero build- or runtime dependencies is a significant advantage. On Mon, Jul 6, 2020 at 4:12 PM Andy Grove <andygrov...@gmail.com> wrote: > > This is something that I am also interested in. > > My current approach in my personal project that uses Arrow is to use > protobuf to represent expressions (as well as logical and physical query > plans). I used the Gandiva protobuf definition as a starting point. > > Protobuf works for going between different languages in the same process as > well as for passing query plans over the network. I'm passing these > protobuf definitions over the Flight protocol. > > I only have support for a few simple expressions so far, but here is my > protobuf file for reference: > > https://github.com/ballista-compute/ballista/blob/main/proto/ballista.proto > > Andy. > > On Mon, Jul 6, 2020 at 1:50 PM Steve Kim <chairm...@gmail.com> wrote: > > > I have been following the discussion on a pull request ( > > https://github.com/apache/arrow/pull/7030) by Hongze Zhang to use the > > high-level dataset API via JNI. > > > > An obstacle that was encountered in this PR is that there is not a good way > > to pass a filter expression via JNI. Expressions have a defined > > serialization in the C++ implementation, but this serialization includes > > enums and types that are only defined in C++ and are not accessible in > > other languages. > > > > I agree with Micah Kornfield's comment ( > > https://github.com/apache/arrow/pull/7030#discussion_r425563920) that > > there > > ought to be one representation that we reuse across languages. If we had > > this cross-language functionality, then we could do the following: > > > > 1. build an arbitrary filter expression in Java > > 2. serialize the expression to bytes to be passed via JNI > > 3. deserialize from bytes to a native filter expression in the C++ > > implementation > > > > Has there already been discussion about what a cross-language > > representation of filter expressions (and possibly other parts of the > > Dataset API) might look like? I see that we use Flatbuffers in other parts > > of Arrow. > > > > What would need to change in the C++ implementation to make use of such a > > representation? > > > > Steve > >