I agree this would be a great development. It would also be useful for leveraging compute engines from JS via wasm.
I've thought about something like this in the context of multi-language relational workloads in Apache Beam, mostly just leading me to wonder if something like it already exists. But so far I haven't found it.

On Thu, Mar 18, 2021 at 7:39 AM Wes McKinney <wesmck...@gmail.com> wrote:

> I completely agree with developing a common “query protocol” or “physical
> execution plan” IR + serialization scheme inside Apache Arrow. It may take
> some time to stabilize so we should try to avoid being hasty in closing it
> to change until more time has elapsed to allow requirements to percolate.
>
> On Thu, Mar 18, 2021 at 8:17 AM Andy Grove <andygrov...@gmail.com> wrote:
>
> > Hi Paddy,
> >
> > Thanks for raising this.
> >
> > Ballista defines computations using protobuf [1] to describe logical and
> > physical query plans, which consist of operators and expressions. It is
> > actually based on the Gandiva protobuf [2] for describing expressions.
> >
> > I see a lot of value in standardizing some of this across implementations.
> > Ballista is essentially becoming a distributed scheduler for Arrow and can
> > work with any implementation that supports this protobuf definition of
> > query plans.
> >
> > It would also make it easier to embed C++ in Rust, or Rust in C++, having
> > this common IR, so I would be all for having something like this as an
> > Arrow specification.
> >
> > Thanks,
> >
> > Andy.
> >
> > [1] https://github.com/ballista-compute/ballista/blob/main/rust/core/proto/ballista.proto
> > [2] https://github.com/apache/arrow/blob/master/cpp/src/gandiva/proto/Types.proto
> >
> > On Thu, Mar 18, 2021 at 7:40 AM paddy horan <paddyho...@hotmail.com> wrote:
> >
> > > Hi All,
> > >
> > > I do not have a computer science background so I may not be asking this in
> > > the correct way or using the correct terminology but I wonder if we can
> > > achieve some level of standardization when describing computation over
> > > Arrow data.
> > >
> > > At the moment on the Rust side DataFusion clearly has a way to describe
> > > computation, I believe that Ballista adds the ability to serialize this to
> > > allow distributed computation. On the C++ side work is starting on a
> > > similar query engine and we already have Gandiva. Is there an opportunity
> > > to define a kind of IR for computation over Arrow data that could be
> > > adopted across implementations?
> > >
> > > In this case DataFusion could easily incorporate Gandiva to generate
> > > optimized compute kernels if they were using the same IR to describe
> > > computation. Applications built on Arrow could "describe" computation in
> > > any language and take advantage of innovations across the community,
> > > adding this to Arrow's zero copy data sharing could be a game changer in
> > > my mind. I'm not someone who knows enough to drive this forward but I
> > > obviously would like to get involved. For some time I was playing around
> > > with using TVM's relay IR [1] and applying it to Arrow data.
> > >
> > > As the Arrow memory format has now matured I feel like this could be the
> > > next step. Is there any plan for this kind of work or are we going to
> > > allow sub-projects to "go their own way"?
> > >
> > > Thanks,
> > > Paddy
> > >
> > > [1] Introduction to Relay IR - tvm 0.8.dev0 documentation (apache.org)
> > > <https://tvm.apache.org/docs/dev/relay_intro.html>
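
For anyone skimming the thread, here is a rough sketch of what "describe the computation once, run it anywhere" could look like in miniature. It is only an illustration under loud assumptions: the node types (`Column`, `Literal`, `BinaryOp`), the operator names and the JSON wire format are all made up for this example and are not the Ballista or Gandiva protobuf schema, and plain Python lists stand in for Arrow arrays.

```python
import json
import operator
from dataclasses import dataclass
from typing import Any, Dict, List, Union


# Hypothetical expression nodes; a real IR would cover far more cases.
@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Literal:
    value: Any


@dataclass(frozen=True)
class BinaryOp:
    op: str
    left: "Expr"
    right: "Expr"


Expr = Union[Column, Literal, BinaryOp]

# Operator names are assumptions chosen for this sketch.
OPS = {"add": operator.add, "mul": operator.mul, "gt": operator.gt}


def to_dict(expr: Expr) -> Dict[str, Any]:
    """Turn an expression tree into plain data that any language can parse."""
    if isinstance(expr, Column):
        return {"kind": "column", "name": expr.name}
    if isinstance(expr, Literal):
        return {"kind": "literal", "value": expr.value}
    return {
        "kind": "binary",
        "op": expr.op,
        "left": to_dict(expr.left),
        "right": to_dict(expr.right),
    }


def from_dict(node: Dict[str, Any]) -> Expr:
    """Rebuild the expression tree on the receiving side."""
    kind = node["kind"]
    if kind == "column":
        return Column(node["name"])
    if kind == "literal":
        return Literal(node["value"])
    if kind == "binary":
        return BinaryOp(node["op"], from_dict(node["left"]), from_dict(node["right"]))
    raise ValueError(f"unknown node kind: {kind!r}")


def evaluate(expr: Expr, batch: Dict[str, List[Any]]) -> List[Any]:
    """Toy element-wise evaluator; lists stand in for Arrow arrays."""
    num_rows = len(next(iter(batch.values())))
    if isinstance(expr, Column):
        return list(batch[expr.name])
    if isinstance(expr, Literal):
        return [expr.value] * num_rows
    fn = OPS[expr.op]
    left = evaluate(expr.left, batch)
    right = evaluate(expr.right, batch)
    return [fn(l, r) for l, r in zip(left, right)]


# "Frontend" builds a + b * 2 and serializes it.
plan = BinaryOp("add", Column("a"), BinaryOp("mul", Column("b"), Literal(2)))
wire = json.dumps(to_dict(plan))

# "Engine" decodes the same description and runs it over a batch.
decoded = from_dict(json.loads(wire))
result = evaluate(decoded, {"a": [1, 2, 3], "b": [10, 20, 30]})
```

The point is only that the frontend and the engine share nothing but the serialized description, so either side could be written in a different language. JSON is used here purely to keep the sketch dependency-free; the proposals in the thread use protobuf.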