I'm interested in providing some path to make this extensible. To pick an example, suppose the user wants to compute the first k principle components. We've talked [1] about the possibility of incorporating richer communication semantics in Ballista (a la MPI sub-communicators) and numerical algorithms such as PCA would benefit. Those specific algorithms wouldn't belong in Arrow or Ballista core, but I think there's an opportunity for plugins to offer this sort of capability and it would be lovely if the language-independent protocol could call them. Do you see a good way to do this via ballista.proto?
[1] https://github.com/ballista-compute/ballista/issues/303 Andy Grove <andygrov...@gmail.com> writes: > Hi Paddy, > > Thanks for raising this. > > Ballista defines computations using protobuf [1] to describe logical and > physical query plans, which consist of operators and expressions. It is > actually based on the Gandiva protobuf [2] for describing expressions. > > I see a lot of value in standardizing some of this across implementations. > Ballista is essentially becoming a distributed scheduler for Arrow and can > work with any implementation that supports this protobuf definition of > query plans. > > It would also make it easier to embed C++ in Rust, or Rust in C++, having > this common IR, so I would be all for having something like this as an > Arrow specification. > > Thanks, > > Andy. > > [1] > https://github.com/ballista-compute/ballista/blob/main/rust/core/proto/ballista.proto > [2] > https://github.com/apache/arrow/blob/master/cpp/src/gandiva/proto/Types.proto > > > On Thu, Mar 18, 2021 at 7:40 AM paddy horan <paddyho...@hotmail.com> wrote: > >> Hi All, >> >> I do not have a computer science background so I may not be asking this in >> the correct way or using the correct terminology but I wonder if we can >> achieve some level of standardization when describing computation over >> Arrow data. >> >> At the moment on the Rust side DataFusion clearly has a way to describe >> computation, I believe that Ballista adds the ability to serialize this to >> allow distributed computation. On the C++ side work is starting on a >> similar query engine and we already have Gandiva. Is there an opportunity >> to define a kind of IR for computation over Arrow data that could be >> adopted across implementations? >> >> In this case DataFusion could easily incorporate Gandiva to generate >> optimized compute kernels if they were using the same IR to describe >> computation. Applications built on Arrow could "describe" computation in >> any language and take advantage or innovations across the community, adding >> this to Arrow's zero copy data sharing could be a game changer in my mind. >> I'm not someone who knows enough to drive this forward but I obviously >> would like to get involved. For some time I was playing around with using >> TVM's relay IR [1] and applying it to Arrow data. >> >> As the Arrow memory format has now matured I fell like this could be the >> next step. Is there any plan for this kind of work or are we going to >> allow sub-projects to "go their own way"? >> >> Thanks, >> Paddy >> >> [1] - Introduction to Relay IR - tvm 0.8.dev0 documentation (apache.org)< >> https://tvm.apache.org/docs/dev/relay_intro.html> >> >>