I think the discussion here spans two levels of computation: physical query execution plans, and potentially something "lower level". When this has come up in the past, I was a little skeptical of constraining every SDK to use the same description, so I agree with Wes's point about keeping any spec open in the short term. Ballista as an opt-in model does sound like it could be the right approach.
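To make the two levels concrete, here is a rough Rust sketch of what a minimal, serializable IR might look like, with an expression level ("lower level") nested inside a plan level. The names are mine and purely illustrative; they are not taken from ballista.proto or Gandiva's Types.proto.

// Purely illustrative sketch of a two-level IR: plan-level operators that
// carry expression-level nodes. These are NOT the actual ballista.proto /
// Gandiva Types.proto definitions.

/// Expression level: scalar expressions evaluated against record batches.
#[derive(Debug, Clone)]
enum Expr {
    /// Reference to a column by name.
    Column(String),
    /// A 64-bit integer literal (only one literal type shown).
    LiteralInt64(i64),
    /// A binary operation such as `a > b`.
    Binary { left: Box<Expr>, op: BinaryOp, right: Box<Expr> },
}

#[derive(Debug, Clone, Copy)]
enum BinaryOp {
    Add,
    GreaterThan,
}

/// Plan level: relational operators that contain expressions.
#[derive(Debug, Clone)]
enum Plan {
    Scan { table: String },
    Filter { predicate: Expr, input: Box<Plan> },
    Projection { exprs: Vec<Expr>, input: Box<Plan> },
}

fn main() {
    // SELECT a FROM t WHERE a > 10, expressed in the sketch IR above.
    let plan = Plan::Projection {
        exprs: vec![Expr::Column("a".to_string())],
        input: Box::new(Plan::Filter {
            predicate: Expr::Binary {
                left: Box::new(Expr::Column("a".to_string())),
                op: BinaryOp::GreaterThan,
                right: Box::new(Expr::LiteralInt64(10)),
            },
            input: Box::new(Plan::Scan { table: "t".to_string() }),
        }),
    };
    println!("{:?}", plan);
}

A shared spec would essentially be an agreed, language-neutral serialization of something shaped like this, which any engine could choose to consume or ignore.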
I might be misunderstanding, but I think Weld [1] is another project targeting the lower-level components? Also, I think there was a bit of effort to come up with a common expression representation within C++, but it stalled on whether to use the Gandiva expression descriptions or Flatbuffers. I can't seem to find the thread/JIRA/discussion on this; I'll try to look some more this evening. (I've also put a rough sketch related to Jed's extensibility question below the quoted thread.)

[1] https://github.com/weld-project/weld

On Thu, Mar 18, 2021 at 7:53 AM Jed Brown <j...@jedbrown.org> wrote:

> I'm interested in providing some path to make this extensible. To pick an
> example, suppose the user wants to compute the first k principal
> components. We've talked [1] about the possibility of incorporating richer
> communication semantics in Ballista (a la MPI sub-communicators) and
> numerical algorithms such as PCA would benefit. Those specific algorithms
> wouldn't belong in Arrow or Ballista core, but I think there's an
> opportunity for plugins to offer this sort of capability and it would be
> lovely if the language-independent protocol could call them. Do you see a
> good way to do this via ballista.proto?
>
> [1] https://github.com/ballista-compute/ballista/issues/303
>
> Andy Grove <andygrov...@gmail.com> writes:
>
> > Hi Paddy,
> >
> > Thanks for raising this.
> >
> > Ballista defines computations using protobuf [1] to describe logical and
> > physical query plans, which consist of operators and expressions. It is
> > actually based on the Gandiva protobuf [2] for describing expressions.
> >
> > I see a lot of value in standardizing some of this across
> > implementations. Ballista is essentially becoming a distributed
> > scheduler for Arrow and can work with any implementation that supports
> > this protobuf definition of query plans.
> >
> > It would also make it easier to embed C++ in Rust, or Rust in C++,
> > having this common IR, so I would be all for having something like this
> > as an Arrow specification.
> >
> > Thanks,
> >
> > Andy.
> >
> > [1] https://github.com/ballista-compute/ballista/blob/main/rust/core/proto/ballista.proto
> > [2] https://github.com/apache/arrow/blob/master/cpp/src/gandiva/proto/Types.proto
> >
> > On Thu, Mar 18, 2021 at 7:40 AM paddy horan <paddyho...@hotmail.com> wrote:
> >
> >> Hi All,
> >>
> >> I do not have a computer science background so I may not be asking this
> >> in the correct way or using the correct terminology, but I wonder if we
> >> can achieve some level of standardization when describing computation
> >> over Arrow data.
> >>
> >> At the moment, on the Rust side, DataFusion clearly has a way to
> >> describe computation, and I believe that Ballista adds the ability to
> >> serialize this to allow distributed computation. On the C++ side, work
> >> is starting on a similar query engine, and we already have Gandiva. Is
> >> there an opportunity to define a kind of IR for computation over Arrow
> >> data that could be adopted across implementations?
> >>
> >> In this case DataFusion could easily incorporate Gandiva to generate
> >> optimized compute kernels if they were using the same IR to describe
> >> computation. Applications built on Arrow could "describe" computation
> >> in any language and take advantage of innovations across the community;
> >> adding this to Arrow's zero-copy data sharing could be a game changer
> >> in my mind. I'm not someone who knows enough to drive this forward but
> >> I obviously would like to get involved.
> >> For some time I was playing around with using TVM's Relay IR [1] and
> >> applying it to Arrow data.
> >>
> >> As the Arrow memory format has now matured, I feel like this could be
> >> the next step. Is there any plan for this kind of work, or are we going
> >> to allow sub-projects to "go their own way"?
> >>
> >> Thanks,
> >> Paddy
> >>
> >> [1] Introduction to Relay IR - tvm 0.8.dev0 documentation (apache.org):
> >> https://tvm.apache.org/docs/dev/relay_intro.html
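Coming back to Jed's question above about extensibility via ballista.proto: one way a language-independent plan IR could stay open to things like PCA is an explicit "extension" node that core implementations only pass through to a registered plugin. A rough, purely illustrative Rust sketch follows; the names (ExtensionNode, PlanNode, the "org.example.pca" identifier) are my own assumptions and are not taken from ballista.proto.

// Purely illustrative: one shape an extensibility escape hatch could take
// in a language-independent plan IR. None of these names come from
// ballista.proto; they are assumptions for the sake of discussion.

/// A plan node the core spec does not define, identified by a registered
/// name and carrying an opaque, plugin-defined payload.
#[derive(Debug, Clone)]
struct ExtensionNode {
    /// e.g. "org.example.pca" for a principal component analysis operator
    name: String,
    /// plugin-specific options, already serialized (e.g. a protobuf Any)
    payload: Vec<u8>,
    /// child plans the extension operates on
    inputs: Vec<PlanNode>,
}

/// Core plan nodes plus the extension escape hatch.
#[derive(Debug, Clone)]
enum PlanNode {
    /// core relational operators would live here (scan, filter, project, ...)
    Scan { table: String },
    /// operators defined outside the core spec
    Extension(ExtensionNode),
}

fn main() {
    // A hypothetical "first k principal components" operator, expressed as
    // an extension over a scan of a "samples" table.
    let plan = PlanNode::Extension(ExtensionNode {
        name: "org.example.pca".to_string(),
        payload: vec![], // would carry encoded options such as k = 3
        inputs: vec![PlanNode::Scan { table: "samples".to_string() }],
    });
    println!("{:?}", plan);
}

An engine that doesn't recognize the name could reject the plan or delegate that subtree to the plugin, which seems compatible with the richer-communication idea in the linked Ballista issue.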