Re: [DISCUSS] How to describe computation on Arrow data?

Brian Hulette Thu, 18 Mar 2021 07:49:48 -0700

I agree this would be a great development. It would also be useful for
leveraging compute engines from JS via wasm.


I've thought about something like this in the context of multi-language
relational workloads in Apache Beam, mostly just leading me to wonder if
something like it already exists. But so far I haven't found it.

On Thu, Mar 18, 2021 at 7:39 AM Wes McKinney <wesmck...@gmail.com> wrote:

> I completely agree with developing a common “query protocol” or “physical
> execution plan” IR + serialization scheme inside Apache Arrow. It may take
> some time to stabilize so we should try to avoid being hasty in closing it
> to change until more time has elapsed to allow requirements to percolate.
>
> On Thu, Mar 18, 2021 at 8:17 AM Andy Grove <andygrov...@gmail.com> wrote:
>
> > Hi Paddy,
> >
> > Thanks for raising this.
> >
> > Ballista defines computations using protobuf [1] to describe logical and
> > physical query plans, which consist of operators and expressions. It is
> > actually based on the Gandiva protobuf [2] for describing expressions.
> >
> > I see a lot of value in standardizing some of this across
> implementations.
> > Ballista is essentially becoming a distributed scheduler for Arrow and
> can
> > work with any implementation that supports this protobuf definition of
> > query plans.
> >
> > It would also make it easier to embed C++ in Rust, or Rust in C++, having
> > this common IR, so I would be all for having something like this as an
> > Arrow specification.
> >
> > Thanks,
> >
> > Andy.
> >
> > [1]
> >
> >
> https://github.com/ballista-compute/ballista/blob/main/rust/core/proto/ballista.proto
> > [2]
> >
> >
> https://github.com/apache/arrow/blob/master/cpp/src/gandiva/proto/Types.proto
> >
> >
> > On Thu, Mar 18, 2021 at 7:40 AM paddy horan <paddyho...@hotmail.com>
> > wrote:
> >
> > > Hi All,
> > >
> > > I do not have a computer science background so I may not be asking this
> > in
> > > the correct way or using the correct terminology but I wonder if we can
> > > achieve some level of standardization when describing computation over
> > > Arrow data.
> > >
> > > At the moment on the Rust side DataFusion clearly has a way to describe
> > > computation, I believe that Ballista adds the ability to serialize this
> > to
> > > allow distributed computation.  On the C++ side work is starting on a
> > > similar query engine and we already have Gandiva.  Is there an
> > opportunity
> > > to define a kind of IR for computation over Arrow data that could be
> > > adopted across implementations?
> > >
> > > In this case DataFusion could easily incorporate Gandiva to generate
> > > optimized compute kernels if they were using the same IR to describe
> > > computation.  Applications built on Arrow could "describe" computation
> in
> > > any language and take advantage or innovations across the community,
> > adding
> > > this to Arrow's zero copy data sharing could be a game changer in my
> > mind.
> > > I'm not someone who knows enough to drive this forward but I obviously
> > > would like to get involved.  For some time I was playing around with
> > using
> > > TVM's relay IR [1] and applying it to Arrow data.
> > >
> > > As the Arrow memory format has now matured I fell like this could be
> the
> > > next step.  Is there any plan for this kind of work or are we going to
> > > allow sub-projects to "go their own way"?
> > >
> > > Thanks,
> > > Paddy
> > >
> > > [1] - Introduction to Relay IR - tvm 0.8.dev0 documentation (
> apache.org
> > )<
> > > https://tvm.apache.org/docs/dev/relay_intro.html>
> > >
> > >
> >
>

Re: [DISCUSS] How to describe computation on Arrow data?

Reply via email to