Re: [DISCUSS] How to describe computation on Arrow data?

Antoine Pitrou Fri, 19 Mar 2021 09:37:45 -0700

If we want this format to be common to different execution engines thenit seems like it should represent logical expressions indeed (which maybe implemented by different physical operators, depending on theexecution engine). But I'm no expert in the matter.


Regards

Antoine.


Le 18/03/2021 à 17:22, Andrew Lamb a écrit :

Any higher level physical execution plan most likely needs a way to
represent expressions. Thus focusing initially on a standard for
expressions might be a good way to add value but keep the scope of the
effort reasonable

On Thu, Mar 18, 2021 at 11:49 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

I think there might be discussion on two levels of computation, physical
query execution plans, and potentially something "lower level"?  When this
has come up in the past, I was a little skeptical of constraining every SDK
to use the same description, so I agree with Wes's point about keeping any
spec open in the short term.  Ballista as an opt-in model, does sound like
possibly the right approach.

I might be misunderstanding, but I think Weld [1] is another project
targeting the lower level components?

Also, I think there was a little bit of effort to come up with a common
expression representation within C++, but got stalled on whether to use the
Gandiva expression descriptions or Flatbuffers, I can't seem to find the
thread/JIRA/discussion on this.  I'll try to look some more this evening.

[1] https://github.com/weld-project/weld

On Thu, Mar 18, 2021 at 7:53 AM Jed Brown <j...@jedbrown.org> wrote:

I'm interested in providing some path to make this extensible. To pick an
example, suppose the user wants to compute the first k principle
components. We've talked [1] about the possibility of incorporating

richer

communication semantics in Ballista (a la MPI sub-communicators) and
numerical algorithms such as PCA would benefit. Those specific algorithms
wouldn't belong in Arrow or Ballista core, but I think there's an
opportunity for plugins to offer this sort of capability and it would be
lovely if the language-independent protocol could call them. Do you see a
good way to do this via ballista.proto?

[1] https://github.com/ballista-compute/ballista/issues/303

Andy Grove <andygrov...@gmail.com> writes:

Hi Paddy,

Thanks for raising this.

Ballista defines computations using protobuf [1] to describe logical

and

physical query plans, which consist of operators and expressions. It is
actually based on the Gandiva protobuf [2] for describing expressions.

I see a lot of value in standardizing some of this across

implementations.

Ballista is essentially becoming a distributed scheduler for Arrow and

can

work with any implementation that supports this protobuf definition of
query plans.

It would also make it easier to embed C++ in Rust, or Rust in C++,

having

this common IR, so I would be all for having something like this as an
Arrow specification.

Thanks,

Andy.

[1]

https://github.com/ballista-compute/ballista/blob/main/rust/core/proto/ballista.proto

[2]

https://github.com/apache/arrow/blob/master/cpp/src/gandiva/proto/Types.proto



On Thu, Mar 18, 2021 at 7:40 AM paddy horan <paddyho...@hotmail.com>

wrote:

Hi All,

I do not have a computer science background so I may not be asking

this

in

the correct way or using the correct terminology but I wonder if we

can

achieve some level of standardization when describing computation over
Arrow data.

At the moment on the Rust side DataFusion clearly has a way to

describe

computation, I believe that Ballista adds the ability to serialize

this

to

allow distributed computation.  On the C++ side work is starting on a
similar query engine and we already have Gandiva.  Is there an

opportunity

to define a kind of IR for computation over Arrow data that could be
adopted across implementations?

In this case DataFusion could easily incorporate Gandiva to generate
optimized compute kernels if they were using the same IR to describe
computation.  Applications built on Arrow could "describe" computation

in

any language and take advantage or innovations across the community,

adding

this to Arrow's zero copy data sharing could be a game changer in my

mind.

I'm not someone who knows enough to drive this forward but I obviously
would like to get involved.  For some time I was playing around with

using

TVM's relay IR [1] and applying it to Arrow data.

As the Arrow memory format has now matured I fell like this could be

the

next step.  Is there any plan for this kind of work or are we going to
allow sub-projects to "go their own way"?

Thanks,
Paddy

[1] - Introduction to Relay IR - tvm 0.8.dev0 documentation (

apache.org

)<

https://tvm.apache.org/docs/dev/relay_intro.html>

Re: [DISCUSS] How to describe computation on Arrow data?

Reply via email to