Re: [DISCUSS] How to describe computation on Arrow data?

Jed Brown Thu, 18 Mar 2021 07:46:11 -0700

I'm interested in providing some path to make this extensible. To pick an 
example, suppose the user wants to compute the first k principle components. 
We've talked [1] about the possibility of incorporating richer communication 
semantics in Ballista (a la MPI sub-communicators) and numerical algorithms 
such as PCA would benefit. Those specific algorithms wouldn't belong in Arrow 
or Ballista core, but I think there's an opportunity for plugins to offer this 
sort of capability and it would be lovely if the language-independent protocol 
could call them. Do you see a good way to do this via ballista.proto?


[1] https://github.com/ballista-compute/ballista/issues/303

Andy Grove <andygrov...@gmail.com> writes:

> Hi Paddy,
>
> Thanks for raising this.
>
> Ballista defines computations using protobuf [1] to describe logical and
> physical query plans, which consist of operators and expressions. It is
> actually based on the Gandiva protobuf [2] for describing expressions.
>
> I see a lot of value in standardizing some of this across implementations.
> Ballista is essentially becoming a distributed scheduler for Arrow and can
> work with any implementation that supports this protobuf definition of
> query plans.
>
> It would also make it easier to embed C++ in Rust, or Rust in C++, having
> this common IR, so I would be all for having something like this as an
> Arrow specification.
>
> Thanks,
>
> Andy.
>
> [1]
> https://github.com/ballista-compute/ballista/blob/main/rust/core/proto/ballista.proto
> [2]
> https://github.com/apache/arrow/blob/master/cpp/src/gandiva/proto/Types.proto
>
>
> On Thu, Mar 18, 2021 at 7:40 AM paddy horan <paddyho...@hotmail.com> wrote:
>
>> Hi All,
>>
>> I do not have a computer science background so I may not be asking this in
>> the correct way or using the correct terminology but I wonder if we can
>> achieve some level of standardization when describing computation over
>> Arrow data.
>>
>> At the moment on the Rust side DataFusion clearly has a way to describe
>> computation, I believe that Ballista adds the ability to serialize this to
>> allow distributed computation.  On the C++ side work is starting on a
>> similar query engine and we already have Gandiva.  Is there an opportunity
>> to define a kind of IR for computation over Arrow data that could be
>> adopted across implementations?
>>
>> In this case DataFusion could easily incorporate Gandiva to generate
>> optimized compute kernels if they were using the same IR to describe
>> computation.  Applications built on Arrow could "describe" computation in
>> any language and take advantage or innovations across the community, adding
>> this to Arrow's zero copy data sharing could be a game changer in my mind.
>> I'm not someone who knows enough to drive this forward but I obviously
>> would like to get involved.  For some time I was playing around with using
>> TVM's relay IR [1] and applying it to Arrow data.
>>
>> As the Arrow memory format has now matured I fell like this could be the
>> next step.  Is there any plan for this kind of work or are we going to
>> allow sub-projects to "go their own way"?
>>
>> Thanks,
>> Paddy
>>
>> [1] - Introduction to Relay IR - tvm 0.8.dev0 documentation (apache.org)<
>> https://tvm.apache.org/docs/dev/relay_intro.html>
>>
>>

Re: [DISCUSS] How to describe computation on Arrow data?

Reply via email to