Re: [DISCUSS] How to describe computation on Arrow data?

Jorge Cardoso Leitão Thu, 18 Mar 2021 11:19:20 -0700

Hi,

The main benefit I see in a standard for queries would lie not in a
serialization format, but in its semantics.


IMO one of the main reasons for the lack of a query standard at the
protobuf level is that human-readability vastly outweighs serialization:
queries are at the very most a megabyte in size, and SQL (or JSON or TOML)
is often enough, and much easier to read, share, learn, etc.

With that said, IMO a core challenge of SQL today is that every engine
implements small semantic variations, which often derive from a myriad of
tradeoffs that the different engines have to make, often as a consequence
of how each of them represents data in memory.

Because Arrow imposes a fixed in-memory format, those tradeoffs are more
likely to be similar across implementations, and thus semantic parity is
easier to spec. (The counter-argument is that different engines have
different goals / use-cases, and these use-cases induce different querying
patterns and thus different tradeoffs.)

The use-case I see here would be to align on a SQL dialect and require
implementations to return the same *result* over a given IPC file;
i.e.

SQL statement + IPC files = IPC file

would hold true for every implementation.
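
As a rough illustration of the equation above, here is a minimal sketch of
what one conformance case could look like. It is only a sketch under loud
assumptions: plain row tuples stand in for Arrow IPC files, Python's built-in
sqlite3 stands in for "an implementation", and the names CASE, conforms, and
run_float_division are hypothetical. The expected result assumes the agreed
dialect would pick truncating integer division, which is exactly the kind of
small semantic choice a spec would need to pin down.

```python
import sqlite3
from collections import Counter

# Hypothetical conformance case: a query, its input table, and the result
# that the (assumed) dialect spec says every engine must return.
CASE = {
    "sql": "SELECT a / b AS q FROM t",
    "columns": ["a", "b"],
    "rows": [(7, 2), (9, 3)],
    "expected": [(3,), (3,)],  # assumes truncating integer division
}


def run_sqlite(sql, columns, rows):
    """Stand-in engine: load `rows` into a table named `t` and run `sql`."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute(f"CREATE TABLE t ({', '.join(columns)})")
        marks = ", ".join("?" for _ in columns)
        con.executemany(f"INSERT INTO t VALUES ({marks})", rows)
        return con.execute(sql).fetchall()
    finally:
        con.close()


def run_float_division(sql, columns, rows):
    """Hypothetical engine whose `/` always yields floats. It ignores `sql`
    and hard-codes the one query in CASE, purely for illustration."""
    return [(a / b,) for a, b in rows]


def conforms(engine, case):
    """True if the engine reproduces the expected rows.

    Rows are compared as an unordered multiset, and via repr() so that the
    integer 3 and the float 3.0 count as different results.
    """
    actual = engine(case["sql"], case["columns"], case["rows"])
    return Counter(map(repr, actual)) == Counter(map(repr, case["expected"]))


if __name__ == "__main__":
    print("sqlite stand-in conforms:", conforms(run_sqlite, CASE))
    print("float-division engine conforms:", conforms(run_float_division, CASE))
```

With a shared suite of such cases, "semantic parity" becomes something each
implementation can test for rather than something argued about per engine.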

Best,
Jorge


On Thu, Mar 18, 2021 at 5:22 PM Andrew Lamb <[email protected]> wrote:

> Any higher level physical execution plan most likely needs a way to
> represent expressions. Thus focusing initially on a standard for
> expressions might be a good way to add value but keep the scope of the
> effort reasonable
>
> On Thu, Mar 18, 2021 at 11:49 AM Micah Kornfield <[email protected]>
> wrote:
>
> > I think there might be discussion on two levels of computation, physical
> > query execution plans, and potentially something "lower level"?  When this
> > has come up in the past, I was a little skeptical of constraining every
> > SDK to use the same description, so I agree with Wes's point about keeping
> > any spec open in the short term.  Ballista, as an opt-in model, does sound
> > like possibly the right approach.
> >
> > I might be misunderstanding, but I think Weld [1] is another project
> > targeting the lower level components?
> >
> > Also, I think there was a little bit of effort to come up with a common
> > expression representation within C++, but it got stalled on whether to
> > use the Gandiva expression descriptions or Flatbuffers. I can't seem to
> > find the thread/JIRA/discussion on this.  I'll try to look some more this
> > evening.
> >
> > [1] https://github.com/weld-project/weld
> >
> > On Thu, Mar 18, 2021 at 7:53 AM Jed Brown <[email protected]> wrote:
> >
> > > I'm interested in providing some path to make this extensible. To pick
> > > an example, suppose the user wants to compute the first k principal
> > > components. We've talked [1] about the possibility of incorporating
> > > richer communication semantics in Ballista (a la MPI sub-communicators)
> > > and numerical algorithms such as PCA would benefit. Those specific
> > > algorithms wouldn't belong in Arrow or Ballista core, but I think
> > > there's an opportunity for plugins to offer this sort of capability and
> > > it would be lovely if the language-independent protocol could call them.
> > > Do you see a good way to do this via ballista.proto?
> > >
> > > [1] https://github.com/ballista-compute/ballista/issues/303
> > >
> > > Andy Grove <[email protected]> writes:
> > >
> > > > Hi Paddy,
> > > >
> > > > Thanks for raising this.
> > > >
> > > > Ballista defines computations using protobuf [1] to describe logical
> > > > and physical query plans, which consist of operators and expressions.
> > > > It is actually based on the Gandiva protobuf [2] for describing
> > > > expressions.
> > > >
> > > > I see a lot of value in standardizing some of this across
> > > > implementations. Ballista is essentially becoming a distributed
> > > > scheduler for Arrow and can work with any implementation that supports
> > > > this protobuf definition of query plans.
> > > >
> > > > It would also make it easier to embed C++ in Rust, or Rust in C++,
> > > > having this common IR, so I would be all for having something like
> > > > this as an Arrow specification.
> > > >
> > > > Thanks,
> > > >
> > > > Andy.
> > > >
> > > > [1] https://github.com/ballista-compute/ballista/blob/main/rust/core/proto/ballista.proto
> > > > [2] https://github.com/apache/arrow/blob/master/cpp/src/gandiva/proto/Types.proto
> > > >
> > > >
> > > > On Thu, Mar 18, 2021 at 7:40 AM paddy horan <[email protected]> wrote:
> > > >
> > > >> Hi All,
> > > >>
> > > >> I do not have a computer science background so I may not be asking
> > > >> this in the correct way or using the correct terminology but I wonder
> > > >> if we can achieve some level of standardization when describing
> > > >> computation over Arrow data.
> > > >>
> > > >> At the moment on the Rust side DataFusion clearly has a way to describe
> > > >> computation; I believe that Ballista adds the ability to serialize this
> > > >> to allow distributed computation.  On the C++ side work is starting on
> > > >> a similar query engine and we already have Gandiva.  Is there an
> > > >> opportunity to define a kind of IR for computation over Arrow data that
> > > >> could be adopted across implementations?
> > > >>
> > > >> In this case DataFusion could easily incorporate Gandiva to generate
> > > >> optimized compute kernels if they were using the same IR to describe
> > > >> computation.  Applications built on Arrow could "describe" computation
> > > >> in any language and take advantage of innovations across the community;
> > > >> adding this to Arrow's zero copy data sharing could be a game changer
> > > >> in my mind. I'm not someone who knows enough to drive this forward but
> > > >> I obviously would like to get involved.  For some time I was playing
> > > >> around with using TVM's relay IR [1] and applying it to Arrow data.
> > > >>
> > > >> As the Arrow memory format has now matured I feel like this could be
> > > >> the next step.  Is there any plan for this kind of work or are we going
> > > >> to allow sub-projects to "go their own way"?
> > > >>
> > > >> Thanks,
> > > >> Paddy
> > > >>
> > > >> [1] Introduction to Relay IR - tvm 0.8.dev0 documentation (apache.org)
> > > >> <https://tvm.apache.org/docs/dev/relay_intro.html>
> > > >>
> > > >>
> > >
> >
>
