Hey all,

Just wanted to give an update on the effort here.
Ben Kietzman has created an alternative proposal to the initial design [1].
It largely overlaps with the original, but differs in a few important ways:

* A big focus of the design is on flexibility, allowing producers, consumers,
  and ultimately end users of those systems the ability to define custom
  operations in the graph.
* There are very few predefined relational operations (project, filter, join,
  and a handful of others).
* There are only 3 types of value expressions: literals, field references,
  and function calls.
* The model of evaluation requires a final sink node, to indicate where the
  record batches from child relations end up (a file, a socket, an in-memory
  buffer, etc.).

I've added notes [2] to the original Google doc (under the Alternative Design
Notes subheading), along with a few pseudocode examples. Unfortunately, these
went out of date as soon as Ben pushed the PR [3], so I need to update them to
reflect his changes. Regardless, the design is broadly the same, so the notes
should still give a good indication of the details.

There are a decent number of review comments on the original PR that I plan to
port over where they are still relevant. I also plan on adding support for
window functions either tonight or on Monday.

Please review this design at your earliest convenience. Since there's a fairly
concrete set of types in flatbuffers that we can look at, ideally we can
center discussion around the details in the PR.

Thanks!

[1]: https://github.com/apache/arrow/pull/10856
[2]: https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.4tfbbtaqzu13
[3]: https://github.com/apache/arrow/pull/10934

On Thu, Aug 12, 2021 at 3:55 PM Julian Hyde <jhyde.apa...@gmail.com> wrote:
>
> Wes wrote:
>
> > Supporting this kind of intra-application engine
> > heterogeneity is one of the motivations for the project.
>
> +1
>
> The data format is the natural interface between tasks.
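To make the shape of the alternative design concrete, here is a rough Python sketch of the three value-expression kinds plus the required sink node. The class and field names are illustrative only, not the actual flatbuffers types in Ben's PR:

```python
from dataclasses import dataclass
from typing import Any

# The three value-expression kinds in the alternative design.
@dataclass
class Literal:
    value: Any

@dataclass
class FieldRef:
    name: str

@dataclass
class Call:
    function: str    # e.g. "add", "sum", "greater"
    arguments: list  # of Literal | FieldRef | Call

# A tiny set of relations plus the required sink node.
@dataclass
class Read:
    table: str

@dataclass
class Filter:
    input: Any        # child relation
    predicate: Call

@dataclass
class Sink:
    input: Any        # child relation
    destination: str  # a file, a socket, an in-memory buffer, ...

# Toy plan: scan table t, keep rows where x > 2, write results to a buffer.
plan = Sink(
    input=Filter(
        input=Read("t"),
        predicate=Call("greater", [FieldRef("x"), Literal(2)]),
    ),
    destination="buffer://result",
)
```

The point of the sink is that every plan states explicitly where its record batches end up, so a consumer never has to guess what to do with the root relation's output.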
> (Defining “task” here as “something that is programmed using the IR”.)
> That is Arrow’s strength.
>
> So I think the IR should describe what each task should do, and tasks
> should be fairly small. Not whole relational operators, operating on whole
> tables, but pieces of relational operators, operating on batches or
> sequences of batches.
>
> Elsethread, someone mentioned the LoLePop concept and the Kohn/Leis/Neumann
> paper [1]. The LoLePop concept sounds good for our purposes.
>
> Julian
>
> [1] https://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf
>
> > On Aug 12, 2021, at 5:19 AM, Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > On Wed, Aug 11, 2021 at 11:22 PM Phillip Cloud <cpcl...@gmail.com> wrote:
> >>
> >> On Wed, Aug 11, 2021 at 4:48 PM Jorge Cardoso Leitão <
> >> jorgecarlei...@gmail.com> wrote:
> >>
> >>> Couple of questions:
> >>>
> >>> 1. Is the goal that IRs have equal semantics, i.e. given (IR, data),
> >>> the operation "(IR, data) - engine -> result" MUST be the same for
> >>> all "engine"?
> >>
> >> I think that might be a non-starter for mundane reasons: there are
> >> probably at least two engines that disagree on the result of sum(x)
> >> where x is a floating-point column.
> >>
> >>> 2. If yes, imo we may need to worry about:
> >>> * a definition of equality that implementations agree on.
> >>> * agreement over what the semantics look like. For example, do we
> >>> use Kleene logic for AND and OR?
> >>
> >> WRT Kleene logic, my thoughts are that the IR should support both
> >> Kleene and non-Kleene, and producers can choose their desired
> >> semantics.
> >>
> >> Ibis, for example, would override the `&` operator in `a & b` to
> >> produce `KleeneAnd(Column(a), Column(b))`.
> >>
> >>> To try to understand the gist, let's pick an aggregated count over a
> >>> column: engines often do partial counts over partitions followed by
> >>> a final "sum" over the partial counts.
> >>> Is the idea that the query engine would communicate with the compute
> >>> engine via 2 IRs, where one is "count me these" and the other is
> >>> "sum me these"?
> >>>
> >>> Best,
> >>> Jorge
> >>
> >> Not in its current incarnation.
> >>
> >> The idea is that the IR producer communicates a desire to count(x) to
> >> a consumer, and it's up to the consumer to figure out how to turn that
> >> count into something that makes sense for itself. In your example
> >> that's a series of partial counts followed by a sum.
> >
> > That said, I think there is a valid use case here where a system might
> > make use of different engines to execute different (composable) layers
> > of a complex query.
> >
> > For example:
> >
> > * suppose you want to scan and do predicate pushdown on an unusual
> > data source that is only accessible from one particular engine, but
> > * you need to do some analytical operation with the scan results that
> > is only supported by another engine
> >
> > You could decompose the query into two stages, with an IR relational
> > expression for each stage, and then use the engines together to
> > accomplish what you need (of course, you would need an orchestration
> > layer to handle plumbing the query engine inputs and outputs together
> > as Arrow streams). Supporting this kind of intra-application engine
> > heterogeneity is one of the motivations for the project.
> >
> >>> On Wed, Aug 11, 2021 at 6:10 PM Phillip Cloud <cpcl...@gmail.com> wrote:
> >>>
> >>>> Thanks Wes,
> >>>>
> >>>> Great to be back working on Arrow again and engaging with the
> >>>> community. I am really excited about this effort.
> >>>>
> >>>> I think there are a number of concerns I see as important to address
> >>>> in the compute IR proposal:
> >>>>
> >>>> 1. Requirement for output types.
> >>>>
> >>>> I think that so far there have been many reasons for requiring
> >>>> conforming IR producers and consumers to adhere to output types, but
> >>>> I haven't seen a strong rationale for keeping them optional (in the
> >>>> semantic sense, not WRT any particular serialization format's
> >>>> representation of optionality).
> >>>>
> >>>> I think a design that includes unambiguous semantics for output
> >>>> types (a consumer must produce a value of the requested type or it's
> >>>> an error/non-conforming) is simpler to reason about for producers,
> >>>> and provides a strong guarantee for end users (humans or machines
> >>>> constructing IR from an API and expecting the thing they ask for
> >>>> back from the IR consumer).
> >>>>
> >>>> 2. Flexibility
> >>>>
> >>>> The current PR is unable to support what I think are killer features
> >>>> of the IR: custom operators (relational or column) and UDFs. In my
> >>>> mind, on top of the generalized compute description that the IR
> >>>> offers, the ability for producers and consumers of IR to extend the
> >>>> IR without needing to modify Arrow or depend on anything except the
> >>>> format is itself something that is necessary to gain adoption.
> >>>>
> >>>> Developers will need to build custom relational operators (e.g.,
> >>>> scans of backends that don't exist anywhere, for which a user has
> >>>> code to implement) and custom functions (anything operating on a
> >>>> column that doesn't already exist, really). Furthermore, I think we
> >>>> can actually drive building an Arrow consumer using the same API
> >>>> that an end user would use to extend the IR.
> >>>>
> >>>> 3. Window Functions
> >>>>
> >>>> Window functions are, I think, an important part of the IR value
> >>>> proposition, as they are one of the more complex operators in
> >>>> databases.
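To sketch what that support could look like, here is one hypothetical encoding of a window function as a call that carries partitioning, ordering, and frame metadata. These names are purely illustrative, not the schema I plan to propose:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FieldRef:
    name: str

@dataclass
class Frame:
    preceding: Optional[int]  # None = UNBOUNDED PRECEDING
    following: Optional[int]  # None = UNBOUNDED FOLLOWING

@dataclass
class WindowCall:
    function: str             # e.g. "row_number", "sum", "lag"
    arguments: List[FieldRef]
    partition_by: List[FieldRef]
    order_by: List[FieldRef]
    frame: Frame

# row_number() OVER (PARTITION BY k ORDER BY t)
expr = WindowCall(
    function="row_number",
    arguments=[],
    partition_by=[FieldRef("k")],
    order_by=[FieldRef("t")],
    frame=Frame(preceding=None, following=0),
)
```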
> >>>> I think we need to have something in the initial IR proposal to
> >>>> support these operations.
> >>>>
> >>>> 4. Non-relational Joins
> >>>>
> >>>> Things like as-of join and window join operators aren't yet fleshed
> >>>> out in the IR, and I'm not sure they should be in scope for the
> >>>> initial prototype. I think once we settle on a design, we can work
> >>>> out the design of these particular operators during the initial
> >>>> prototype. I think the specification of these operators should
> >>>> basically be PR #2 after the initial design lands.
> >>>>
> >>>> # Order of Work
> >>>>
> >>>> 1. Nail down the design. Anything else is a non-starter.
> >>>>
> >>>> 2. Prototype an IR producer using Ibis
> >>>>
> >>>> Ibis is IMO a good candidate for a first IR producer, as it has a
> >>>> number of desirable properties that make prototyping faster and
> >>>> allow us to refine the design of the IR as needed based on how the
> >>>> implementation goes:
> >>>> * It's written in Python, so it has native support for nearly all of
> >>>> flatbuffers' features without having to create bindings to C++.
> >>>> * There's already a set of rules for type checking, as well as APIs
> >>>> for constructing expression trees, which means we don't need to
> >>>> worry about building a type checker for the prototype.
> >>>>
> >>>> 3. Prototype an IR consumer in C++
> >>>>
> >>>> I think in parallel to the producer prototype we can further inform
> >>>> the design from the consumer side by prototyping an IR consumer in
> >>>> C++. I know Ben Kietzman has expressed interest in working on this.
> >>>>
> >>>> Very interested to hear others' thoughts.
> >>>>
> >>>> -Phillip
> >>>>
> >>>> On Tue, Aug 10, 2021 at 10:56 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >>>>
> >>>>> Thank you for all the feedback and comments on the document.
> >>>>> I'm on vacation this week, so I'm delayed responding to everything,
> >>>>> but I will get to it as quickly as I can. I will be at VLDB in
> >>>>> Copenhagen next week if anyone would like to chat in person about
> >>>>> it, and we can relay the content of any discussions back to the
> >>>>> document/PR/e-mail thread.
> >>>>>
> >>>>> I know that Phillip Cloud expressed interest in working on the PR
> >>>>> and helping work through many of the details, so I'm glad to have
> >>>>> the help. If there are others who would like to work on the PR or
> >>>>> dig into the details, please let me know. We might need to figure
> >>>>> out how to accommodate "many cooks" by setting up the ComputeIR
> >>>>> project somewhere separate from the format/ directory to permit it
> >>>>> to exist in a Work-In-Progress status for a period of time until we
> >>>>> work through the various details and design concerns.
> >>>>>
> >>>>> Re Julian's comment:
> >>>>>
> >>>>>> The biggest surprise is that this language does full relational
> >>>>>> operations. I was expecting that it would do fragments of the
> >>>>>> operations.
> >>>>>
> >>>>> There's a related but different (yet still interesting and worthy
> >>>>> of analysis) problem of creating an "engine language" that
> >>>>> describes more mechanically the constituent parts of implementing
> >>>>> the relational operators. To create a functional computation
> >>>>> language with concrete Arrow data structures as a top-level
> >>>>> primitive sounds like an interesting research area where I could
> >>>>> see something developing eventually.
> >>>>>
> >>>>> The main problem I'm interested in solving right now is enabling
> >>>>> front ends that have sufficient understanding of relational algebra
> >>>>> and data frame operations to talk to engines without having to go
> >>>>> backwards from their logical query plans to SQL.
> >>>>> So as mentioned in the document, being able to faithfully carry
> >>>>> the relational operator node information generated by Calcite or
> >>>>> Ibis or another system would be super useful. Defining the
> >>>>> semantics of various kinds of user-defined functions would also be
> >>>>> helpful to standardize the engine-to-user-language UDF/extension
> >>>>> interface.
> >>>>>
> >>>>> On Tue, Aug 10, 2021 at 2:36 PM Dimitri Vorona <alen...@gmail.com> wrote:
> >>>>>>
> >>>>>> Hi Wes,
> >>>>>>
> >>>>>> cool initiative! Reminded me of "Building Advanced SQL Analytics
> >>>>>> From Low-Level Plan Operators" from SIGMOD 2021
> >>>>>> (http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf), which
> >>>>>> proposes a set of building blocks for advanced aggregation.
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Dimitri.
> >>>>>>
> >>>>>> On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <jhyde.apa...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Wes,
> >>>>>>>
> >>>>>>> Thanks for this. I’ve added comments to the doc and to the PR.
> >>>>>>>
> >>>>>>> The biggest surprise is that this language does full relational
> >>>>>>> operations. I was expecting that it would do fragments of the
> >>>>>>> operations. Consider join. A distributed hybrid hash join needs
> >>>>>>> to partition rows into output buffers based on a hash key, build
> >>>>>>> hash tables, probe into hash tables, scan hash tables for
> >>>>>>> untouched “outer” rows, and so forth.
> >>>>>>>
> >>>>>>> I see Arrow’s compute as delivering each of those operations,
> >>>>>>> working on perhaps a batch at a time, or a sequence of batches,
> >>>>>>> with some other system coordinating those tasks. So I would
> >>>>>>> expect to see Arrow’s compute language mainly operating on
> >>>>>>> batches rather than a table abstraction.
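A minimal sketch of that batch-at-a-time model, with plain Python lists standing in for Arrow record batches (illustrative only, not an actual Arrow API): each "piece" of a relational operator is a small function over a batch or a stream of batches, and some outer system wires the pieces together.

```python
from typing import Iterable, Iterator, List

Batch = List[int]  # stand-in for an Arrow record batch

def filter_batches(batches: Iterable[Batch], threshold: int) -> Iterator[Batch]:
    """One piece of a filter operator: process one batch at a time."""
    for batch in batches:
        yield [x for x in batch if x > threshold]

def partition_by_hash(batch: Batch, n_buckets: int) -> List[Batch]:
    """One piece of a hash join: partition a single batch by hash key."""
    buckets: List[Batch] = [[] for _ in range(n_buckets)]
    for x in batch:
        buckets[hash(x) % n_buckets].append(x)
    return buckets

stream = [[1, 5, 3], [7, 2]]
filtered = list(filter_batches(stream, 2))      # [[5, 3], [7]]
buckets = partition_by_hash([0, 1, 2, 3], 2)    # [[0, 2], [1, 3]]
```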
> >>>>>>> Julian
> >>>>>>>
> >>>>>>>> On Aug 2, 2021, at 5:16 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> hi folks,
> >>>>>>>>
> >>>>>>>> This idea came up in passing in the past -- given that there are
> >>>>>>>> multiple independent efforts to develop Arrow-native query
> >>>>>>>> engines (and surely many more to come), it seems like it would
> >>>>>>>> be valuable to have a way to enable user languages (like Java,
> >>>>>>>> Python, R, or Rust, for example) to communicate with backend
> >>>>>>>> computing engines (like DataFusion, or new computing
> >>>>>>>> capabilities being built in the Arrow C++ library) in a fashion
> >>>>>>>> that is "lower-level" than SQL and specialized to Arrow's type
> >>>>>>>> system. So rather than leaving it to a SQL parser/analyzer
> >>>>>>>> framework to generate an expression tree of relational
> >>>>>>>> operators and then translate that to an Arrow-native compute
> >>>>>>>> engine's internal grammar, a user framework could provide the
> >>>>>>>> desired Arrow-native expression tree / data manipulations
> >>>>>>>> directly and skip the SQL altogether.
> >>>>>>>>
> >>>>>>>> The idea of creating a "serialized intermediate representation
> >>>>>>>> (IR)" for Arrow compute operations would be to serve use cases
> >>>>>>>> large and small -- from the most complex TPC-* or time series
> >>>>>>>> database query to the most simple array predicate/filter sent
> >>>>>>>> with an RPC request using Arrow Flight. It is deliberately
> >>>>>>>> language- and engine-agnostic, with the only commonality across
> >>>>>>>> usages being the Arrow columnar format (schemas and array
> >>>>>>>> types). This would be better than leaving it to each
> >>>>>>>> application to develop its own bespoke expression
> >>>>>>>> representations for its needs.
> >>>>>>>> I spent a while thinking about this and wrote up a brain dump
> >>>>>>>> RFC document [1] and accompanying pull request [2] that makes
> >>>>>>>> the possibly controversial choice of using Flatbuffers to
> >>>>>>>> represent the serialized IR. I discuss the rationale for the
> >>>>>>>> choice of Flatbuffers in the RFC document. This PR is obviously
> >>>>>>>> deficient in many regards (incomplete, hacky, or unclear in
> >>>>>>>> places), and will need some help from others to flesh out. I
> >>>>>>>> suspect that actually implementing the IR will be necessary to
> >>>>>>>> work out many of the low-level details.
> >>>>>>>>
> >>>>>>>> Note that this IR is intended to be more of a "superset"
> >>>>>>>> project than a "lowest common denominator". So there may be
> >>>>>>>> things that it includes which are only available in some
> >>>>>>>> engines (e.g. engines that have specialized handling of time
> >>>>>>>> series data).
> >>>>>>>>
> >>>>>>>> As some of my personal motivation for the project, concurrent
> >>>>>>>> with the genesis of Apache Arrow, I started a Python project
> >>>>>>>> called Ibis [3] (which is similar to R's dplyr project) which
> >>>>>>>> serves as a "Python analytical query IR builder" capable of
> >>>>>>>> generating most of the SQL standard, targeting many different
> >>>>>>>> SQL dialects and other backends (like pandas). Microsoft ([4])
> >>>>>>>> and Google ([5]) have used this library as a "many-SQL"
> >>>>>>>> middleware. As such, I would like to be able to translate
> >>>>>>>> between the in-memory "logical query" data structures in a
> >>>>>>>> library like Ibis to a serialized format that can be executed
> >>>>>>>> by many different Arrow-native query engines. The expression
> >>>>>>>> primitives available in Ibis should serve as helpful test
> >>>>>>>> cases, too.
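As a toy illustration of that producer-side translation step, here is a hedged sketch of lowering an in-memory expression tree to a flat, serializable form. This is not the Ibis API or the flatbuffers schema; the names and the dict "wire" format are hypothetical:

```python
from dataclasses import dataclass
from typing import Any, List

# Hypothetical in-memory expression nodes, as a producer like Ibis might hold.
@dataclass
class FieldRef:
    name: str

@dataclass
class Call:
    function: str
    arguments: List[Any]

def to_wire(expr: Any) -> dict:
    """Lower an expression tree to a plain-dict stand-in for the serialized IR."""
    if isinstance(expr, FieldRef):
        return {"kind": "field_ref", "name": expr.name}
    if isinstance(expr, Call):
        return {"kind": "call",
                "function": expr.function,
                "arguments": [to_wire(a) for a in expr.arguments]}
    # Anything else is treated as a literal value.
    return {"kind": "literal", "value": expr}

# sum(x) expressed as a function call over a field reference
wire = to_wire(Call("sum", [FieldRef("x")]))
```

A real producer would emit flatbuffers rather than dicts, but the recursive lowering step would have the same shape.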
> >>>>>>>> I look forward to the community's comments on the RFC document
> >>>>>>>> [1] and pull request [2] -- I realize that arriving at
> >>>>>>>> consensus on a complex and ambitious project like this can be
> >>>>>>>> challenging, so I recommend spending time on the "non-goals"
> >>>>>>>> section in the RFC and asking questions if you are unclear
> >>>>>>>> about the scope of what problems this is trying to solve. I
> >>>>>>>> would be happy to give Edit access on the RFC document to
> >>>>>>>> others and would consider ideas about how to move forward with
> >>>>>>>> something that is able to be implemented by different Arrow
> >>>>>>>> libraries in the reasonably near future.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Wes
> >>>>>>>>
> >>>>>>>> [1]: https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> >>>>>>>> [2]: https://github.com/apache/arrow/pull/10856
> >>>>>>>> [3]: https://ibis-project.org/
> >>>>>>>> [4]: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> >>>>>>>> [5]: https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt