My (incredibly naive) interpretation is that there are three problems to tackle.

1) How do you represent a graph and relational operators (join, union,
groupby, etc.)?
 - The PR appears to be addressing this question fairly well.
2) How does a frontend query a backend to know which UDFs are supported?
 - I don't see anything in the spec for this (some comments touch on
it), but it seems like it would be necessary to build any kind of
system. (See the sketch after this list for the sort of exchange I mean.)
3) Is there some well-defined set of canonical UDFs whose semantics we can
all agree on (e.g. addition, subtraction, etc.)?
 - I thought, from earlier comments in this email thread, that the
goal was to avoid addressing this, although I think there is strong
value here as well.
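
To make #2 concrete, here is a rough sketch of the kind of capability
exchange I have in mind. It is plain Python with made-up names (nothing
here comes from the PR); the point is only that a producer can ask a
consumer which functions it understands and refuse to emit IR for
anything else:

# Hypothetical sketch: a producer discovering a consumer's supported UDFs.
# In a real system supported_functions() would presumably be an RPC (e.g. a
# Flight action); here it is just an in-process call.
class Backend:
    def __init__(self, signatures):
        # signatures: a set of (function name, arity) pairs
        self._signatures = set(signatures)

    def supported_functions(self):
        return set(self._signatures)


class Frontend:
    def __init__(self, backend):
        self._supported = backend.supported_functions()

    def check_call(self, name, arity):
        # Refuse to emit IR for functions the backend never claimed to support.
        if (name, arity) not in self._supported:
            raise NotImplementedError(f"backend cannot run {name}/{arity}")


frontend = Frontend(Backend({("add", 2), ("sum", 1)}))
frontend.check_call("add", 2)     # fine
frontend.check_call("my_udf", 1)  # raises NotImplementedError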

So what is the scope of this initiative?  If it is just #1, for example,
then I don't see any need to put types in the IR (and I've commented
as such in the PR).  From a relational perspective, isn't a UDF just a
black box: Table -> UDF -> Table?
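
To illustrate the black-box view, here is a minimal sketch (again with
hypothetical names, not the PR's flatbuffers): the only things the IR needs
to record about a user-defined relational operation are its input and what
to run, not the types it produces.

# Hypothetical sketch: a UDF as an untyped black box in a relational plan.
from dataclasses import dataclass

@dataclass
class Read:
    # Only the leaves (and a final sink) would carry names/schema.
    table_name: str
    column_names: list

@dataclass
class UserDefinedRelation:
    # Table -> UDF -> Table: record *what* to run, not the types it yields.
    input: object        # another relational node, e.g. a Read
    function_name: str   # e.g. "sessionize"
    payload: bytes       # opaque body only the consumer understands

plan = UserDefinedRelation(
    input=Read("events", ["user_id", "ts", "value"]),
    function_name="sessionize",
    payload=b"...",
)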

On Mon, Aug 30, 2021 at 11:10 AM Phillip Cloud <cpcl...@gmail.com> wrote:
>
> Hey everyone,
>
> There's some interesting discussion in the current PR [1] around types and
> where they should live (and in fact whether to store them at all).
>
> It would be great to get some community feedback on this [2] part of the PR
> in particular, because the choice of whether to store types at all has
> important design consequences.
>
> [1]: https://github.com/apache/arrow/pull/10934
> [2]: https://github.com/apache/arrow/pull/10934/files#r697025313
>
> On Fri, Aug 27, 2021 at 2:11 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > As an FYI, Iceberg is also considering an IR in relation to view support
> > [1].  I chimed in and pointed them to this thread and Wes's doc.  Phillip
> > and Jacques chimed in there as well.
> >
> > [1]
> >
> > https://mail-archives.apache.org/mod_mbox/iceberg-dev/202108.mbox/%3CCAKRVfm6h6WxQtp5fj8Yj8XWR1wFe8VohOkPuoZZGK-UHPhtwjQ%40mail.gmail.com%3E
> >
> > On Thu, Aug 26, 2021 at 12:40 PM Phillip Cloud <cpcl...@gmail.com> wrote:
> >
> > > Thanks for the feedback Jacques, very helpful. In the latest version of
> > the
> > > PR, I've tried to incorporate nearly all of these points.
> > >
> > > - I've incorporated most of what you had for dereferencing operations
> > into
> > > the PR, and gotten rid of schemas except on Read/Write relations.
> > > - With respect to properties, I've made a bunch more specific operators,
> > > and kept user-defined things special case-y.
> > > - I haven't incorporated anything close to physical-plan things, but I
> > > think that's a good follow up PR. Having separate representations for
> > > logical/physical plans seems like a waste of effort ultimately. I think
> > we
> > > can find a good balance.
> > > - Agree on UDF support, I think that will have to evolve as the rest of
> > the
> > > spec evolves. UDFs will need language-dedicated effort given the large
> > > variety of languages that folks will want to use to define functions.
> > >
> > > On a separate note, in an effort to move this project forward I'd like to
> > > do one final round of code review from anyone interested and then merge
> > the
> > > PR after that.
> > > This spec will be unstable for a while, until we can work out all the
> > > design kinks and edge cases, and I think getting this initial spec in is
> > > important to start that process.
> > >
> > >
> > > On Mon, Aug 23, 2021 at 1:53 PM Jacques Nadeau <jacq...@apache.org>
> > wrote:
> > >
> > > > In a lucky turn of events, Phillip actually turned out to be in my neck
> > > of
> > > > the woods on Friday so we had a chance to sit down and discuss this. To
> > > > help, I actually shared something I had been working on a few months
> > ago
> > > > independently (before this discussion started).
> > > >
> > > > For reference:
> > > > Wes PR: https://github.com/apache/arrow/pull/10856
> > > > Ben PR: https://github.com/apache/arrow/pull/10934
> > > > Jacques PR: https://github.com/apache/arrow/pull/10979
> > > >
> > > > The high level points of feedback I have are:
> > > >
> > > >    - Ben PR feels too deconstructed. While I like the elegance and
> > > >    symmetry, I believe this will lead to substantially more work in
> > > moving
> > > >    from serialization format to something closer to what a system would
> > > > want
> > > >    to manipulate/consume. The reality is that there are a lot of
> > > >    well-known things, and specializing the representation for them will
> > > >    ultimately make things easier to program against without error and
> > > >    easier to
> > debug.
> > > > (For
> > > >    example, imagine trying to inspect a plan in a debugging session
> > with
> > > > the
> > > >    Ben representation.) We should embrace the known things in the
> > > >    specification.
> > > >    - I believe that it is a mistake for the inner workings of the plan
> > to
> > > >    ever use field names. Only input (e.g. file read) and Output (e.g.
> > > > return
> > > >    to user or write to file) need to have field names. For the rest of
> > > the
> > > >    system, using field ordinals (deterministic whether nested or flat) is
> > > > much
> > > >    cleaner and is how most execution systems work. For example, in
> > > Impala I
> > > >    believe it is called a slot. As I noted in the PR, Calcite as an
> > > > example is
> > > >    entirely ordinal based at the algebra level. Rowtypes contain field
> > > > names
> > > >    but they are basically pointless. Field references use RexInputRef,
> > > >    which is ordinal based, and the rules around column output order
> > > >    (e.g. what is the field order of a join output) are deterministic
> > > >    and applied entirely at an ordinal level. The only place where names
> > > >    should be used (besides
> > > >    input/output) is in the case of map keys. In that case, the names
> > are
> > > >    actually data, as opposed to schema metadata. This is why I propose
> > a
> > > >    strongly structured dereference operation [1].
> > > >    - Properties should only be included in the serialization if they
> > are
> > > >    not easily re-derivable at plan consumption time. For example,
> > you'll
> > > > note
> > > >    that I don't store schema information for relational operations. Each
> > > >    function and relational operation should already know how a given
> > > input
> > > > is
> > > >    transformed to a given output. Capturing this information in the
> > > > plan/IR is
> > > >    excessive. In many ways, I compare it to the early use of
> > VectorLayout
> > > > [2]
> > > >    in Arrow schema. It may have provided some additional checksum of
> > the
> > > >    message but ultimately it caused more pain than it was worth (and
> > thus
> > > > we
> > > >    removed it before formalizing the specification). For reference, in
> > > the
> > > >    context of Dremio, we used to actually do this, send schema
> > > information
> > > >    around for all operations. We removed it because in many cases it was
> > > becoming
> > > > the
> > > >    majority of our internal plan serialization (imagine simple
> > operations
> > > > that
> > > >    are carrying 1000s of fields).
> > > >    - I suggest focusing on support for both logical and physical
> > > >    representations. The moment you start talking about optimization
> > > passes,
> > > >    many of those would probably be better done at the logical
> > > level.
> > > > The
> > > >    overlap is really high.
> > > >    - I think a lot more work must be done before introducing UDFs and
> > > user
> > > >    defined relational operations. I see one goal being the possibility
> > of
> > > >    there being three systems: A -> B -> C. A is an IR producer. C is an
> > IR
> > > >    consumer and B is an IR filter or translator. In this situation, B
> > > > should be
> > > >    able to operate and do optimizations on a plan even if there are
> > > > black
> > > >    box user defined operations. Being able to know whether these
> > > >    operations preserve properties is important to making decisions. For
> > > > example,
> > > >    does a user defined relational operation maintain sortedness?
> > > > Distribution?
> > > >    Is a defined UDF idempotent? As such, I think the definition of
> > those
> > > > black
> > > >    boxes should be much more structured. For example: it is a Python
> > > >    relational operation named X stored in Y that maintains properties
> > 1,2
> > > > and
> > > >    disrupts property 3. Putting just a black box of bytes will
> > > > substantially
> > > >    reduce the compatibility and extensibility of the ecosystem of tools
> > > >    working against IR. I'd note that I wouldn't expect this to be a
> > > burden
> > > > to
> > > >    actual end users. By using sensible defaults, I still would expect
> > an
> > > > end
> > > >    user tool to support arbitrary user defined operations.
> > > >    - It might make sense to review the XML representation that Orca
> > uses
> > > >    [3]. I haven't looked at it recently but they had a strong goal of
> > > >    decoupling for most of its life (to use in both Greenplum and Hawq).
> > > It
> > > >    could be the most mature/formal serialization of query plans
> > > publicly
> > > >    available.
> > > >
> > > >
> > > > [1]
> > > >
> > > >
> > >
> > https://github.com/apache/arrow/pull/10979/files#diff-e40fbc40cf7a131efd2cb098444931774cfad046b8665b38452258ffaa2e3423R34
> > > > [2]
> > > >
> > > >
> > >
> > https://github.com/apache/arrow/commit/611a4b951e24f4f967c3d382a2027dc035fc37f0
> > > > [3] https://github.com/greenplum-db/gporca
> > > >
> > > >
> > > > On Tue, Aug 17, 2021 at 11:14 AM Phillip Cloud <cpcl...@gmail.com>
> > > wrote:
> > > >
> > > > > On Tue, Aug 17, 2021 at 10:56 AM Wes McKinney <wesmck...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Looking at Ben's alternate PR [1], having an IR that leans heavily
> > on
> > > > > > memory references to an out-of-band data sidecar seems like an
> > > > > > approach that would substantially ratchet up the implementation
> > > > > > complexity as producing the IR would then have the level of
> > > complexity
> > > > > > of producing the Arrow IPC format — when producing the "root" Plan
> > > > > > message, you must accumulate a list of the dependent serialized
> > > > > > submessages, placing the appropriate Buffer memory offset in the
> > Plan
> > > > > > message, like we do when producing the RecordBatch.buffers field.
> > > This
> > > > > > seems complicated to me as you must devise a custom binary protocol
> > > to
> > > > > > concatenate the serialized Plan and the messages it depends on
> > into a
> > > > > > single binary payload
> > > > > >
> > > > > > <ROOT PLAN>
> > > > > > <padding>
> > > > > > <Buffer 0>
> > > > > > <padding>
> > > > > > <Buffer 1>
> > > > > > <padding>
> > > > > > ...
> > > > > > <Buffer N - 1>
> > > > > >
> > > > > > (one purpose of FlatBufferBuilder is to spare you having to do this
> > > > > > yourself — some reasons we do it for the Arrow IPC format is
> > because
> > > > > > appending Arrow memory buffers directly to a FlatBufferBuilder
> > would
> > > > > > be inefficient — internal realloc calls — and Flatbuffers are
> > limited
> > > > > > to 2GB. Neither of these things are problems here)
> > > > > >
> > > > > > In general, I believe the output created by an IR producer should
> > be
> > > a
> > > > > > single serialized object without any out-of-band data sidecar —
> > this
> > > > > > is much simpler for implementers and we can still provide an
> > "escape
> > > > > > hatch" for user-defined operators and functions where the required
> > > > > > function/operator is passed opaquely as embedded binary data.
> > > > >
> > > > >
> > > > >
> > > > > The serialization format (whether it is Flatbuffers or JSON, or
> > > > > > something else) should allow for data memoization, so if there is a
> > > > > > user-defined operator/function, or a relation that is used multiple
> > > > > > times throughout the query (potentially with a large schema), then
> > we
> > > > > > should ensure that the data need not be duplicated in the
> > > > > > serialization format unnecessarily — in Flatbuffers, you can
> > achieve
> > > > > > this by reusing offsets, but we could devise the data structures to
> > > > > > make the memoization of "expensive" objects more explicit.
> > > > > >
> > > > >
> > > > > I think this is something that would need to be explicitly encoded in
> > > > > the structures themselves if it's a hard requirement. I don't think
> > > this
> > > > > should block
> > > > > a prototype producer/consumer.
> > > > >
> > > > > Is there something in the second PR/design that precludes the reuse
> > of
> > > > > offsets?
> > > > > To my eye, the flatbuffers offset reuse mechanism works just as well
> > > > there.
> > > > >
> > > > >
> > > > > > I additionally think that it is important to provide as much
> > built-in
> > > > > > support for "canonical" operators/functions (such as the ones
> > > > > > implemented commonly by SQL engines) as possible, and to liberally
> > > > > > expand the catalogue of "built-in" capabilities. I would still
> > prefer
> > > > > > to have large unions/enums of built-in operators/functions and to
> > > > > > expand those unions/enums to accommodate new things when it is
> > > > > > demonstrated that there is a need to standardize things between
> > > > > > producers/consumers.
> > > > > >
> > > > >
> > > > > I think there's a middle ground where we add a bit of structure
> > > > (something
> > > > > like
> > > > > a descriptor from the first PR) to indicate whether a thing is
> > built-in
> > > > vs
> > > > > user-defined.
> > > > > It looks like Ben has pushed something like this to his PR.
> > > > >
> > > > > With that scheme, we have both flexibility and a small set of special
> > > > > builtins that make up
> > > > > a statically typed set for expressions and relational operators.
> > > > >
> > > > > I would really like to vet this PR with a prototype this week,
> > > > > to see whether we need to revisit any major choices. I don't think
> > > we'll
> > > > be
> > > > > able to
> > > > > anticipate all the consequences until we write some code.
> > > > >
> > > > >
> > > > > >
> > > > > > One of the beneficial properties of the Union/Enum approach for the
> > > > > > operator/function catalogues is that when there are additions to
> > > > > > those enums, the generated Flatbuffers files will cause many
> > language
> > > > > > compilers to warn or error on unhandled enum cases. If all
> > > > > > function/operator names are strings, then you are essentially
> > > > > > reimplementing the functionality provided by enums by hand. I
> > > > > > initially used strings for all function references in my original
> > > > > > prototype, but I now think that using an enum for "built-ins" would
> > > be
> > > > > > superior (because of the code-generated enum interface) and not a
> > > > > > premature optimization.
> > > > > >
> > > > > > [1]: https://github.com/apache/arrow/pull/10934
> > > > > >
> > > > > > On Fri, Aug 13, 2021 at 11:26 PM Phillip Cloud <cpcl...@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > Hey all,
> > > > > > >
> > > > > > > Just wanted to give an update on the effort here.
> > > > > > >
> > > > > > > Ben Kietzman has created an alternative proposal to the initial
> > > > design
> > > > > > [1].
> > > > > > > It largely overlaps with the original, but differs in a few
> > > important
> > > > > > ways:
> > > > > > >
> > > > > > > * A big focus of the design is on flexibility, allowing
> > producers,
> > > > > > > consumers and ultimately end users of those systems the ability
> > to
> > > > > define
> > > > > > > custom operations in the graph.
> > > > > > > * There are very few predefined relational operations (project,
> > > > filter,
> > > > > > > join and a handful of others).
> > > > > > > * There are only 3 types of value expressions: literals, field
> > > > > > references,
> > > > > > > and function calls.
> > > > > > > * The model of evaluation is one that requires a final sink node,
> > > to
> > > > > > > indicate where the record batches from child relations end up (a
> > > > file,
> > > > > a
> > > > > > > socket, an in-memory buffer, etc).
> > > > > > >
> > > > > > > I've added notes [2] to the original Google doc (under the
> > > > Alternative
> > > > > > > Design Notes subheading), and a few pseudocode examples.
> > > > > > > Unfortunately, these went out of date as soon as Ben pushed the
> > PR
> > > > [3],
> > > > > > so
> > > > > > > I need to update those to reflect his changes. Regardless,
> > > > > > > the design is broadly the same, so it should still give a good
> > > > > indication
> > > > > > > of the details of the design.
> > > > > > >
> > > > > > > There are a decent number of review comments on the original PR
> > > that
> > > > I
> > > > > > plan
> > > > > > > to port over where they are still relevant.
> > > > > > > I also plan on adding support for window functions either tonight
> > > or
> > > > on
> > > > > > > Monday.
> > > > > > >
> > > > > > > Please review this design at your earliest convenience. Since
> > > > there's a
> > > > > > > fairly concrete set of types in flatbuffers that
> > > > > > > we can look at, ideally we can center discussion around the
> > details
> > > > in
> > > > > > the
> > > > > > > PR.
> > > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > > > [1]: https://github.com/apache/arrow/pull/10856
> > > > > > > [2]:
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.4tfbbtaqzu13
> > > > > > > [3]: https://github.com/apache/arrow/pull/10934
> > > > > > >
> > > > > > > On Thu, Aug 12, 2021 at 3:55 PM Julian Hyde <
> > > jhyde.apa...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > > Wes wrote:
> > > > > > > > >
> > > > > > > > > Supporting this kind of intra-application engine
> > > > > > > > > heterogeneity is one of the motivations for the project.
> > > > > > > >
> > > > > > > > +1
> > > > > > > >
> > > > > > > > The data format is the natural interface between tasks.
> > (Defining
> > > > > > “task”
> > > > > > > > here as “something that is programmed using the IR”.) That is
> > > > Arrow’s
> > > > > > > > strength.
> > > > > > > >
> > > > > > > > So I think the IR should describe what each task should do, and
> > > > tasks
> > > > > > > > should be fairly small. Not whole relational operators,
> > operating
> > > > on
> > > > > > whole
> > > > > > > > tables, but pieces of relational operators, operating on
> > batches
> > > or
> > > > > > > > sequences of batches.
> > > > > > > >
> > > > > > > > Elsethread, someone mentioned the LoLePop concept and the
> > > > > > Kohn/Leis/Neuman
> > > > > > > > paper [1]. The LoLePop concept sounds good for our purposes.
> > > > > > > >
> > > > > > > > Julian
> > > > > > > >
> > > > > > > > [1] https://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf
> > > > > > > >
> > > > > > > >
> > > > > > > > > On Aug 12, 2021, at 5:19 AM, Wes McKinney <
> > wesmck...@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Aug 11, 2021 at 11:22 PM Phillip Cloud <
> > > > cpcl...@gmail.com>
> > > > > > > > wrote:
> > > > > > > > >>
> > > > > > > > >> On Wed, Aug 11, 2021 at 4:48 PM Jorge Cardoso Leitão <
> > > > > > > > >> jorgecarlei...@gmail.com> wrote:
> > > > > > > > >>
> > > > > > > > >>> Couple of questions
> > > > > > > > >>>
> > > > > > > > >>> 1. Is the goal that IRs have equal semantics, i.e. given
> > > > > > (IR,data), the
> > > > > > > > >>> operation "(IR,data) - engine -> result" MUST be the same
> > for
> > > > all
> > > > > > > > "engine"?
> > > > > > > > >>>
> > > > > > > > >>
> > > > > > > > >> I think that might be a non-starter for mundane reasons:
> > > there's
> > > > > > > > probably
> > > > > > > > >> at least two engines
> > > > > > > > >> that disagree on the result of sum(x) where x is a floating
> > > > point
> > > > > > > > column.
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >>> 2. if yes, imo we may need to worry about:
> > > > > > > > >>> * a definition of equality that implementations agree on.
> > > > > > > > >>> * agreement over what the semantics look like. For example,
> > > do
> > > > we
> > > > > > use
> > > > > > > > >>> kleene logic for AND and OR?
> > > > > > > > >>>
> > > > > > > > >>
> > > > > > > > >> WRT Kleene logic my thoughts are that the IR should support
> > > both
> > > > > > Kleene
> > > > > > > > and
> > > > > > > > >> non-Kleene,
> > > > > > > > >> and producers can choose their desired semantics.
> > > > > > > > >>
> > > > > > > > >> Ibis for example, would override the `&` operator in `a & b`
> > > to
> > > > > > produce
> > > > > > > > >> `KleeneAnd(Column(a), Column(b))`.
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >>>
> > > > > > > > >>> To try to understand the gist, let's pick an aggregated
> > count
> > > > > over
> > > > > > a
> > > > > > > > >>> column: engines often do partial counts over partitions
> > > > followed
> > > > > > by a
> > > > > > > > final
> > > > > > > > >>> "sum" over the partial counts. Is the idea that the query
> > > > engine
> > > > > > would
> > > > > > > > >>> communicate with the compute engine via 2 IRs where one is
> > > > "count
> > > > > > me
> > > > > > > > these"
> > > > > > > > >>> and the other is "sum me these"?
> > > > > > > > >>>
> > > > > > > > >>> Best,
> > > > > > > > >>> Jorge
> > > > > > > > >>>
> > > > > > > > >>
> > > > > > > > >> Not in its current incarnation.
> > > > > > > > >>
> > > > > > > > >> The idea is that the IR producer communicates a desire to
> > > > count(x)
> > > > > > to a
> > > > > > > > >> consumer, and  it's up to the consumer to figure out how to
> > > turn
> > > > > > that
> > > > > > > > count
> > > > > > > > >> into something that makes sense for itself. In your example
> > > > > that's a
> > > > > > > > series
> > > > > > > > >> of partial counts followed by a sum.
> > > > > > > > >>
> > > > > > > > >
> > > > > > > > > That said, I think there is a valid use case here where a
> > > system
> > > > > > might
> > > > > > > > > make use of different engines to execute different
> > (composable)
> > > > > > layers
> > > > > > > > > of a complex query.
> > > > > > > > >
> > > > > > > > > For example:
> > > > > > > > >
> > > > > > > > > * suppose you want to scan and do predicate pushdown on an
> > > > unusual
> > > > > > > > > data source that is only accessible from one particular
> > engine
> > > > but
> > > > > > > > > * you need to do some analytical operation with the scan
> > > results
> > > > > that
> > > > > > > > > is only supported by another engine
> > > > > > > > >
> > > > > > > > > You could decompose the query into two stages with an IR
> > > > relational
> > > > > > > > > expression for each stage and then use the engines together
> > to
> > > > > > > > > accomplish what you need (of course, you would need an
> > > > > orchestration
> > > > > > > > > layer to handle plumbing the query engine inputs and outputs
> > > > > together
> > > > > > > > > as Arrow streams). Supporting this kind of intra-application
> > > > engine
> > > > > > > > > heterogeneity is one of the motivations for the project.
> > > > > > > > >
> > > > > > > > >>>
> > > > > > > > >>>
> > > > > > > > >>>
> > > > > > > > >>>
> > > > > > > > >>>
> > > > > > > > >>> On Wed, Aug 11, 2021 at 6:10 PM Phillip Cloud <
> > > > cpcl...@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > > >>>
> > > > > > > > >>>> Thanks Wes,
> > > > > > > > >>>>
> > > > > > > > >>>> Great to be back working on Arrow again and engaging with
> > > the
> > > > > > > > community.
> > > > > > > > >>> I
> > > > > > > > >>>> am really excited about this effort.
> > > > > > > > >>>>
> > > > > > > > >>>> I think there are a number of concerns I see as important
> > to
> > > > > > address
> > > > > > > > in
> > > > > > > > >>> the
> > > > > > > > >>>> compute IR proposal:
> > > > > > > > >>>>
> > > > > > > > >>>> 1. Requirement for output types.
> > > > > > > > >>>>
> > > > > > > > >>>> I think that so far there have been many reasons for
> > requiring
> > > > > > > > conforming IR
> > > > > > > > >>>> producers and consumers to adhere to output types, but I
> > > > haven't
> > > > > > seen
> > > > > > > > a
> > > > > > > > >>>> strong rationale for keeping them optional (in the
> > semantic
> > > > > > sense, not
> > > > > > > > >>> WRT
> > > > > > > > >>>> any particular serialization format's representation of
> > > > > > optionality).
> > > > > > > > >>>>
> > > > > > > > >>>> I think a design that includes unambiguous semantics for
> > > > output
> > > > > > types
> > > > > > > > (a
> > > > > > > > >>>> consumer must produce a value of the requested type or
> > it's
> > > an
> > > > > > > > >>>> error/non-conforming) is simpler to reason about for
> > > > producers,
> > > > > > and
> > > > > > > > >>>> provides a strong guarantee for end users (humans or
> > > machines
> > > > > > > > >>> constructing
> > > > > > > > >>>> IR from an API and expecting the thing they ask for back
> > > from
> > > > > the
> > > > > > IR
> > > > > > > > >>>> consumer).
> > > > > > > > >>>>
> > > > > > > > >>>> 2. Flexibility
> > > > > > > > >>>>
> > > > > > > > >>>> The current PR is unable to support what I think
> > > are
> > > > > > killer
> > > > > > > > >>>> features of the IR: custom operators (relational or
> > column)
> > > > and
> > > > > > UDFs.
> > > > > > > > In
> > > > > > > > >>> my
> > > > > > > > >>>> mind, on top of the generalized compute description that
> > the
> > > > IR
> > > > > > > > offers,
> > > > > > > > >>> the
> > > > > > > > >>>> ability for producers and consumers of IR to extend the IR
> > > > > without
> > > > > > > > >>> needing
> > > > > > > > >>>> to modify Arrow or depend on anything except the format is
> > > > > itself
> > > > > > > > >>> something
> > > > > > > > >>>> that is necessary to gain adoption.
> > > > > > > > >>>>
> > > > > > > > >>>> Developers will need to build custom relational operators
> > > > (e.g.,
> > > > > > > > scans of
> > > > > > > > >>>> backends that don't exist anywhere for which a user has
> > code
> > > > to
> > > > > > > > >>> implement)
> > > > > > > > >>>> and custom functions (anything operating on a column that
> > > > > doesn't
> > > > > > > > already
> > > > > > > > >>>> exist, really). Furthermore, I think we can actually drive
> > > > > > building an
> > > > > > > > >>>> Arrow consumer using the same API that an end user would
> > use
> > > > to
> > > > > > extend
> > > > > > > > >>> the
> > > > > > > > >>>> IR.
> > > > > > > > >>>>
> > > > > > > > >>>> 3. Window Functions
> > > > > > > > >>>>
> > > > > > > > >>>> Window functions are, I think, an important part of the IR
> > > > value
> > > > > > > > >>>> proposition, as they are one of the more complex operators
> > > in
> > > > > > > > databases.
> > > > > > > > >>> I
> > > > > > > > >>>> think we need to have something in the initial IR proposal
> > > to
> > > > > > support
> > > > > > > > >>> these
> > > > > > > > >>>> operations.
> > > > > > > > >>>>
> > > > > > > > >>>> 4. Non relational Joins
> > > > > > > > >>>>
> > > > > > > > >>>> Things like as-of join and window join operators aren't
> > yet
> > > > > > fleshed
> > > > > > > > out
> > > > > > > > >>> in
> > > > > > > > >>>> the IR, and I'm not sure they should be in scope for the
> > > > initial
> > > > > > > > >>> prototype.
> > > > > > > > >>>> I think once we settle on a design, we can work the design
> > > of
> > > > > > these
> > > > > > > > >>>> particular operators out during the initial prototype. I
> > > think
> > > > > the
> > > > > > > > >>>> specification of these operators should basically be PR #2
> > > > after
> > > > > > the
> > > > > > > > >>>> initial design lands.
> > > > > > > > >>>>
> > > > > > > > >>>> # Order of Work
> > > > > > > > >>>>
> > > > > > > > >>>> 1. Nail down the design. Anything else is a non-starter.
> > > > > > > > >>>>
> > > > > > > > >>>> 2. Prototype an IR producer using Ibis
> > > > > > > > >>>>
> > > > > > > > >>>> Ibis is IMO a good candidate for a first IR producer as it
> > > > has a
> > > > > > > > number
> > > > > > > > >>> of
> > > > > > > > >>>> desirable properties that make prototyping faster and
> > allow
> > > > for
> > > > > > us to
> > > > > > > > >>>> refine the design of the IR as needed based on how the
> > > > > > implementation
> > > > > > > > >>> goes:
> > > > > > > > >>>> * It's written in Python so it has native support for
> > nearly
> > > > all
> > > > > > of
> > > > > > > > >>>> flatbuffers' features without having to create bindings
> > to
> > > > > C++.
> > > > > > > > >>>> * There's already a set of rules for type checking, as
> > well
> > > as
> > > > > > APIs
> > > > > > > > for
> > > > > > > > >>>> constructing expression trees, which means we don't need
> > to
> > > > > worry
> > > > > > > > about
> > > > > > > > >>>> building a type checker for the prototype.
> > > > > > > > >>>>
> > > > > > > > >>>> 3. Prototype an IR consumer in C++
> > > > > > > > >>>>
> > > > > > > > >>>> I think in parallel to the producer prototype we can
> > further
> > > > > > inform
> > > > > > > > the
> > > > > > > > >>>> design from the consumer side by prototyping an IR
> > consumer
> > > in
> > > > > > C++ . I
> > > > > > > > >>> know
> > > > > > > > >>>> Ben Kietzman has expressed interest in working on this.
> > > > > > > > >>>>
> > > > > > > > >>>> Very interested to hear others' thoughts.
> > > > > > > > >>>>
> > > > > > > > >>>> -Phillip
> > > > > > > > >>>>
> > > > > > > > >>>> On Tue, Aug 10, 2021 at 10:56 AM Wes McKinney <
> > > > > > wesmck...@gmail.com>
> > > > > > > > >>> wrote:
> > > > > > > > >>>>
> > > > > > > > >>>>> Thank you for all the feedback and comments on the
> > > document.
> > > > > I'm
> > > > > > on
> > > > > > > > >>>>> vacation this week, so I'm delayed responding to
> > > everything,
> > > > > but
> > > > > > I
> > > > > > > > >>>>> will get to it as quickly as I can. I will be at VLDB in
> > > > > > Copenhagen
> > > > > > > > >>>>> next week if anyone would like to chat in person about
> > it,
> > > > and
> > > > > > we can
> > > > > > > > >>>>> relay the content of any discussions back to the
> > > > > > document/PR/e-mail
> > > > > > > > >>>>> thread.
> > > > > > > > >>>>>
> > > > > > > > >>>>> I know that Phillip Cloud expressed interest in working
> > on
> > > > the
> > > > > > PR and
> > > > > > > > >>>>> helping work through many of the details, so I'm glad to
> > > have
> > > > > the
> > > > > > > > >>>>> help. If there are others who would like to work on the
> > PR
> > > or
> > > > > dig
> > > > > > > > into
> > > > > > > > >>>>> the details, please let me know. We might need to figure
> > > out
> > > > > how
> > > > > > to
> > > > > > > > >>>>> accommodate "many cooks" by setting up the ComputeIR
> > > project
> > > > > > > > somewhere
> > > > > > > > >>>>> separate from the format/ directory to permit it to exist
> > > in
> > > > a
> > > > > > > > >>>>> Work-In-Progress status for a period of time until we
> > work
> > > > > > through
> > > > > > > > the
> > > > > > > > >>>>> various details and design concerns.
> > > > > > > > >>>>>
> > > > > > > > >>>>> Re Julian's comment
> > > > > > > > >>>>>
> > > > > > > > >>>>>> The biggest surprise is that this language does full
> > > > > relational
> > > > > > > > >>>>> operations. I was expecting that it would do fragments of
> > > the
> > > > > > > > >>> operations.
> > > > > > > > >>>>>
> > > > > > > > >>>>> There's a related but different (yet still interesting
> > and
> > > > > > worthy of
> > > > > > > > >>>>> analysis) problem of creating an "engine language" that
> > > > > describes
> > > > > > > > more
> > > > > > > > >>>>> mechanically the constituent parts of implementing the
> > > > > relational
> > > > > > > > >>>>> operators. To create a functional computation language
> > with
> > > > > > concrete
> > > > > > > > >>>>> Arrow data structures as a top-level primitive sounds
> > like
> > > an
> > > > > > > > >>>>> interesting research area where I could see something
> > > > > developing
> > > > > > > > >>>>> eventually.
> > > > > > > > >>>>>
> > > > > > > > >>>>> The main problem I'm interested in solving right now is
> > > > > enabling
> > > > > > > > front
> > > > > > > > >>>>> ends that have sufficient understanding of relational
> > > algebra
> > > > > and
> > > > > > > > data
> > > > > > > > >>>>> frame operations to talk to engines without having to go
> > > > > > backwards
> > > > > > > > >>>>> from their logical query plans to SQL. So as mentioned in
> > > the
> > > > > > > > >>>>> document, being able to faithfully carry the relational
> > > > > operator
> > > > > > node
> > > > > > > > >>>>> information generated by Calcite or Ibis or another
> > system
> > > > > would
> > > > > > be
> > > > > > > > >>>>> super useful. Defining the semantics of various kinds of
> > > > > > user-defined
> > > > > > > > >>>>> functions would also be helpful to standardize the
> > > > > > > > >>>>> engine-to-user-language UDF/extension interface.
> > > > > > > > >>>>>
> > > > > > > > >>>>> On Tue, Aug 10, 2021 at 2:36 PM Dimitri Vorona <
> > > > > > alen...@gmail.com>
> > > > > > > > >>>> wrote:
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> Hi Wes,
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> cool initiative! Reminded me of "Building Advanced SQL
> > > > > Analytics
> > > > > > > > From
> > > > > > > > >>>>>> Low-Level Plan Operators" from SIGMOD 2021 (
> > > > > > > > >>>>>> http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf)
> > > > which
> > > > > > > > >>>> proposes a
> > > > > > > > >>>>>> set of building blocks for advanced aggregation.
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> Cheers,
> > > > > > > > >>>>>> Dimitri.
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <
> > > > > > jhyde.apa...@gmail.com>
> > > > > > > > >>>>> wrote:
> > > > > > > > >>>>>>
> > > > > > > > >>>>>>> Wes,
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> Thanks for this. I’ve added comments to the doc and to
> > > the
> > > > > PR.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> The biggest surprise is that this language does full
> > > > > relational
> > > > > > > > >>>>>>> operations. I was expecting that it would do fragments
> > of
> > > > the
> > > > > > > > >>>>> operations.
> > > > > > > > >>>>>>> Consider join. A distributed hybrid hash join needs to
> > > > > > partition
> > > > > > > > >>> rows
> > > > > > > > >>>>> into
> > > > > > > > >>>>>>> output buffers based on a hash key, build hash tables,
> > > > probe
> > > > > > into
> > > > > > > > >>>> hash
> > > > > > > > >>>>>>> tables, scan hash tables for untouched “outer”rows, and
> > > so
> > > > > > forth.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> I see Arrow’s compute as delivering each of those
> > > > operations,
> > > > > > > > >>> working
> > > > > > > > >>>>> on
> > > > > > > > >>>>>>> perhaps a batch at a time, or a sequence of batches,
> > with
> > > > > some
> > > > > > > > >>> other
> > > > > > > > >>>>> system
> > > > > > > > >>>>>>> coordinating those tasks. So I would expect to see
> > > Arrow’s
> > > > > > compute
> > > > > > > > >>>>> language
> > > > > > > > >>>>>>> mainly operating on batches rather than a table
> > > > abstraction.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> Julian
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>>> On Aug 2, 2021, at 5:16 PM, Wes McKinney <
> > > > > wesmck...@gmail.com
> > > > > > >
> > > > > > > > >>>>> wrote:
> > > > > > > > >>>>>>>>
> > > > > > > > >>>>>>>> hi folks,
> > > > > > > > >>>>>>>>
> > > > > > > > >>>>>>>> This idea came up in passing in the past -- given that
> > > > there
> > > > > > are
> > > > > > > > >>>>>>>> multiple independent efforts to develop Arrow-native
> > > query
> > > > > > > > >>> engines
> > > > > > > > >>>>>>>> (and surely many more to come), it seems like it would
> > > be
> > > > > > > > >>> valuable
> > > > > > > > >>>> to
> > > > > > > > >>>>>>>> have a way to enable user languages (like Java,
> > Python,
> > > R,
> > > > > or
> > > > > > > > >>> Rust,
> > > > > > > > >>>>>>>> for example) to communicate with backend computing
> > > engines
> > > > > > (like
> > > > > > > > >>>>>>>> DataFusion, or new computing capabilities being built
> > in
> > > > the
> > > > > > > > >>> Arrow
> > > > > > > > >>>>> C++
> > > > > > > > >>>>>>>> library) in a fashion that is "lower-level" than SQL
> > and
> > > > > > > > >>>> specialized
> > > > > > > > >>>>>>>> to Arrow's type system. So rather than leaving it to a
> > > SQL
> > > > > > > > >>> parser /
> > > > > > > > >>>>>>>> analyzer framework to generate an expression tree of
> > > > > > relational
> > > > > > > > >>>>>>>> operators and then translate that to an Arrow-native
> > > > > > > > >>>> compute-engine's
> > > > > > > > >>>>>>>> internal grammar, a user framework could provide the
> > > > desired
> > > > > > > > >>>>>>>> Arrow-native expression tree / data manipulations
> > > directly
> > > > > and
> > > > > > > > >>> skip
> > > > > > > > >>>>>>>> the SQL altogether.
> > > > > > > > >>>>>>>>
> > > > > > > > >>>>>>>> The idea of creating a "serialized intermediate
> > > > > representation
> > > > > > > > >>>> (IR)"
> > > > > > > > >>>>>>>> for Arrow compute operations would be to serve use
> > cases
> > > > > large
> > > > > > > > >>> and
> > > > > > > > >>>>>>>> small -- from the most complex TPC-* or time series
> > > > database
> > > > > > > > >>> query
> > > > > > > > >>>> to
> > > > > > > > >>>>>>>> the most simple array predicate/filter sent with an
> > RPC
> > > > > > request
> > > > > > > > >>>> using
> > > > > > > > >>>>>>>> Arrow Flight. It is deliberately language- and
> > > > > > engine-agnostic,
> > > > > > > > >>>> with
> > > > > > > > >>>>>>>> the only commonality across usages being the Arrow
> > > > columnar
> > > > > > > > >>> format
> > > > > > > > >>>>>>>> (schemas and array types). This would be better than
> > > > leaving
> > > > > > it
> > > > > > > > >>> to
> > > > > > > > >>>>>>>> each application to develop its own bespoke expression
> > > > > > > > >>>>> representations
> > > > > > > > >>>>>>>> for its needs.
> > > > > > > > >>>>>>>>
> > > > > > > > >>>>>>>> I spent a while thinking about this and wrote up a
> > brain
> > > > > dump
> > > > > > RFC
> > > > > > > > >>>>>>>> document [1] and accompanying pull request [2] that
> > > makes
> > > > > the
> > > > > > > > >>>>> possibly
> > > > > > > > >>>>>>>> controversial choice of using Flatbuffers to represent
> > > the
> > > > > > > > >>>> serialized
> > > > > > > > >>>>>>>> IR. I discuss the rationale for the choice of
> > > Flatbuffers
> > > > in
> > > > > > the
> > > > > > > > >>>> RFC
> > > > > > > > >>>>>>>> document. This PR is obviously deficient in many
> > regards
> > > > > > > > >>>> (incomplete,
> > > > > > > > >>>>>>>> hacky, or unclear in places), and will need some help
> > > from
> > > > > > others
> > > > > > > > >>>> to
> > > > > > > > >>>>>>>> flesh out. I suspect that actually implementing the IR
> > > > will
> > > > > be
> > > > > > > > >>>>>>>> necessary to work out many of the low-level details.
> > > > > > > > >>>>>>>>
> > > > > > > > >>>>>>>> Note that this IR is intended to be more of a
> > "superset"
> > > > > > project
> > > > > > > > >>>> than
> > > > > > > > >>>>>>>> a "lowest common denominator". So there may be things
> > > that
> > > > > it
> > > > > > > > >>>>> includes
> > > > > > > > >>>>>>>> which are only available in some engines (e.g. engines
> > > > that
> > > > > > have
> > > > > > > > >>>>>>>> specialized handling of time series data).
> > > > > > > > >>>>>>>>
> > > > > > > > >>>>>>>> As some of my personal motivation for the project,
> > > > > concurrent
> > > > > > > > >>> with
> > > > > > > > >>>>> the
> > > > > > > > >>>>>>>> genesis of Apache Arrow, I started a Python project
> > > called
> > > > > > Ibis
> > > > > > > > >>> [3]
> > > > > > > > >>>>>>>> (which is similar to R's dplyr project) which serves
> > as
> > > a
> > > > > > "Python
> > > > > > > > >>>>>>>> analytical query IR builder" that is capable of
> > > generating
> > > > > > most
> > > > > > > > >>> of
> > > > > > > > >>>>> the
> > > > > > > > >>>>>>>> SQL standard, targeting many different SQL dialects
> > and
> > > > > other
> > > > > > > > >>>>> backends
> > > > > > > > >>>>>>>> (like pandas). Microsoft ([4]) and Google ([5]) have
> > > used
> > > > > this
> > > > > > > > >>>>> library
> > > > > > > > >>>>>>>> as a "many-SQL" middleware. As such, I would like to
> > be
> > > > able
> > > > > > to
> > > > > > > > >>>>>>>> translate between the in-memory "logical query" data
> > > > > > structures
> > > > > > > > >>> in
> > > > > > > > >>>> a
> > > > > > > > >>>>>>>> library like Ibis to a serialized format that can be
> > > > > executed
> > > > > > by
> > > > > > > > >>>> many
> > > > > > > > >>>>>>>> different Arrow-native query engines. The expression
> > > > > > primitives
> > > > > > > > >>>>>>>> available in Ibis should serve as helpful test cases,
> > > too.
> > > > > > > > >>>>>>>>
> > > > > > > > >>>>>>>> I look forward to the community's comments on the RFC
> > > > > document
> > > > > > > > >>> [1]
> > > > > > > > >>>>> and
> > > > > > > > >>>>>>>> pull request [2] -- I realize that arriving at
> > consensus
> > > > on
> > > > > a
> > > > > > > > >>>> complex
> > > > > > > > >>>>>>>> and ambitious project like this can be challenging so
> > I
> > > > > > recommend
> > > > > > > > >>>>>>>> spending time on the "non-goals" section in the RFC
> > and
> > > > ask
> > > > > > > > >>>> questions
> > > > > > > > >>>>>>>> if you are unclear about the scope of what problems
> > this
> > > > is
> > > > > > > > >>> trying
> > > > > > > > >>>> to
> > > > > > > > >>>>>>>> solve. I would be happy to give Edit access on the RFC
> > > > > > document
> > > > > > > > >>> to
> > > > > > > > >>>>>>>> others and would consider ideas about how to move
> > > forward
> > > > > with
> > > > > > > > >>>>>>>> something that is able to be implemented by different
> > > > Arrow
> > > > > > > > >>>> libraries
> > > > > > > > >>>>>>>> in the reasonably near future.
> > > > > > > > >>>>>>>>
> > > > > > > > >>>>>>>> Thanks,
> > > > > > > > >>>>>>>> Wes
> > > > > > > > >>>>>>>>
> > > > > > > > >>>>>>>> [1]:
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>
> > > > > > > > >>>>
> > > > > > > > >>>
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> > > > > > > > >>>>>>>> [2]: https://github.com/apache/arrow/pull/10856
> > > > > > > > >>>>>>>> [3]: https://ibis-project.org/
> > > > > > > > >>>>>>>> [4]:
> > > > http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> > > > > > > > >>>>>>>> [5]:
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>
> > > > > > > > >>>>
> > > > > > > > >>>
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>
> > > > > > > > >>>>
> > > > > > > > >>>
> > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
