Hey all,

Just wanted to give an update on the effort here.
Ben Kietzman has created an alternative proposal to the initial design [1].
It largely overlaps with the original, but differs in a few important ways:

* A big focus of the design is on flexibility, allowing producers, consumers,
  and ultimately end users of those systems the ability to define custom
  operations in the graph.
* There are very few predefined relational operations (project, filter, join,
  and a handful of others).
* There are only 3 types of value expressions: literals, field references,
  and function calls.
* The model of evaluation requires a final sink node, to indicate where the
  record batches from child relations end up (a file, a socket, an in-memory
  buffer, etc.).

I've added notes [2] to the original Google doc (under the Alternative Design
Notes subheading), along with a few pseudocode examples. Unfortunately, these
went out of date as soon as Ben pushed the PR [3], so I need to update them to
reflect his changes. Regardless, the design is broadly the same, so the notes
should still give a good indication of the details.

There are a decent number of review comments on the original PR that I plan to
port over where they are still relevant. I also plan on adding support for
window functions either tonight or on Monday.

Please review this design at your earliest convenience. Since there's a fairly
concrete set of types in flatbuffers that we can look at, ideally we can
center discussion around the details in the PR.

Thanks!

[1]: https://github.com/apache/arrow/pull/10856
[2]: https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.4tfbbtaqzu13
[3]: https://github.com/apache/arrow/pull/10934

On Thu, Aug 12, 2021 at 3:55 PM Julian Hyde <jhyde.apa...@gmail.com> wrote:
>
> Wes wrote:
>
> > Supporting this kind of intra-application engine
> > heterogeneity is one of the motivations for the project.
>
> +1
>
> The data format is the natural interface between tasks.
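To make the shape of the alternative design concrete, here is a rough Python sketch of the three value-expression kinds plus the required sink node. The class and field names are illustrative only, not the actual flatbuffers types in Ben's PR:

```python
from dataclasses import dataclass
from typing import Any

# The three value-expression kinds in the alternative design.
@dataclass
class Literal:
    value: Any

@dataclass
class FieldRef:
    name: str

@dataclass
class Call:
    function: str    # e.g. "add", "sum", "greater"
    arguments: list  # of Literal | FieldRef | Call

# A tiny set of relations plus the required sink node.
@dataclass
class Read:
    table: str

@dataclass
class Filter:
    input: Any        # child relation
    predicate: Call

@dataclass
class Sink:
    input: Any        # child relation
    destination: str  # a file, a socket, an in-memory buffer, ...

# Toy plan: scan table t, keep rows where x > 2, write results to a buffer.
plan = Sink(
    input=Filter(
        input=Read("t"),
        predicate=Call("greater", [FieldRef("x"), Literal(2)]),
    ),
    destination="buffer://result",
)
```

The point of the sink is that every plan states explicitly where its record batches end up, so a consumer never has to guess what to do with the root relation's output.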
> (Defining “task” here as “something that is programmed using the IR”.)
> That is Arrow’s strength.
>
> So I think the IR should describe what each task should do, and tasks
> should be fairly small. Not whole relational operators, operating on whole
> tables, but pieces of relational operators, operating on batches or
> sequences of batches.
>
> Elsethread, someone mentioned the LoLePop concept and the Kohn/Leis/Neumann
> paper [1]. The LoLePop concept sounds good for our purposes.
>
> Julian
>
> [1] https://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf
>
> > On Aug 12, 2021, at 5:19 AM, Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > On Wed, Aug 11, 2021 at 11:22 PM Phillip Cloud <cpcl...@gmail.com> wrote:
> >>
> >> On Wed, Aug 11, 2021 at 4:48 PM Jorge Cardoso Leitão <
> >> jorgecarlei...@gmail.com> wrote:
> >>
> >>> Couple of questions:
> >>>
> >>> 1. Is the goal that IRs have equal semantics, i.e. given (IR, data),
> >>> the operation "(IR, data) - engine -> result" MUST be the same for
> >>> all "engine"?
> >>
> >> I think that might be a non-starter for mundane reasons: there are
> >> probably at least two engines that disagree on the result of sum(x)
> >> where x is a floating-point column.
> >>
> >>> 2. If yes, imo we may need to worry about:
> >>> * a definition of equality that implementations agree on.
> >>> * agreement over what the semantics look like. For example, do we
> >>> use Kleene logic for AND and OR?
> >>
> >> WRT Kleene logic, my thoughts are that the IR should support both
> >> Kleene and non-Kleene, and producers can choose their desired
> >> semantics.
> >>
> >> Ibis, for example, would override the `&` operator in `a & b` to
> >> produce `KleeneAnd(Column(a), Column(b))`.
> >>
> >>> To try to understand the gist, let's pick an aggregated count over a
> >>> column: engines often do partial counts over partitions followed by
> >>> a final "sum" over the partial counts.
> >>> Is the idea that the query engine would communicate with the compute
> >>> engine via 2 IRs, where one is "count me these" and the other is
> >>> "sum me these"?
> >>>
> >>> Best,
> >>> Jorge
> >>
> >> Not in its current incarnation.
> >>
> >> The idea is that the IR producer communicates a desire to count(x) to
> >> a consumer, and it's up to the consumer to figure out how to turn that
> >> count into something that makes sense for itself. In your example
> >> that's a series of partial counts followed by a sum.
> >
> > That said, I think there is a valid use case here where a system might
> > make use of different engines to execute different (composable) layers
> > of a complex query.
> >
> > For example:
> >
> > * suppose you want to scan and do predicate pushdown on an unusual
> > data source that is only accessible from one particular engine, but
> > * you need to do some analytical operation with the scan results that
> > is only supported by another engine
> >
> > You could decompose the query into two stages, with an IR relational
> > expression for each stage, and then use the engines together to
> > accomplish what you need (of course, you would need an orchestration
> > layer to handle plumbing the query engine inputs and outputs together
> > as Arrow streams). Supporting this kind of intra-application engine
> > heterogeneity is one of the motivations for the project.
> >
> >>> On Wed, Aug 11, 2021 at 6:10 PM Phillip Cloud <cpcl...@gmail.com> wrote:
> >>>
> >>>> Thanks Wes,
> >>>>
> >>>> Great to be back working on Arrow again and engaging with the
> >>>> community. I am really excited about this effort.
> >>>>
> >>>> I think there are a number of concerns I see as important to address
> >>>> in the compute IR proposal:
> >>>>
> >>>> 1. Requirement for output types.
> >>>>
> >>>> I think that so far there have been many reasons for requiring
> >>>> conforming IR producers and consumers to adhere to output types, but
> >>>> I haven't seen a strong rationale for keeping them optional (in the
> >>>> semantic sense, not WRT any particular serialization format's
> >>>> representation of optionality).
> >>>>
> >>>> I think a design that includes unambiguous semantics for output
> >>>> types (a consumer must produce a value of the requested type or it's
> >>>> an error/non-conforming) is simpler to reason about for producers,
> >>>> and provides a strong guarantee for end users (humans or machines
> >>>> constructing IR from an API and expecting the thing they ask for
> >>>> back from the IR consumer).
> >>>>
> >>>> 2. Flexibility
> >>>>
> >>>> The current PR is unable to support what I think are killer features
> >>>> of the IR: custom operators (relational or column) and UDFs. In my
> >>>> mind, on top of the generalized compute description that the IR
> >>>> offers, the ability for producers and consumers of IR to extend the
> >>>> IR without needing to modify Arrow or depend on anything except the
> >>>> format is itself something that is necessary to gain adoption.
> >>>>
> >>>> Developers will need to build custom relational operators (e.g.,
> >>>> scans of backends that don't exist anywhere, for which a user has
> >>>> code to implement) and custom functions (anything operating on a
> >>>> column that doesn't already exist, really). Furthermore, I think we
> >>>> can actually drive building an Arrow consumer using the same API
> >>>> that an end user would use to extend the IR.
> >>>>
> >>>> 3. Window Functions
> >>>>
> >>>> Window functions are, I think, an important part of the IR value
> >>>> proposition, as they are one of the more complex operators in
> >>>> databases.
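To sketch what that support could look like, here is one hypothetical encoding of a window function as a call that carries partitioning, ordering, and frame metadata. These names are purely illustrative, not the schema I plan to propose:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FieldRef:
    name: str

@dataclass
class Frame:
    preceding: Optional[int]  # None = UNBOUNDED PRECEDING
    following: Optional[int]  # None = UNBOUNDED FOLLOWING

@dataclass
class WindowCall:
    function: str             # e.g. "row_number", "sum", "lag"
    arguments: List[FieldRef]
    partition_by: List[FieldRef]
    order_by: List[FieldRef]
    frame: Frame

# row_number() OVER (PARTITION BY k ORDER BY t)
expr = WindowCall(
    function="row_number",
    arguments=[],
    partition_by=[FieldRef("k")],
    order_by=[FieldRef("t")],
    frame=Frame(preceding=None, following=0),
)
```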
> >>>> I think we need to have something in the initial IR proposal to
> >>>> support these operations.
> >>>>
> >>>> 4. Non-relational Joins
> >>>>
> >>>> Things like as-of join and window join operators aren't yet fleshed
> >>>> out in the IR, and I'm not sure they should be in scope for the
> >>>> initial prototype. I think once we settle on a design, we can work
> >>>> out the design of these particular operators during the initial
> >>>> prototype. I think the specification of these operators should
> >>>> basically be PR #2 after the initial design lands.
> >>>>
> >>>> # Order of Work
> >>>>
> >>>> 1. Nail down the design. Anything else is a non-starter.
> >>>>
> >>>> 2. Prototype an IR producer using Ibis
> >>>>
> >>>> Ibis is IMO a good candidate for a first IR producer, as it has a
> >>>> number of desirable properties that make prototyping faster and
> >>>> allow us to refine the design of the IR as needed based on how the
> >>>> implementation goes:
> >>>> * It's written in Python, so it has native support for nearly all of
> >>>> flatbuffers' features without having to create bindings to C++.
> >>>> * There's already a set of rules for type checking, as well as APIs
> >>>> for constructing expression trees, which means we don't need to
> >>>> worry about building a type checker for the prototype.
> >>>>
> >>>> 3. Prototype an IR consumer in C++
> >>>>
> >>>> I think in parallel to the producer prototype we can further inform
> >>>> the design from the consumer side by prototyping an IR consumer in
> >>>> C++. I know Ben Kietzman has expressed interest in working on this.
> >>>>
> >>>> Very interested to hear others' thoughts.
> >>>>
> >>>> -Phillip
> >>>>
> >>>> On Tue, Aug 10, 2021 at 10:56 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >>>>
> >>>>> Thank you for all the feedback and comments on the document.
> >>>>> I'm on vacation this week, so I'm delayed responding to everything,
> >>>>> but I will get to it as quickly as I can. I will be at VLDB in
> >>>>> Copenhagen next week if anyone would like to chat in person about
> >>>>> it, and we can relay the content of any discussions back to the
> >>>>> document/PR/e-mail thread.
> >>>>>
> >>>>> I know that Phillip Cloud expressed interest in working on the PR
> >>>>> and helping work through many of the details, so I'm glad to have
> >>>>> the help. If there are others who would like to work on the PR or
> >>>>> dig into the details, please let me know. We might need to figure
> >>>>> out how to accommodate "many cooks" by setting up the ComputeIR
> >>>>> project somewhere separate from the format/ directory to permit it
> >>>>> to exist in a Work-In-Progress status for a period of time until we
> >>>>> work through the various details and design concerns.
> >>>>>
> >>>>> Re Julian's comment:
> >>>>>
> >>>>>> The biggest surprise is that this language does full relational
> >>>>>> operations. I was expecting that it would do fragments of the
> >>>>>> operations.
> >>>>>
> >>>>> There's a related but different (yet still interesting and worthy
> >>>>> of analysis) problem of creating an "engine language" that
> >>>>> describes more mechanically the constituent parts of implementing
> >>>>> the relational operators. To create a functional computation
> >>>>> language with concrete Arrow data structures as a top-level
> >>>>> primitive sounds like an interesting research area where I could
> >>>>> see something developing eventually.
> >>>>>
> >>>>> The main problem I'm interested in solving right now is enabling
> >>>>> front ends that have sufficient understanding of relational algebra
> >>>>> and data frame operations to talk to engines without having to go
> >>>>> backwards from their logical query plans to SQL.
> >>>>> So as mentioned in the document, being able to faithfully carry
> >>>>> the relational operator node information generated by Calcite or
> >>>>> Ibis or another system would be super useful. Defining the
> >>>>> semantics of various kinds of user-defined functions would also be
> >>>>> helpful to standardize the engine-to-user-language UDF/extension
> >>>>> interface.
> >>>>>
> >>>>> On Tue, Aug 10, 2021 at 2:36 PM Dimitri Vorona <alen...@gmail.com> wrote:
> >>>>>>
> >>>>>> Hi Wes,
> >>>>>>
> >>>>>> cool initiative! Reminded me of "Building Advanced SQL Analytics
> >>>>>> From Low-Level Plan Operators" from SIGMOD 2021
> >>>>>> (http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf), which
> >>>>>> proposes a set of building blocks for advanced aggregation.
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Dimitri.
> >>>>>>
> >>>>>> On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <jhyde.apa...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Wes,
> >>>>>>>
> >>>>>>> Thanks for this. I’ve added comments to the doc and to the PR.
> >>>>>>>
> >>>>>>> The biggest surprise is that this language does full relational
> >>>>>>> operations. I was expecting that it would do fragments of the
> >>>>>>> operations. Consider join. A distributed hybrid hash join needs
> >>>>>>> to partition rows into output buffers based on a hash key, build
> >>>>>>> hash tables, probe into hash tables, scan hash tables for
> >>>>>>> untouched “outer” rows, and so forth.
> >>>>>>>
> >>>>>>> I see Arrow’s compute as delivering each of those operations,
> >>>>>>> working on perhaps a batch at a time, or a sequence of batches,
> >>>>>>> with some other system coordinating those tasks. So I would
> >>>>>>> expect to see Arrow’s compute language mainly operating on
> >>>>>>> batches rather than a table abstraction.
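A minimal sketch of that batch-at-a-time model, with plain Python lists standing in for Arrow record batches (illustrative only, not an actual Arrow API): each "piece" of a relational operator is a small function over a batch or a stream of batches, and some outer system wires the pieces together.

```python
from typing import Iterable, Iterator, List

Batch = List[int]  # stand-in for an Arrow record batch

def filter_batches(batches: Iterable[Batch], threshold: int) -> Iterator[Batch]:
    """One piece of a filter operator: process one batch at a time."""
    for batch in batches:
        yield [x for x in batch if x > threshold]

def partition_by_hash(batch: Batch, n_buckets: int) -> List[Batch]:
    """One piece of a hash join: partition a single batch by hash key."""
    buckets: List[Batch] = [[] for _ in range(n_buckets)]
    for x in batch:
        buckets[hash(x) % n_buckets].append(x)
    return buckets

stream = [[1, 5, 3], [7, 2]]
filtered = list(filter_batches(stream, 2))      # [[5, 3], [7]]
buckets = partition_by_hash([0, 1, 2, 3], 2)    # [[0, 2], [1, 3]]
```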
> >>>>>>> Julian
> >>>>>>>
> >>>>>>>> On Aug 2, 2021, at 5:16 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> hi folks,
> >>>>>>>>
> >>>>>>>> This idea came up in passing in the past -- given that there are
> >>>>>>>> multiple independent efforts to develop Arrow-native query
> >>>>>>>> engines (and surely many more to come), it seems like it would
> >>>>>>>> be valuable to have a way to enable user languages (like Java,
> >>>>>>>> Python, R, or Rust, for example) to communicate with backend
> >>>>>>>> computing engines (like DataFusion, or new computing
> >>>>>>>> capabilities being built in the Arrow C++ library) in a fashion
> >>>>>>>> that is "lower-level" than SQL and specialized to Arrow's type
> >>>>>>>> system. So rather than leaving it to a SQL parser/analyzer
> >>>>>>>> framework to generate an expression tree of relational
> >>>>>>>> operators and then translate that to an Arrow-native compute
> >>>>>>>> engine's internal grammar, a user framework could provide the
> >>>>>>>> desired Arrow-native expression tree / data manipulations
> >>>>>>>> directly and skip the SQL altogether.
> >>>>>>>>
> >>>>>>>> The idea of creating a "serialized intermediate representation
> >>>>>>>> (IR)" for Arrow compute operations would be to serve use cases
> >>>>>>>> large and small -- from the most complex TPC-* or time series
> >>>>>>>> database query to the most simple array predicate/filter sent
> >>>>>>>> with an RPC request using Arrow Flight. It is deliberately
> >>>>>>>> language- and engine-agnostic, with the only commonality across
> >>>>>>>> usages being the Arrow columnar format (schemas and array
> >>>>>>>> types). This would be better than leaving it to each
> >>>>>>>> application to develop its own bespoke expression
> >>>>>>>> representations for its needs.
> >>>>>>>> I spent a while thinking about this and wrote up a brain dump
> >>>>>>>> RFC document [1] and accompanying pull request [2] that makes
> >>>>>>>> the possibly controversial choice of using Flatbuffers to
> >>>>>>>> represent the serialized IR. I discuss the rationale for the
> >>>>>>>> choice of Flatbuffers in the RFC document. This PR is obviously
> >>>>>>>> deficient in many regards (incomplete, hacky, or unclear in
> >>>>>>>> places), and will need some help from others to flesh out. I
> >>>>>>>> suspect that actually implementing the IR will be necessary to
> >>>>>>>> work out many of the low-level details.
> >>>>>>>>
> >>>>>>>> Note that this IR is intended to be more of a "superset"
> >>>>>>>> project than a "lowest common denominator". So there may be
> >>>>>>>> things that it includes which are only available in some
> >>>>>>>> engines (e.g. engines that have specialized handling of time
> >>>>>>>> series data).
> >>>>>>>>
> >>>>>>>> As some of my personal motivation for the project, concurrent
> >>>>>>>> with the genesis of Apache Arrow, I started a Python project
> >>>>>>>> called Ibis [3] (which is similar to R's dplyr project) which
> >>>>>>>> serves as a "Python analytical query IR builder" capable of
> >>>>>>>> generating most of the SQL standard, targeting many different
> >>>>>>>> SQL dialects and other backends (like pandas). Microsoft ([4])
> >>>>>>>> and Google ([5]) have used this library as a "many-SQL"
> >>>>>>>> middleware. As such, I would like to be able to translate
> >>>>>>>> between the in-memory "logical query" data structures in a
> >>>>>>>> library like Ibis to a serialized format that can be executed
> >>>>>>>> by many different Arrow-native query engines. The expression
> >>>>>>>> primitives available in Ibis should serve as helpful test
> >>>>>>>> cases, too.
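As a toy illustration of that producer-side translation step, here is a hedged sketch of lowering an in-memory expression tree to a flat, serializable form. This is not the Ibis API or the flatbuffers schema; the names and the dict "wire" format are hypothetical:

```python
from dataclasses import dataclass
from typing import Any, List

# Hypothetical in-memory expression nodes, as a producer like Ibis might hold.
@dataclass
class FieldRef:
    name: str

@dataclass
class Call:
    function: str
    arguments: List[Any]

def to_wire(expr: Any) -> dict:
    """Lower an expression tree to a plain-dict stand-in for the serialized IR."""
    if isinstance(expr, FieldRef):
        return {"kind": "field_ref", "name": expr.name}
    if isinstance(expr, Call):
        return {"kind": "call",
                "function": expr.function,
                "arguments": [to_wire(a) for a in expr.arguments]}
    # Anything else is treated as a literal value.
    return {"kind": "literal", "value": expr}

# sum(x) expressed as a function call over a field reference
wire = to_wire(Call("sum", [FieldRef("x")]))
```

A real producer would emit flatbuffers rather than dicts, but the recursive lowering step would have the same shape.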
> >>>>>>>> I look forward to the community's comments on the RFC document
> >>>>>>>> [1] and pull request [2] -- I realize that arriving at
> >>>>>>>> consensus on a complex and ambitious project like this can be
> >>>>>>>> challenging, so I recommend spending time on the "non-goals"
> >>>>>>>> section in the RFC and asking questions if you are unclear
> >>>>>>>> about the scope of what problems this is trying to solve. I
> >>>>>>>> would be happy to give Edit access on the RFC document to
> >>>>>>>> others and would consider ideas about how to move forward with
> >>>>>>>> something that is able to be implemented by different Arrow
> >>>>>>>> libraries in the reasonably near future.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Wes
> >>>>>>>>
> >>>>>>>> [1]: https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> >>>>>>>> [2]: https://github.com/apache/arrow/pull/10856
> >>>>>>>> [3]: https://ibis-project.org/
> >>>>>>>> [4]: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> >>>>>>>> [5]: https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt