Hi, I also agree that we should follow a model similar to what you propose. I think the plan (correct me if I'm wrong, Wes) is to write the logical plan operators, then build a small execution engine prototype and produce a proper design document out of that experiment. There's also a placeholder ticket: https://issues.apache.org/jira/browse/ARROW-4333
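
For concreteness, here is a rough sketch of what such logical plan operators might look like in C++. Everything below (names, structure) is hypothetical, just to anchor the discussion, not a committed API:

    // Hypothetical sketch of logical plan operators, not a committed API.
    #include <memory>
    #include <string>
    #include <vector>

    // A logical operator describes *what* to compute, not *how*. A later
    // lowering pass would pick physical operators and decide which
    // expression subgraphs can be handed to Gandiva.
    struct LogicalOperator {
      virtual ~LogicalOperator() = default;
      virtual std::string name() const = 0;
      std::vector<std::shared_ptr<LogicalOperator>> inputs;
    };

    // Leaf operator: read record batches from a data source.
    struct Scan : LogicalOperator {
      std::string name() const override { return "scan"; }
    };

    // Compute expressions over the columns of the input.
    struct Project : LogicalOperator {
      std::string name() const override { return "project"; }
    };

    // Keep only the rows matching a predicate expression.
    struct Filter : LogicalOperator {
      std::string name() const override { return "filter"; }
    };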
François

On Wed, Feb 13, 2019 at 9:54 AM Wes McKinney <wesmck...@gmail.com> wrote:
> hi Ravindra,
>
> On Wed, Feb 13, 2019 at 1:34 AM Ravindra Pindikura <ravin...@dremio.com>
> wrote:
> >
> > Hi,
> >
> > I was looking at the recent check-in for Arrow kernels, and started
> > to think about how they would work alongside Gandiva.
> >
> > Here are my thoughts:
> >
> > 1. Gandiva already has two high-level operators, namely project and
> > filter, with runtime code generation.
> >
> > * It already supports hundreds of functions (e.g. a + b, a > b),
> > which can be combined into expressions (e.g. a + b > c && a + b < d)
> > for each of the operators, and we'll likely continue to add more.
> > * It works on one record batch at a time: it consumes a record batch
> > and produces a record batch.
> > * The operators can be linked together (e.g. project -> filter ->
> > project) to build a pipeline.
> > * We may build additional operators in the future that could benefit
> > from code generation (e.g. Impala uses code generation when parsing
> > Avro files).
> >
> > 2. Arrow kernels
> >
> > a. Support for project/filter operators.
> >
> > Useful for functions where there is no benefit from code generation,
> > or where code generation can be skipped (eager evaluation).
> >
> > b. Support for additional operators like aggregates.
> >
> > How do we combine and link the Gandiva operators and the kernels?
> > For example, it would be nice to have a pipeline with scan (read
> > from a source), project (expression on a column), filter (extract
> > rows), and aggregate (sum on the extracted column).
>
> I'm just starting on a project to enable logical expressions, formed
> from a superset of relational algebra, to be built with a C++ API. In
> the past I built a fairly deep system for this
> (https://github.com/ibis-project/ibis) that captures SQL semantics
> while also permitting other kinds of analytical functionality that
> spills outside of SQL. As soon as I cobble together a prototype I'll
> be interested in lots of feedback, since this is a system that we'll
> use for many years.
>
> My idea of the interplay with Gandiva is that expressions will be
> "lowered" to physical execution operators such as:
>
> * Projection (with/without filters)
> * Join
> * Hash aggregate (with/without filters)
> * Sort
> * Time series operations (resampling / binning, etc.)
>
> Gandiva therefore serves as a subgraph compiler for constituent parts
> of these operations. During the lowering process, operator subgraphs
> will need to be recognized as "able to be compiled with Gandiva". For
> example:
>
> * log(table.a + table.b) + 1 can be compiled with Gandiva
> * udf(table.a), where "udf" is a user-defined function written in,
> say, Python, cannot
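
To make Wes's first example concrete: with Gandiva's tree expression builder, log(table.a + table.b) + 1 can be built and JIT-compiled roughly as follows. Treat this as a sketch; it assumes a float64 "log" function is registered in Gandiva's function registry, which I believe is the case:

    // Sketch: building and JIT-compiling log(a + b) + 1 with Gandiva.
    #include <memory>

    #include "arrow/api.h"
    #include "gandiva/projector.h"
    #include "gandiva/tree_expr_builder.h"

    arrow::Status MakeLogProjector(
        std::shared_ptr<gandiva::Projector>* projector) {
      auto field_a = arrow::field("a", arrow::float64());
      auto field_b = arrow::field("b", arrow::float64());
      auto schema = arrow::schema({field_a, field_b});

      // Build the expression tree for log(a + b) + 1.
      auto a = gandiva::TreeExprBuilder::MakeField(field_a);
      auto b = gandiva::TreeExprBuilder::MakeField(field_b);
      auto sum = gandiva::TreeExprBuilder::MakeFunction(
          "add", {a, b}, arrow::float64());
      auto log_sum = gandiva::TreeExprBuilder::MakeFunction(
          "log", {sum}, arrow::float64());
      auto one = gandiva::TreeExprBuilder::MakeLiteral(1.0);
      auto root = gandiva::TreeExprBuilder::MakeFunction(
          "add", {log_sum, one}, arrow::float64());
      auto expr = gandiva::TreeExprBuilder::MakeExpression(
          root, arrow::field("out", arrow::float64()));

      // JIT-compile the expression into a reusable projector.
      return gandiva::Projector::Make(schema, {expr}, projector);
    }

A caller would then invoke the projector's Evaluate() once per incoming record batch, which fits the consume-one-batch / produce-one-batch model Ravindra describes below.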
> > To do this, I think we would need to be able to build a pipeline
> > with high-level operators that move data along one record batch at
> > a time:
> >
> > - source operators, which only produce record batches (e.g. a CSV
> > reader)
> > - intermediate operators, which consume and produce record batches
> > (e.g. the Gandiva project operator)
> > - terminal operators, which emit the final output from the end of
> > the pipeline when there is nothing left to consume (e.g. SumKernel)
> >
> > Are we thinking along these lines?
>
> Yes, that sounds right.
>
> We don't have a "dataset" framework yet for source operators (CSV,
> JSON, and Parquet will be the first ones). I plan to write a
> requirements document about this project, as we have the machinery in
> place to e.g. read Parquet and CSV files but no unified abstract
> dataset API.
>
> - Wes
>
> > Thanks & regards,
> > Ravindra.
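
As a follow-up illustration of the source / intermediate / terminal operator split Ravindra describes above, here is one possible shape for those pipeline interfaces. Again, the names are hypothetical and this is not an existing Arrow API:

    // Hypothetical sketch of the pipeline interfaces discussed above,
    // illustrative only, not an Arrow API.
    #include <memory>

    #include "arrow/array.h"
    #include "arrow/record_batch.h"
    #include "arrow/status.h"

    // Source operator: only produces record batches (e.g. a CSV reader).
    class SourceOperator {
     public:
      virtual ~SourceOperator() = default;
      // Sets *out to nullptr once the source is exhausted.
      virtual arrow::Status Next(
          std::shared_ptr<arrow::RecordBatch>* out) = 0;
    };

    // Intermediate operator: consumes one record batch and produces one
    // (e.g. a Gandiva project or filter).
    class BatchOperator {
     public:
      virtual ~BatchOperator() = default;
      virtual arrow::Status Process(
          const arrow::RecordBatch& in,
          std::shared_ptr<arrow::RecordBatch>* out) = 0;
    };

    // Terminal operator: consumes batches and emits the final output
    // when there is nothing left to consume (e.g. a sum kernel).
    class SinkOperator {
     public:
      virtual ~SinkOperator() = default;
      virtual arrow::Status Consume(const arrow::RecordBatch& in) = 0;
      virtual arrow::Status Finish(std::shared_ptr<arrow::Array>* out) = 0;
    };

A driver loop would pull batches from the source, push each one through the intermediate operators, and feed the terminal operator until Next() signals exhaustion.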