Hi,

I also agree that we should follow a model similar to what you propose. I
think the plan is, correct me if I'm wrong, Wes, to write the logical plan
operators, then write a small execution engine prototype and produce a
proper design document out of this experiment. There's also a placeholder
ticket: https://issues.apache.org/jira/browse/ARROW-4333

François

On Wed, Feb 13, 2019 at 9:54 AM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Ravindra,
>
>
> On Wed, Feb 13, 2019 at 1:34 AM Ravindra Pindikura <ravin...@dremio.com>
> wrote:
> >
> > Hi,
> >
> > I was looking at the recent check-in for Arrow kernels, and started to
> > think about how they would work alongside Gandiva.
> >
> > Here are my thoughts:
> >
> > 1. Gandiva already has two high-level operators, namely project and
> > filter, with runtime code generation.
> >
> > * It already supports hundreds of functions (e.g. a+b, a > b), which
> > can be combined into expressions (e.g. a+b > c && a+b < d) for each of
> > the operators, and we'll likely continue to add more of them. (A rough
> > example of building such an expression appears after this list.)
> > * It works on one record batch at a time: it consumes a record batch
> > and produces a record batch.
> > * The operators can be interlinked (e.g. project -> filter -> project)
> > to build a pipeline.
> > * We may build additional operators in the future that could benefit
> > from code generation (e.g. Impala uses code generation when parsing Avro
> > files).
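> >
> > As a concrete illustration, building and evaluating a simple a+b
> > projection with the Gandiva C++ API looks roughly like this (an
> > untested sketch; exact signatures may differ slightly), one record
> > batch at a time:
> >
> >   #include <arrow/memory_pool.h>
> >   #include <gandiva/projector.h>
> >   #include <gandiva/tree_expr_builder.h>
> >
> >   // Describe the input schema and the expression a + b.
> >   auto field_a = arrow::field("a", arrow::int32());
> >   auto field_b = arrow::field("b", arrow::int32());
> >   auto schema = arrow::schema({field_a, field_b});
> >
> >   auto node_a = gandiva::TreeExprBuilder::MakeField(field_a);
> >   auto node_b = gandiva::TreeExprBuilder::MakeField(field_b);
> >   auto sum = gandiva::TreeExprBuilder::MakeFunction(
> >       "add", {node_a, node_b}, arrow::int32());
> >   auto expr = gandiva::TreeExprBuilder::MakeExpression(
> >       sum, arrow::field("a_plus_b", arrow::int32()));
> >
> >   // Compile the expression once, then evaluate it per record batch.
> >   std::shared_ptr<gandiva::Projector> projector;
> >   auto status = gandiva::Projector::Make(schema, {expr}, &projector);
> >
> >   // record_batch is an incoming std::shared_ptr<arrow::RecordBatch>.
> >   arrow::ArrayVector outputs;
> >   status = projector->Evaluate(*record_batch,
> >                                arrow::default_memory_pool(), &outputs);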
> >
> > 2. Arrow Kernels
> >
> > a. Support for project/filter operators
> >
> > Useful for functions where there is no benefit from code generation, or
> > where code generation can be skipped (eager evaluation).
> >
> > b. Support for additional operators like aggregates
> >
> >
> > How do we combine and link the Gandiva operators and the kernels? For
> > example, it would be nice to have a pipeline with scan (read from a
> > source), project (expression on a column), filter (extract rows), and
> > aggregate (sum over the extracted column).
> >
>
> I'm just starting on a project now to enable logical expressions, formed
> from a superset of relational algebra, to be built with a C++ API. In
> the past, I have built a fairly deep system for this
> (https://github.com/ibis-project/ibis) that captures SQL semantics while
> also permitting other kinds of analytical functionality that spills
> outside of SQL. As soon as I cobble together a prototype I'll be
> interested in lots of feedback, since this is a system that we'll use
> for many years.
>
> My idea of the interplay with Gandiva is that expressions will be
> "lowered" to physical execution operators such as:
>
> * Projection (with/without filters)
> * Join
> * Hash aggregate (with/without filters)
> * Sort
> * Time series operations (resampling / binning, etc.)
>
> Gandiva therefore serves as a subgraph compiler for constituent parts
> of these operations. During the lowering process, operator subgraphs
> will need to be recognized as "able to be compiled with Gandiva". For
> example:
>
> * log(table.a + table.b) + 1 can be compiled with Gandiva
> * udf(table.a), where "udf" is a user-defined function written in
> Python, say, cannot
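>
> To make the recognition step concrete, here is a purely hypothetical
> sketch (none of these types exist in Arrow today; the registry lookup
> is a stand-in) of walking a logical expression tree and deciding
> whether the whole subtree can be handed to Gandiva:
>
>   #include <set>
>   #include <string>
>   #include <vector>
>
>   // Minimal stand-in for a logical expression node.
>   struct ExprNode {
>     bool is_python_udf = false;       // opaque user code
>     std::string function_name;        // e.g. "field", "add", "log"
>     std::vector<ExprNode> children;
>   };
>
>   bool SupportedByGandiva(const std::string& name) {
>     // In practice this would consult Gandiva's function registry.
>     static const std::set<std::string> supported = {"field", "literal",
>                                                     "add", "log"};
>     return supported.count(name) > 0;
>   }
>
>   bool CanCompileWithGandiva(const ExprNode& node) {
>     if (node.is_python_udf || !SupportedByGandiva(node.function_name)) {
>       return false;   // e.g. udf(table.a) falls out here
>     }
>     for (const auto& child : node.children) {
>       if (!CanCompileWithGandiva(child)) return false;
>     }
>     return true;      // e.g. log(table.a + table.b) + 1 passes
>   }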
>
> > To do this, I think we would need to be able to build a pipeline with
> > high-level operators that move data along one record batch at a time:
> > - source operators, which only produce record batches (e.g. a CSV reader)
> > - intermediate operators, which can both consume and produce record
> > batches (e.g. a Gandiva project operator)
> > - terminal operators, which emit the final output (from the end of the
> > pipeline) when there is nothing left to consume (e.g. SumKernel)
> >
> > Are we thinking along these lines ?
>
> Yes, that sounds right.
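>
> Something along those lines might look like this (purely illustrative,
> not a committed API; the real interface will come out of the design
> document):
>
>   #include <memory>
>   #include <arrow/record_batch.h>
>   #include <arrow/status.h>
>
>   // Hypothetical push-based operator. Sources push batches downstream,
>   // intermediate operators transform them, and terminal operators
>   // accumulate state and emit a result once the input is exhausted.
>   class Operator {
>    public:
>     virtual ~Operator() = default;
>     // Consume one input batch; intermediate operators set *out to the
>     // transformed batch, terminal operators (e.g. a sum) leave it null
>     // and update internal state instead.
>     virtual arrow::Status Consume(
>         std::shared_ptr<arrow::RecordBatch> batch,
>         std::shared_ptr<arrow::RecordBatch>* out) = 0;
>     // Called when there is nothing left to consume; terminal operators
>     // emit their final output here (e.g. a one-row batch with the sum).
>     virtual arrow::Status Finish(
>         std::shared_ptr<arrow::RecordBatch>* out) = 0;
>   };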
>
> We don't have a "dataset" framework yet for source operators (CSV,
> JSON, and Parquet will be the first ones). I plan to write a
> requirements document about this project, as we have the machinery in
> place to, e.g., read Parquet and CSV files, but without a unified
> abstract dataset API.
>
> - Wes
>
> >
> > Thanks & regards,
> > Ravindra.
>
