Congratulations on your work! And thanks for considering contributing it to Calcite!
I took a pass over your paper and have a question: It seems that for the concrete implementations of the merge operator and delta queries for joins/aggregation, the paper adopts some algorithms and ideas that were proposed in prior work. Do you have a list of sources for those algorithms? I assume those implementations can be customized, but some common ones could be offered within Calcite. If so, having a list might help people understand implementation details. (Or maybe I should just read the relevant references in the paper :-)) -Rui On Fri, Jan 1, 2021 at 11:05 AM Botong Huang <pku...@gmail.com> wrote: > Hi Julian, > > Thanks for your interest! Sure, let's figure out a plan that best benefits > the community. Here are some clarifications that hopefully answer your > questions. > > In our work (Tempura), users specify the set of time points to consider > running and a cost function that expresses their preference over time, and > Tempura generates the best incremental plan, the one that minimizes the overall > cost function. > > In this incremental plan, the sub-plans at different time points can be > different from each other, as opposed to the identical plans in all delta runs > as in streaming or IVM. As mentioned in §2.1 of the Tempura paper, we can > mimic the current streaming implementation by specifying two (logical) time > points in Tempura, representing the initial run and later delta runs > respectively. In general, note that Tempura supports various forms of > incremental computing, not only the small-delta append-only data model of > streaming systems. That's why we believe Tempura subsumes the current > streaming support, as well as any IVM implementations. > > About the cost model, we did not come up with a separate cost model, but > rather extended the existing one. Similar to multi-objective optimization, > costs incurred at different time points are considered different > dimensions. 
Tempura lets users supply a function that converts this cost > vector into a final cost. Under this function, any two incremental plans > are still comparable and there is an overall optimum. I guess we could go > down the route of multi-objective parametric query optimization instead if > there is a need. > > Next, on materialized views and multi-query optimization: since our > multi-time-point plan naturally involves materializing intermediate results > for later time points, we need to solve the problem of choosing > materializations, and to include the cost of saving and reusing the > materializations when costing and comparing plans. We borrowed multi-query > optimization techniques to solve this problem, even though we > are looking at a single query. As a result, we think our work is orthogonal > to Calcite's facilities for utilizing existing views, lattices, etc. We do > feel that the multi-query optimization component could be adapted for wider > use, but we probably need more suggestions from the community. > > Lastly, our current implementation is set up in Java code; it should be > straightforward to hook it up with a SQL shell. > > Thanks, > Botong > > On Mon, Dec 28, 2020 at 6:44 PM Julian Hyde <jhyde.apa...@gmail.com> > wrote: > > > Botong, > > > > This is very exciting; congratulations on this research, and thank you > for > > contributing it back to Calcite. > > > > The research touches several areas in Calcite: streaming, materialized > > view maintenance, and multi-query optimization. As we already have some > > solutions in those areas (Sigma and Delta relational operators, lattice, > > and Spool operator), it will be interesting to see whether we can make > them > > compatible, or whether one concept can subsume the others. > > > > Your work differs from streaming queries in that your relations are used > > by “external” user queries, whereas in pure streaming queries, the only > > activity is the change propagation. 
Did you find that you needed two > > separate cost models - one for “view maintenance” and another for “user > > queries” - since the objectives of each activity are so different? > > > > I wonder whether this work will hasten the arrival of multi-objective > > parametric query optimization [1] in Calcite. > > > > I will make time over the next few days to read and digest your paper. > > Then I expect that we will have a back-and-forth process to create > > something that will be useful for the broader community. > > > > One thing will be particularly useful: making this functionality available > > from a SQL shell, so that people can experiment with it > > without writing Java code or setting up complex databases and metadata. I > > have in mind something like the simple DDL operations that are available in > > Calcite’s ’server’ module. I wonder whether we could devise some kind of > > SQL syntax for a “multi-query”. > > > > Julian > > > > [1] > > > https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext > > > > > > > > > On Dec 23, 2020, at 8:55 PM, Botong Huang <pku...@gmail.com> wrote: > > > > > > Thanks Aron for pointing this out. To see the figure, please refer to Fig > > > 3(a) in our paper: > > https://kai-zeng.github.io/papers/tempura-vldb2021.pdf > > > > > > Best, > > > Botong > > > > > > On Wed, Dec 23, 2020 at 7:20 PM JiaTao Tao <taojia...@gmail.com> > wrote: > > > > > >> Seems interesting. The picture cannot be seen in the mail; could you open a > JIRA > > >> for this, so that people who are interested can subscribe to it? > > >> > > >> > > >> Regards! 
> > >> > > >> Aron Tao > > >> > > >> > > >> Botong Huang <bot...@apache.org> wrote on Thursday, Dec 24, 2020 at 3:18 AM: > > >> > > >>> Hi all, > > >>> > > >>> This is a proposal to extend the Calcite optimizer into a general > > >>> incremental query optimizer, based on our research paper published in > > >> VLDB > > >>> 2021: > > >>> Tempura: a general cost-based optimizer framework for incremental data > > >>> processing > > >>> > > >>> We also have a demo in SIGMOD 2020 illustrating how Alibaba’s data > > >>> warehouse is planning to use this incremental query optimizer to > > >> alleviate > > >>> cluster-wise resource skewness: > > >>> Grosbeak: A Data Warehouse Supporting Resource-Aware Incremental > > >> Computing > > >>> > > >>> To the best of our knowledge, this is the first general cost-based incremental > > >>> optimizer that can find the best plan across multiple families of > > >>> incremental computing methods, including IVM, Streaming, DBToaster, > > etc. > > >>> Experiments (in the paper) show that the generated best plan is > > >>> consistently much better than the plans from each individual method > > >> alone. > > >>> > > >>> In general, incremental query planning is central to database view > > >>> maintenance and stream processing systems, and is being adopted in > > >> active > > >>> databases, resumable query execution, approximate query processing, > > etc. > > >> We > > >>> hope that this feature can help widen the spectrum of Calcite > > >>> and solicit more use cases and adoption. > > >>> > > >>> Below is a brief description of the technical details. Please refer to > > >> the > > >>> Tempura paper for more details. We are also working on a journal > > version > > >> of > > >>> the paper with more implementation details. > > >>> > > >>> Currently, the query plan generated by Calcite is meant to be executed > > >>> altogether at once. 
In this proposal, Calcite’s memo will be extended > > with > > >>> temporal information so that it is capable of generating incremental > > >> plans > > >>> that include multiple sub-plans to execute at different time points. > > >>> > > >>> The main idea is to view each table as one that changes over time (a > > >>> Time-Varying Relation, or TVR). To achieve this, we introduced TvrMetaSet into > > >>> Calcite’s memo, alongside RelSet and RelSubset, to track the related RelSets > > of a > > >>> changing table (e.g. the snapshot of the table at a certain time, the delta of > > the > > >>> table between two time points, etc.). > > >>> > > >>> [image: image.png] > > >>> > > >>> For example, in the figure above, each vertical line is a TvrMetaSet > > >>> representing a TVR (S, R, S left outer join R, etc.). Horizontal lines > > >>> represent time. Each black dot in the grid is a RelSet. Users can write > > >> TVR > > >>> rewrite rules to describe valid transformations between these dots. For > > >>> example, the blue lines are inter-TVR rules that describe how to > > compute > > >>> a certain RelSet of a TVR from the RelSets of other TVRs. The red lines are > > >>> intra-TVR rules that describe transformations within a TVR. All TVR > > >> rewrite > > >>> rules are logical rules. All existing Calcite rules still work in the > > new > > >>> volcano system without modification. > > >>> > > >>> The changes for this feature consist of four parts: > > >>> 1. Memo extension with TvrMetaSet > > >>> 2. Rule engine upgrade, capable of matching TvrMetaSets and RelNodes, as > > >>> well as the links between them. > > >>> 3. A basic set of TvrRules, written using the upgraded rule engine API. > > >>> 4. Multi-query optimization, used to find the best incremental plan > > >>> involving multiple time points. > > >>> > > >>> Note that this feature is an extension in nature, and thus, when > > disabled, > > >>> it does not change any existing Calcite behavior. 
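[Editor's illustration] To make the TvrMetaSet idea in the proposal above concrete, here is a toy Java sketch. All names other than TvrMetaSet and RelSet are hypothetical stand-ins invented for illustration; the actual Calcite memo extension is far more involved than this.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy sketch of a TvrMetaSet: one time-varying relation (TVR) that tracks
 * the RelSets representing its snapshots and deltas at different times.
 * RelSetRef is a stand-in for Calcite's RelSet; TvrKey identifies one
 * "dot" in the time grid described in the proposal.
 */
public class TvrSketch {

  /** One dot in the grid: a snapshot at time t, or a delta over (from, to]. */
  record TvrKey(String kind, int fromTime, int toTime) {}

  /** Stand-in for Calcite's RelSet; here it is just a label. */
  record RelSetRef(String label) {}

  /** A TVR and the RelSets known for it so far. */
  static class TvrMetaSet {
    final String name;
    private final Map<TvrKey, RelSetRef> relSets = new HashMap<>();

    TvrMetaSet(String name) {
      this.name = name;
    }

    /** Intra-TVR link: the snapshot of this TVR at a given time point. */
    void addSnapshot(int time, RelSetRef set) {
      relSets.put(new TvrKey("snapshot", time, time), set);
    }

    /** Intra-TVR link: the delta of this TVR between two time points. */
    void addDelta(int from, int to, RelSetRef set) {
      relSets.put(new TvrKey("delta", from, to), set);
    }

    RelSetRef snapshot(int time) {
      return relSets.get(new TvrKey("snapshot", time, time));
    }

    RelSetRef delta(int from, int to) {
      return relSets.get(new TvrKey("delta", from, to));
    }
  }

  public static void main(String[] args) {
    // Table S changes over time: snapshot at t0, delta over (t0, t1], snapshot at t1.
    TvrMetaSet s = new TvrMetaSet("S");
    s.addSnapshot(0, new RelSetRef("Scan(S)@t0"));
    s.addDelta(0, 1, new RelSetRef("DeltaScan(S)@(t0,t1]"));
    s.addSnapshot(1, new RelSetRef("Scan(S)@t1"));

    // An intra-TVR rewrite rule would relate these dots, e.g.
    // snapshot@t1 = snapshot@t0 union delta over (t0, t1].
    System.out.println(s.name + " snapshot@t1: " + s.snapshot(1).label());
  }
}
```

In this reading, the intra-TVR rules of the proposal link entries within one such map, while inter-TVR rules link entries across different TvrMetaSet instances (e.g. deriving a delta of a join from the snapshots and deltas of its inputs).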
> > >>> > > >>> Beyond the scenarios in the paper, we have also applied this > > Calcite-extended > > >>> incremental query optimizer to a type of periodic query called the > > >> “range > > >>> query” in Alibaba’s data warehouse. It achieved cost savings of 80% on > > >>> total CPU and memory consumption, and 60% on end-to-end execution time. > > >>> > > >>> All comments and suggestions are welcome. Thanks, and happy holidays! > > >>> > > >>> Best, > > >>> Botong > > >>> > > >> > > > > >
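[Editor's illustration] The cost-model discussion earlier in the thread (costs at different time points as separate dimensions, folded into one final cost by a user-supplied function) can be sketched roughly as follows. All names here are hypothetical and simplified; this is not the actual Tempura API.

```java
import java.util.function.ToDoubleFunction;

/**
 * Sketch of the extended cost model discussed in the thread: the cost of an
 * incremental plan is a vector with one entry per time point, and a
 * user-supplied function folds that vector into a single comparable number,
 * so any two plans remain comparable and an overall optimum exists.
 */
public class TvrCostSketch {

  /** Cost of one incremental plan: one entry per optimization time point. */
  record CostVector(double[] perTimeCost) {}

  /** Pick the cheaper of two plans under the user's cost-combination function. */
  static CostVector cheaper(CostVector a, CostVector b,
      ToDoubleFunction<CostVector> combine) {
    return combine.applyAsDouble(a) <= combine.applyAsDouble(b) ? a : b;
  }

  public static void main(String[] args) {
    // Plan A does most of its work early; plan B defers work to the delta run.
    CostVector planA = new CostVector(new double[] {100.0, 10.0});
    CostVector planB = new CostVector(new double[] {30.0, 50.0});

    // Example preference: off-peak (t0) resources are cheap, peak (t1) costs 3x.
    ToDoubleFunction<CostVector> offPeakFriendly =
        v -> v.perTimeCost()[0] + 3.0 * v.perTimeCost()[1];

    // planA: 100 + 3*10 = 130; planB: 30 + 3*50 = 180, so planA wins here.
    CostVector best = cheaper(planA, planB, offPeakFriendly);
    System.out.println("best plan's cost at t1: " + best.perTimeCost()[1]);
  }
}
```

This is also why, as Botong notes, a full multi-objective parametric optimizer is not strictly needed: once the combination function is fixed, plan costs are totally ordered and the usual single-objective search applies.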