Hi Rui,

Thanks a lot for the interest! Here are some explanations that will
hopefully answer your questions.

-- For concrete implementations for merge operator and delta query for
joins/aggregation, the paper adopts algorithms proposed in prior works.

You are correct: many prior works have proposed various ways of doing
incremental computation. For the relational operators, the standard ways to
implement delta queries are well studied in the literature.
We don't have a comprehensive list of rules written out, but the following
are the most relevant papers.
For joins and common relational operators: "Algebraic change propagation
for semijoin and outerjoin queries"
https://dl.acm.org/doi/10.1145/290593.290597
For aggregations: "Incremental maintenance of views with duplicates"
https://dl.acm.org/doi/10.1145/223784.223849
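To give a flavor of what the delta rules in those papers compute, here is a
small illustrative sketch (hypothetical Python for explanation only, not
Calcite's actual rule API): the classic change-propagation identity for an
inner join assembles the delta of R join S from the old inputs and their
per-run deltas.

```python
# Illustrative sketch only -- not Calcite's rule API. Shows the classic
# algebraic delta rule for an inner join on a key:
#   delta(R join S) = (dR join S_old) + (R_old join dS) + (dR join dS)

def join(r, s):
    """Naive inner join of (key, payload) tuples on the key."""
    return [(k1, a, b) for (k1, a) in r for (k2, b) in s if k1 == k2]

def delta_join(r_old, dr, s_old, ds):
    """Delta of R join S, given the old states and the input deltas."""
    return join(dr, s_old) + join(r_old, ds) + join(dr, ds)

r_old = [(1, "a")]
dr    = [(2, "b")]          # rows appended to R in this run
s_old = [(1, "x"), (2, "y")]
ds    = [(3, "z")]          # rows appended to S in this run

# Full recomputation on the new states...
full_new = join(r_old + dr, s_old + ds)
# ...equals the old result plus the delta (append-only case).
incremental = join(r_old, s_old) + delta_join(r_old, dr, s_old, ds)
assert sorted(full_new) == sorted(incremental)
```

The aggregation case in the second paper follows an analogous pattern, but
maintains per-group counts so that retractions can be handled as well.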

-- Those implementations can be customized but some common ones can be
offered within Calcite.

Exactly. We have implemented the standard delta computation algorithms for
most operators, and they support both append-only and retractable deltas.
These are rewriting rules that we can offer within Calcite.
Relating to the question above, we believe our rule implementations of
these standard algorithms are straightforward, and the documentation can
help developers understand them.

Depending on their needs, developers can customize the rules by building
on top of the existing implementations, or implement new delta rules using
an API very similar to the existing one.
Moreover, we have also implemented more advanced algorithms, such as
outer-join view maintenance and higher-order view maintenance, which
developers can enable or disable freely.
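To make "retractable delta" concrete, here is a minimal sketch (purely
illustrative; the names are hypothetical and this is not our Calcite
implementation) of a merge operator that applies a delta of signed row
multiplicities to a snapshot, in the spirit of counting-based view
maintenance:

```python
# Hypothetical sketch of a merge operator over retractable deltas,
# modeled as multisets of rows with signed multiplicities. An insertion
# carries count +1, a retraction carries count -1.
from collections import Counter

def merge(snapshot, delta):
    """Apply a retractable delta (row -> signed count) to a snapshot."""
    out = Counter(snapshot)
    out.update(delta)  # Counter.update adds counts, including negatives
    # Rows whose multiplicity drops to zero disappear from the snapshot.
    return Counter({row: n for row, n in out.items() if n > 0})

snap = Counter({("alice", 10): 1, ("bob", 7): 1})

# A retraction of bob's old row plus an insertion of his updated row:
delta = Counter()
delta[("bob", 7)] = -1
delta[("bob", 9)] = +1

new_snap = merge(snap, delta)
assert new_snap == Counter({("alice", 10): 1, ("bob", 9): 1})
```

An append-only delta is simply the special case where every count in the
delta is positive.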

Best Regards,
Zuozhi Wang

On Mon, Jan 11, 2021 at 11:29 PM Rui Wang <amaliu...@apache.org> wrote:

> Congratulations on your work! And thanks for considering contributing it
> to Calcite!
>
> I took a pass over your paper and have a question:
>
> It seems that, for the concrete implementations of the merge operator and
> the delta queries for joins/aggregations, the paper adopts algorithms and
> ideas that were proposed in prior works. Do you have a list of sources for
> those algorithms? I assume those implementations can be customized, but
> some common ones can be offered within Calcite. If so, having a list might
> help people understand the implementation details. (Or maybe I should just
> read the relevant references in the paper :-))
>
>
>
> -Rui
>
> On Fri, Jan 1, 2021 at 11:05 AM Botong Huang <pku...@gmail.com> wrote:
>
>> Hi Julian,
>>
>> Thanks for your interest! Sure, let's figure out a plan that best
>> benefits the community. Here are some clarifications that hopefully
>> answer your questions.
>>
>> In our work (Tempura), users specify the set of time points at which to
>> consider running, and a cost function that expresses their preferences
>> over time; Tempura then generates the best incremental plan, i.e. the
>> one that minimizes the overall cost function.
>>
>> In this incremental plan, the sub-plans at different time points can
>> differ from each other, as opposed to the identical plans used in all
>> delta runs in streaming or IVM. As mentioned in §2.1 of the Tempura
>> paper, we can mimic the current streaming implementation by specifying
>> two (logical) time points in Tempura, representing the initial run and
>> the later delta runs respectively. In general, note that Tempura
>> supports various forms of incremental computing, not only the
>> small-delta, append-only data model of streaming systems. That is why we
>> believe Tempura subsumes the current streaming support, as well as any
>> IVM implementations.
>>
>> About the cost model: we did not come up with a separate cost model, but
>> rather extended the existing one. Similar to multi-objective
>> optimization, costs incurred at different time points are treated as
>> different dimensions. Tempura lets users supply a function that converts
>> this cost vector into a final cost. Under this function, any two
>> incremental plans are still comparable, and there is an overall optimum.
>> I guess we could go down the route of multi-objective parametric query
>> optimization instead if there is a need.
>>
>> Next, on materialized views and multi-query optimization: since our
>> multi-time-point plan naturally involves materializing intermediate
>> results for later time points, we need to solve the problem of choosing
>> materializations, and to include the cost of saving and reusing the
>> materializations when costing and comparing plans. We borrowed
>> multi-query optimization techniques to solve this problem even though we
>> are looking at a single query. As a result, we think our work is
>> orthogonal to Calcite's facilities for utilizing existing views,
>> lattices, etc. We do feel that the multi-query optimization component
>> could be adapted for wider use, but we probably need more suggestions
>> from the community.
>>
>> Lastly, our current implementation is set up in Java code; it should be
>> straightforward to hook it up with a SQL shell.
>>
>> Thanks,
>> Botong
>>
>> On Mon, Dec 28, 2020 at 6:44 PM Julian Hyde <jhyde.apa...@gmail.com>
>> wrote:
>>
>> > Botong,
>> >
>> > This is very exciting; congratulations on this research, and thank you
>> for
>> > contributing it back to Calcite.
>> >
>> > The research touches several areas in Calcite: streaming, materialized
>> > view maintenance, and multi-query optimization. As we already have some
>> > solutions in those areas (Sigma and Delta relational operators,
>> > lattice, and Spool operator), it will be interesting to see whether we
>> > can make them compatible, or whether one concept can subsume the
>> > others.
>> >
>> > Your work differs from streaming queries in that your relations are used
>> > by “external” user queries, whereas in pure streaming queries, the only
>> > activity is the change propagation. Did you find that you needed two
>> > separate cost models - one for “view maintenance” and another for “user
>> > queries” - since the objectives of each activity are so different?
>> >
>> > I wonder whether this work will hasten the arrival of multi-objective
>> > parametric query optimization [1] in Calcite.
>> >
>> > I will make time over the next few days to read and digest your paper.
>> > Then I expect that we will have a back-and-forth process to create
>> > something that will be useful for the broader community.
>> >
>> > One thing will be particularly useful: making this functionality
>> > available from a SQL shell, so that people can experiment with this
>> > functionality without writing Java code or setting up complex databases
>> > and metadata. I have in mind something like the simple DDL operations
>> > that are available in Calcite’s ’server’ module. I wonder whether we
>> > could devise some kind of SQL syntax for a “multi-query”.
>> >
>> > Julian
>> >
>> > [1]
>> >
>> https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext
>> >
>> >
>> >
>> > > On Dec 23, 2020, at 8:55 PM, Botong Huang <pku...@gmail.com> wrote:
>> > >
>> > > Thanks Aron for pointing this out. To see the figure, please refer to
>> Fig
>> > > 3(a) in our paper:
>> > https://kai-zeng.github.io/papers/tempura-vldb2021.pdf
>> > >
>> > > Best,
>> > > Botong
>> > >
>> > > On Wed, Dec 23, 2020 at 7:20 PM JiaTao Tao <taojia...@gmail.com>
>> wrote:
>> > >
>> > >> Seems interesting. The picture cannot be seen in the mail; could
>> > >> you open a JIRA for this, so that people who are interested can
>> > >> subscribe to it?
>> > >>
>> > >>
>> > >> Regards!
>> > >>
>> > >> Aron Tao
>> > >>
>> > >>
>> > >> Botong Huang <bot...@apache.org> wrote on Thu, Dec 24, 2020 at 3:18 AM:
>> > >>
>> > >>> Hi all,
>> > >>>
>> > >>> This is a proposal to extend the Calcite optimizer into a general
>> > >>> incremental query optimizer, based on our research paper published
>> > >>> in VLDB 2021:
>> > >>> Tempura: a general cost-based optimizer framework for incremental
>> > >>> data processing
>> > >>>
>> > >>> We also have a demo in SIGMOD 2020 illustrating how Alibaba’s data
>> > >>> warehouse is planning to use this incremental query optimizer to
>> > >>> alleviate cluster-wise resource skewness:
>> > >>> Grosbeak: A Data Warehouse Supporting Resource-Aware Incremental
>> > >>> Computing
>> > >>>
>> > >>> To the best of our knowledge, this is the first general cost-based
>> > >>> incremental optimizer that can find the best plan across multiple
>> > >>> families of incremental computing methods, including IVM,
>> > >>> streaming, DBToaster, etc. Experiments (in the paper) show that the
>> > >>> generated best plans are consistently much better than the plans
>> > >>> from each individual method alone.
>> > >>>
>> > >>> In general, incremental query planning is central to database view
>> > >>> maintenance and stream processing systems, and is being adopted in
>> > >>> active databases, resumable query execution, approximate query
>> > >>> processing, etc. We hope that this feature can help widen the
>> > >>> spectrum of Calcite, and solicit more use cases and adoption of
>> > >>> Calcite.
>> > >>>
>> > >>> Below is a brief description of the technical details. Please
>> > >>> refer to the Tempura paper for more details. We are also working on
>> > >>> a journal version of the paper with more implementation details.
>> > >>>
>> > >>> Currently, a query plan generated by Calcite is meant to be
>> > >>> executed all at once. In this proposal, Calcite’s memo will be
>> > >>> extended with temporal information so that it is capable of
>> > >>> generating incremental plans that include multiple sub-plans to
>> > >>> execute at different time points.
>> > >>>
>> > >>> The main idea is to view each table as one that changes over time,
>> > >>> i.e. a Time-Varying Relation (TVR). To achieve that, we introduced
>> > >>> TvrMetaSet into Calcite’s memo, alongside RelSet and RelSubset, to
>> > >>> track the related RelSets of a changing table (e.g. the snapshot of
>> > >>> the table at a certain time, the delta of the table between two
>> > >>> time points, etc.).
>> > >>>
>> > >>> [image: image.png]
>> > >>>
>> > >>> For example, in the above figure, each vertical line is a
>> > >>> TvrMetaSet representing a TVR (S, R, S left outer join R, etc.).
>> > >>> Horizontal lines represent time. Each black dot in the grid is a
>> > >>> RelSet. Users can write TVR rewrite rules to describe valid
>> > >>> transformations between these dots. For example, the blue lines are
>> > >>> inter-TVR rules that describe how to compute a certain RelSet of a
>> > >>> TVR from the RelSets of other TVRs. The red lines are intra-TVR
>> > >>> rules that describe transformations within a TVR. All TVR rewrite
>> > >>> rules are logical rules. All existing Calcite rules still work in
>> > >>> the new volcano system without modification.
>> > >>>
>> > >>> All changes in this feature will consist of four parts:
>> > >>> 1. Memo extension with TvrMetaSet.
>> > >>> 2. Rule engine upgrade, capable of matching TvrMetaSets and
>> > >>> RelNodes, as well as the links between these nodes.
>> > >>> 3. A basic set of TvrRules, written using the upgraded rule engine
>> > >>> API.
>> > >>> 4. Multi-query optimization, used to find the best incremental plan
>> > >>> involving multiple time points.
>> > >>>
>> > >>> Note that this feature is an extension in nature, and thus, when
>> > >>> disabled, it does not change any existing Calcite behavior.
>> > >>>
>> > >>> Other than the scenarios in the paper, we also applied this
>> > >>> Calcite-extended incremental query optimizer to a type of periodic
>> > >>> query called the "range query" in Alibaba’s data warehouse. It
>> > >>> achieved cost savings of 80% on total CPU and memory consumption,
>> > >>> and 60% on end-to-end execution time.
>> > >>>
>> > >>> All comments and suggestions are welcome. Thanks and happy holidays!
>> > >>>
>> > >>> Best,
>> > >>> Botong
>> > >>>
>> > >>
>> >
>> >
>>
>
