Congratulations on your work! And thanks for considering contributing it to Calcite!
I took a pass over your paper and have a question: It seems that for the concrete implementations of the merge operator and delta queries for joins/aggregation, the paper adopts some algorithms and ideas that were proposed in prior work. Do you have a list of sources for those algorithms? I assume those implementations can be customized, but some common ones could be offered within Calcite. If so, having a list might help people understand implementation details. (Or maybe I should just read the relevant references in the paper :-)) -Rui On Fri, Jan 1, 2021 at 11:05 AM Botong Huang <pku...@gmail.com> wrote: > Hi Julian, > > Thanks for your interest! Sure, let's figure out a plan that best benefits > the community. Here are some clarifications that hopefully answer your > questions. > > In our work (Tempura), users specify the set of time points to consider > running and a cost function that expresses their preference over time, and > Tempura generates the best incremental plan, the one that minimizes the overall > cost function. > > In this incremental plan, the sub-plans at different time points can be > different from each other, as opposed to the identical plans in all delta runs > as in streaming or IVM. As mentioned in §2.1 of the Tempura paper, we can > mimic the current streaming implementation by specifying two (logical) time > points in Tempura, representing the initial run and later delta runs > respectively. In general, note that Tempura supports various forms of > incremental computing, not only the small-delta append-only data model of > streaming systems. That's why we believe Tempura subsumes the current > streaming support, as well as any IVM implementations. > > About the cost model, we did not come up with a separate cost model, but > rather extended the existing one. Similar to multi-objective optimization, > costs incurred at different time points are considered different > dimensions. 
Tempura lets users supply a function that converts this cost > vector into a final cost. Under this function, any two incremental plans > are still comparable and there is an overall optimum. I guess we could go > down the route of multi-objective parametric query optimization instead if > there is a need. > > Next, on materialized views and multi-query optimization: since our > multi-time-point plan naturally involves materializing intermediate results > for later time points, we need to solve the problem of choosing > materializations, and to include the cost of saving and reusing the > materializations when costing and comparing plans. We borrowed multi-query > optimization techniques to solve this problem, even though we > are looking at a single query. As a result, we think our work is orthogonal > to Calcite's facilities for utilizing existing views, lattices, etc. We do > feel that the multi-query optimization component could be adapted for wider > use, but we probably need more suggestions from the community. > > Lastly, our current implementation is set up in Java code; it should be > straightforward to hook it up with a SQL shell. > > Thanks, > Botong > > On Mon, Dec 28, 2020 at 6:44 PM Julian Hyde <jhyde.apa...@gmail.com> > wrote: > > > Botong, > > > > This is very exciting; congratulations on this research, and thank you > for > > contributing it back to Calcite. > > > > The research touches several areas in Calcite: streaming, materialized > > view maintenance, and multi-query optimization. As we already have some > > solutions in those areas (Sigma and Delta relational operators, lattice, > > and Spool operator), it will be interesting to see whether we can make > them > > compatible, or whether one concept can subsume the others. > > > > Your work differs from streaming queries in that your relations are used > > by “external” user queries, whereas in pure streaming queries, the only > > activity is the change propagation. 
Did you find that you needed two > > separate cost models - one for “view maintenance” and another for “user > > queries” - since the objectives of each activity are so different? > > > > I wonder whether this work will hasten the arrival of multi-objective > > parametric query optimization [1] in Calcite. > > > > I will make time over the next few days to read and digest your paper. > > Then I expect that we will have a back-and-forth process to create > > something that will be useful for the broader community. > > > > One thing will be particularly useful: making this functionality available > > from a SQL shell, so that people can experiment with it > > without writing Java code or setting up complex databases and metadata. I > > have in mind something like the simple DDL operations that are available in > > Calcite’s ’server’ module. I wonder whether we could devise some kind of > > SQL syntax for a “multi-query”. > > > > Julian > > > > [1] > > > https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext > > > > > > > > > On Dec 23, 2020, at 8:55 PM, Botong Huang <pku...@gmail.com> wrote: > > > > > > Thanks Aron for pointing this out. To see the figure, please refer to Fig > > > 3(a) in our paper: > > https://kai-zeng.github.io/papers/tempura-vldb2021.pdf > > > > > > Best, > > > Botong > > > > > > On Wed, Dec 23, 2020 at 7:20 PM JiaTao Tao <taojia...@gmail.com> > wrote: > > > > > >> Seems interesting. The picture cannot be seen in the mail; could you open a > JIRA > > >> for this, so that people who are interested can subscribe to it? > > >> > > >> > > >> Regards! 
> > >> > > >> Aron Tao > > >> > > >> > > >> Botong Huang <bot...@apache.org> wrote on Thursday, Dec 24, 2020 at 3:18 AM: > > >> > > >>> Hi all, > > >>> > > >>> This is a proposal to extend the Calcite optimizer into a general > > >>> incremental query optimizer, based on our research paper published in > > >> VLDB > > >>> 2021: > > >>> Tempura: a general cost-based optimizer framework for incremental data > > >>> processing > > >>> > > >>> We also have a demo in SIGMOD 2020 illustrating how Alibaba’s data > > >>> warehouse is planning to use this incremental query optimizer to > > >> alleviate > > >>> cluster-wise resource skewness: > > >>> Grosbeak: A Data Warehouse Supporting Resource-Aware Incremental > > >> Computing > > >>> > > >>> To the best of our knowledge, this is the first general cost-based incremental > > >>> optimizer that can find the best plan across multiple families of > > >>> incremental computing methods, including IVM, Streaming, DBToaster, > > etc. > > >>> Experiments (in the paper) show that the generated best plan is > > >>> consistently much better than the plans from each individual method > > >> alone. > > >>> > > >>> In general, incremental query planning is central to database view > > >>> maintenance and stream processing systems, and is being adopted in > > >> active > > >>> databases, resumable query execution, approximate query processing, > > etc. > > >> We > > >>> hope that this feature can help widen the spectrum of Calcite > > >>> and solicit more use cases and adoption. > > >>> > > >>> Below is a brief description of the technical details. Please refer to > > >> the > > >>> Tempura paper for more details. We are also working on a journal > > version > > >> of > > >>> the paper with more implementation details. > > >>> > > >>> Currently, the query plan generated by Calcite is meant to be executed > > >>> altogether at once. 
In this proposal, Calcite’s memo will be extended > > with > > >>> temporal information so that it is capable of generating incremental > > >> plans > > >>> that include multiple sub-plans to execute at different time points. > > >>> > > >>> The main idea is to view each table as one that changes over time (a > > >>> Time-Varying Relation, or TVR). To achieve this, we introduced TvrMetaSet into > > >>> Calcite’s memo, alongside RelSet and RelSubset, to track the related RelSets > > of a > > >>> changing table (e.g. the snapshot of the table at a certain time, the delta of > > the > > >>> table between two time points, etc.). > > >>> > > >>> [image: image.png] > > >>> > > >>> For example, in the figure above, each vertical line is a TvrMetaSet > > >>> representing a TVR (S, R, S left outer join R, etc.). Horizontal lines > > >>> represent time. Each black dot in the grid is a RelSet. Users can write > > >> TVR > > >>> rewrite rules to describe valid transformations between these dots. For > > >>> example, the blue lines are inter-TVR rules that describe how to > > compute > > >>> a certain RelSet of a TVR from the RelSets of other TVRs. The red lines are > > >>> intra-TVR rules that describe transformations within a TVR. All TVR > > >> rewrite > > >>> rules are logical rules. All existing Calcite rules still work in the > > new > > >>> volcano system without modification. > > >>> > > >>> The changes for this feature consist of four parts: > > >>> 1. Memo extension with TvrMetaSet > > >>> 2. Rule engine upgrade, capable of matching TvrMetaSets and RelNodes, as > > >>> well as the links between them. > > >>> 3. A basic set of TvrRules, written using the upgraded rule engine API. > > >>> 4. Multi-query optimization, used to find the best incremental plan > > >>> involving multiple time points. > > >>> > > >>> Note that this feature is an extension in nature, and thus, when > > disabled, > > >>> it does not change any existing Calcite behavior. 
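[Editor's illustration] To make the TvrMetaSet idea in the proposal above concrete, here is a toy Java sketch. All names other than TvrMetaSet and RelSet are hypothetical stand-ins invented for illustration; the actual Calcite memo extension is far more involved than this.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy sketch of a TvrMetaSet: one time-varying relation (TVR) that tracks
 * the RelSets representing its snapshots and deltas at different times.
 * RelSetRef is a stand-in for Calcite's RelSet; TvrKey identifies one
 * "dot" in the time grid described in the proposal.
 */
public class TvrSketch {

  /** One dot in the grid: a snapshot at time t, or a delta over (from, to]. */
  record TvrKey(String kind, int fromTime, int toTime) {}

  /** Stand-in for Calcite's RelSet; here it is just a label. */
  record RelSetRef(String label) {}

  /** A TVR and the RelSets known for it so far. */
  static class TvrMetaSet {
    final String name;
    private final Map<TvrKey, RelSetRef> relSets = new HashMap<>();

    TvrMetaSet(String name) {
      this.name = name;
    }

    /** Intra-TVR link: the snapshot of this TVR at a given time point. */
    void addSnapshot(int time, RelSetRef set) {
      relSets.put(new TvrKey("snapshot", time, time), set);
    }

    /** Intra-TVR link: the delta of this TVR between two time points. */
    void addDelta(int from, int to, RelSetRef set) {
      relSets.put(new TvrKey("delta", from, to), set);
    }

    RelSetRef snapshot(int time) {
      return relSets.get(new TvrKey("snapshot", time, time));
    }

    RelSetRef delta(int from, int to) {
      return relSets.get(new TvrKey("delta", from, to));
    }
  }

  public static void main(String[] args) {
    // Table S changes over time: snapshot at t0, delta over (t0, t1], snapshot at t1.
    TvrMetaSet s = new TvrMetaSet("S");
    s.addSnapshot(0, new RelSetRef("Scan(S)@t0"));
    s.addDelta(0, 1, new RelSetRef("DeltaScan(S)@(t0,t1]"));
    s.addSnapshot(1, new RelSetRef("Scan(S)@t1"));

    // An intra-TVR rewrite rule would relate these dots, e.g.
    // snapshot@t1 = snapshot@t0 union delta over (t0, t1].
    System.out.println(s.name + " snapshot@t1: " + s.snapshot(1).label());
  }
}
```

In this reading, the intra-TVR rules of the proposal link entries within one such map, while inter-TVR rules link entries across different TvrMetaSet instances (e.g. deriving a delta of a join from the snapshots and deltas of its inputs).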
> > >>> > > >>> Beyond the scenarios in the paper, we have also applied this > > Calcite-extended > > >>> incremental query optimizer to a type of periodic query called the > > >> “range > > >>> query” in Alibaba’s data warehouse. It achieved cost savings of 80% on > > >>> total CPU and memory consumption, and 60% on end-to-end execution time. > > >>> > > >>> All comments and suggestions are welcome. Thanks, and happy holidays! > > >>> > > >>> Best, > > >>> Botong > > >>> > > >> > > > > >
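[Editor's illustration] The cost-model discussion earlier in the thread (costs at different time points as separate dimensions, folded into one final cost by a user-supplied function) can be sketched roughly as follows. All names here are hypothetical and simplified; this is not the actual Tempura API.

```java
import java.util.function.ToDoubleFunction;

/**
 * Sketch of the extended cost model discussed in the thread: the cost of an
 * incremental plan is a vector with one entry per time point, and a
 * user-supplied function folds that vector into a single comparable number,
 * so any two plans remain comparable and an overall optimum exists.
 */
public class TvrCostSketch {

  /** Cost of one incremental plan: one entry per optimization time point. */
  record CostVector(double[] perTimeCost) {}

  /** Pick the cheaper of two plans under the user's cost-combination function. */
  static CostVector cheaper(CostVector a, CostVector b,
      ToDoubleFunction<CostVector> combine) {
    return combine.applyAsDouble(a) <= combine.applyAsDouble(b) ? a : b;
  }

  public static void main(String[] args) {
    // Plan A does most of its work early; plan B defers work to the delta run.
    CostVector planA = new CostVector(new double[] {100.0, 10.0});
    CostVector planB = new CostVector(new double[] {30.0, 50.0});

    // Example preference: off-peak (t0) resources are cheap, peak (t1) costs 3x.
    ToDoubleFunction<CostVector> offPeakFriendly =
        v -> v.perTimeCost()[0] + 3.0 * v.perTimeCost()[1];

    // planA: 100 + 3*10 = 130; planB: 30 + 3*50 = 180, so planA wins here.
    CostVector best = cheaper(planA, planB, offPeakFriendly);
    System.out.println("best plan's cost at t1: " + best.perTimeCost()[1]);
  }
}
```

This is also why, as Botong notes, a full multi-objective parametric optimizer is not strictly needed: once the combination function is fixed, plan costs are totally ordered and the usual single-objective search applies.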