Re: Proposal to extend Calcite into a incremental query optimizer

2021-01-28 Thread Julian Hyde
PS The “editable doc” that Rui refers to is also a good idea. I think we should 
create it to continue discussion after the first meeting. 

Julian

Re: Proposal to extend Calcite into a incremental query optimizer

2021-01-28 Thread Julian Hyde
I think good next steps would be a PR and a meeting. The PR will allow us to 
read the code, but I think we should do the first round of questions at the 
meeting.  The meeting could perhaps start with a presentation of the paper (do 
you have some slides you are planning to present at VLDB, Botong?) and then 
move on to questions about the concepts, which alternatives were considered, 
and how the concepts map onto other current and future concepts in Calcite. 

I don’t think we should start “reviewing” the PR line-by-line at this point. We 
need to understand the high-level concepts and design choices. If we start 
reviewing the PR we will get lost in the details. 

I know that integrating a major change is hard; I doubt that we will be able to 
integrate everything, but we can build understanding about where Calcite needs 
to go and, I hope, integrate a good amount of code to help us get there.

As I said before, after the integration I would like people to be able to 
experiment with it and use it in their production systems.  That way, it will 
not be an experiment that withers, but a feature set that integrates with other 
Calcite features and gets stronger over time. 

Julian

Re: Proposal to extend Calcite into a incremental query optimizer

2021-01-28 Thread Rui Wang
To participate in the discussion of the above questions, I will need to
read a lot more to learn the relevant context, and will likely ask lots of
questions :-).  An editable doc is probably good for questions and
back-and-forth discussion.


-Rui

Re: Proposal to extend Calcite into a incremental query optimizer

2021-01-28 Thread Rui Wang
I am also happy to help push this work into Calcite (review code and doc,
etc.).

While you can share your code so people get a better idea of how it is
implemented, I think it would also be nice to have a doc to discuss the open
questions above. I copy some of those points here:

1. Can this solution be made compatible with the existing solutions in
Calcite for streaming, materialized view maintenance, and multi-query
optimization (Sigma and Delta relational operators, lattice, and Spool
operator)?
2. Did you find that you needed two separate cost models - one for “view
maintenance” and another for “user queries” - since the objectives of each
activity are so different?
3. Will this work hasten the arrival of multi-objective parametric
query optimization [1] in Calcite?
4. Probably SQL shell support.


[1]:
https://cacm.acm.org/magazines/2017/10/221322-multi-objective-parametric-query-optimization/fulltext


-Rui



On Wed, Jan 27, 2021 at 6:52 PM Albert  wrote:

> it would be very nice to see a POC of your work.
>
>
> On Thu, Jan 28, 2021 at 10:21 AM Botong Huang  wrote:
>
> > Hi Julian,
> >
> > Just wondering if there are any updates? We are wondering if it would
> > help to post our code for a quick preview.
> >
> > Thanks,
> > Botong
> >
> > On Fri, Jan 1, 2021 at 11:04 AM Botong Huang  wrote:
> >
> > > Hi Julian,
> > >
> > > > Thanks for your interest! Sure, let's figure out a plan that best
> > > > benefits the community. Here are some clarifications that hopefully
> > > > answer your questions.
> > > >
> > > > In our work (Tempura), users specify the set of time points at which
> > > > to consider running, and a cost function that expresses their
> > > > preference over time. Tempura then generates the incremental plan
> > > > that minimizes the overall cost function.
> > > >
> > > > In this incremental plan, the sub-plans at different time points can
> > > > differ from each other, as opposed to the identical plans used in
> > > > all delta runs in streaming or IVM. As mentioned in Section 2.1 of
> > > > the Tempura paper, we can mimic the current streaming implementation
> > > > by specifying two (logical) time points in Tempura, representing the
> > > > initial run and the later delta runs respectively. In general, note
> > > > that Tempura supports various forms of incremental computing, not
> > > > only the small-delta, append-only data model of streaming systems.
> > > > That's why we believe Tempura subsumes the current streaming
> > > > support, as well as any IVM implementations.
> > > >
> > > > About the cost model, we did not come up with a separate cost model,
> > > > but rather extended the existing one. Similar to multi-objective
> > > > optimization, costs incurred at different time points are considered
> > > > different dimensions. Tempura lets users supply a function that
> > > > converts this cost vector into a final cost. So under this function,
> > > > any two incremental plans are still comparable and there is an
> > > > overall optimum. I guess we can go down the route of multi-objective
> > > > parametric query optimization instead if there is a need.
> > > >
> > > > Next, on materialized views and multi-query optimization: since our
> > > > multi-time-point plan naturally involves materializing intermediate
> > > > results for later time points, we need to solve the problem of
> > > > choosing materializations, and include the cost of saving and
> > > > reusing the materializations when costing and comparing plans. We
> > > > borrowed multi-query optimization techniques to solve this problem
> > > > even though we are looking at a single query. As a result, we think
> > > > our work is orthogonal to Calcite's facilities around utilizing
> > > > existing views, lattices, etc. We do feel that the multi-query
> > > > optimization component can be adopted for wider use, but we probably
> > > > need more suggestions from the community.
> > > >
> > > > Lastly, our current implementation is set up in Java code; it should
> > > > be straightforward to hook it up to the SQL shell.
> > >
> > > Thanks,
> > > Botong
> > >
> > > On Mon, Dec 28, 2020 at 6:44 PM Julian Hyde 
> > > wrote:
> > >
> > >> Botong,
> > >>
> > >> This is very exciting; congratulations on this research, and thank you
> > >> for contributing it back to Calcite.
> > >>
> > > >> The research touches several areas in Calcite: streaming, materialized
> > > >> view maintenance, and multi-query optimization. As we have already
> > > >> some solutions in those areas (Sigma and Delta relational operators,
> > > >> lattice, and Spool operator), it will be interesting to see whether we
> > > >> can make them compatible, or whether one concept can subsume others.
> > > >>
> > > >> Your work differs from streaming queries in that your relations are
> > > >> used by “external” user queries, whereas in pure streaming queries, the
> > > >> only activity is the change propagation. Did you find that you needed
> > > >> two separate cost models - one for “view maintenance” and another for
> > > >> “user queries” 
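
Botong's description of the cost model in the thread above (one cost
dimension per time point, reduced to a final cost by a user-supplied
function) can be illustrated with a minimal Java sketch. This is not
Tempura's or Calcite's code: the class CostVectorSketch, the Plan type, the
weightedSum function, the plan names and the cost numbers are all
hypothetical, chosen only to show how the same two plans can rank
differently under different user preferences.

```java
import java.util.List;
import java.util.function.ToDoubleFunction;

// Illustrative only: these names are hypothetical, not Tempura or Calcite APIs.
final class CostVectorSketch {

  /** A candidate incremental plan with one cost entry per time point. */
  static final class Plan {
    final String name;
    final double[] costPerTimePoint;

    Plan(String name, double... costPerTimePoint) {
      this.name = name;
      this.costPerTimePoint = costPerTimePoint;
    }
  }

  /** One possible user-supplied function: a weighted sum over time points. */
  static ToDoubleFunction<double[]> weightedSum(double... weights) {
    return costs -> {
      if (costs.length != weights.length) {
        throw new IllegalArgumentException("one weight per time point expected");
      }
      double total = 0.0;
      for (int i = 0; i < costs.length; i++) {
        total += weights[i] * costs[i];
      }
      return total;
    };
  }

  /** Returns the plan with the lowest final (scalar) cost, or null if none. */
  static Plan cheapest(List<Plan> plans, ToDoubleFunction<double[]> costFn) {
    Plan best = null;
    double bestCost = Double.POSITIVE_INFINITY;
    for (Plan plan : plans) {
      double cost = costFn.applyAsDouble(plan.costPerTimePoint);
      if (cost < bestCost) {
        best = plan;
        bestCost = cost;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    List<Plan> plans = List.of(
        new Plan("materialize-early", 10.0, 1.0),
        new Plan("recompute-late", 2.0, 6.0));

    // Equal weights: 10 + 1 = 11 versus 2 + 6 = 8.
    System.out.println(cheapest(plans, weightedSum(1.0, 1.0)).name);
    // Final time point weighted 5x: 10 + 5 = 15 versus 2 + 30 = 32.
    System.out.println(cheapest(plans, weightedSum(1.0, 5.0)).name);
  }
}
```

Because the function maps every cost vector to a single number, any two
plans remain comparable, which is the property Botong points out. A
weighted sum is only one choice; any function from the vector to a scalar
would fit the same shape.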

Re: Adding new operators to Calcite

2021-01-28 Thread Rui Wang
I think your ANY operator's signature is ANY(column_name, )?  In this
case you might use the Descriptor operator as an example of how an
operator that accepts column names works:
https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/sql/SqlDescriptorOperator.java


-Rui

On Wed, Jan 27, 2021 at 10:49 PM Lana Ramjit  wrote:

> Hi all,
> Apologies in advance for the longer e-mail! I am a grad student adapting
> Calcite for use in a research project prototype.  The functionality I am
> trying to add involves inserting non-executable operators into a plan to
> represent abstract sets of queries.
> For example, it takes two queries like,
>
> "select a from t"
> and
> "select b from t"
>
> and combines them into:
>
> "select any {a, b} from t".
> Similarly, an "any{}" operator can have, e.g. a list of filtering
> conditions and other grammatical options. These queries are not intended to
> ever be executed and only exist for us to write planning rules over!
>
> I am able to parse statements like the one above, but I am having a great
> deal of trouble trying to validate and convert to a relational expression.
> The output of the convertSqlToRel test I added is:
>
> LogicalProject(EXPR$0=[ANY($1)])
>  LogicalTableScan(table=[(t1, t2, t3)])
>
> I created a SqlNode named SqlAny and a RexNode RexAny that is parsed
> correctly, but after validation it is converted to a SqlBasicCall. I am
> able to catch it by examining the operator type in
> convertExtendedExpression() and return a RexAny node, but I am not sure
> what to add to the validator to avoid this hack and preserve the
> fieldnames that are operands to SqlAny.
>
> Please, any and all help is appreciated!
> Cheers,
> Lana
>
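
The "any {a, b}" idea in Lana's message can be made concrete with a small
Java sketch that works on plain strings. It deliberately uses no Calcite
classes and says nothing about how SqlAny or RexAny should be validated;
the class AnyProjectionSketch and its merge and expand methods are
hypothetical names, used only to show which set of concrete queries one
abstract query stands for.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: plain strings, no Calcite classes; all names are hypothetical.
final class AnyProjectionSketch {

  /** Folds the projected columns of several queries over one table into an any{} set. */
  static String merge(String table, List<String> columns) {
    Set<String> distinct = new LinkedHashSet<>(columns); // keeps first-seen order
    if (distinct.isEmpty()) {
      throw new IllegalArgumentException("at least one column expected");
    }
    return "select any {" + String.join(", ", distinct) + "} from " + table;
  }

  /** Expands an any{} column set back into the concrete queries it stands for. */
  static List<String> expand(String table, List<String> columns) {
    List<String> queries = new ArrayList<>();
    for (String column : new LinkedHashSet<>(columns)) {
      queries.add("select " + column + " from " + table);
    }
    return queries;
  }

  public static void main(String[] args) {
    // Prints: select any {a, b} from t
    System.out.println(merge("t", List.of("a", "b")));
    // Prints: [select a from t, select b from t]
    System.out.println(expand("t", List.of("a", "b")));
  }
}
```

The point of the sketch is that the column names are the payload of the
any{} node, which is why the validator step needs to preserve them as
operands rather than resolve them away.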


Re: custom Operator or customExpression

2021-01-28 Thread 盛森林
I have solved the problem. Thank you, Wang.



Sent from my iPhone


-- Original --
From: Rui Wang