Re: MaterializedViewRule is not applied for queries on aggregate views that truncate timestamp

Julian Hyde Thu, 03 Feb 2022 10:05:44 -0800

These kinds of roll ups are very common. They have simple mathematical 
properties. There is a simplifying abstraction to be found here. Please find it 
and use it.


Julian

> On Feb 2, 2022, at 7:47 PM, Thomas D'Silva <tdsi...@twilio.com.invalid> wrote:
> 
> Hi Stamatis,
> 
> Thanks for your response. Your suggestion of having use case specific rules
> that translate time ranges that align on the rollup of the materialized
> view makes sense.
> However if the time range doesn't align for eg. if we have a view that
> rolls up to the minute and the query predicate is "WHERE  ts >=
> timestamp'2018-01-01 00:01:30' AND ts < timestamp'2018-01-01 00:05:00'" we
> wouldn't be able to rewrite it to apply the FLOOR() on the left side of the
> comparators.
> In this case I was trying to use the MaterializedViewRule's union rewriting
> feature to query the view for the range "ts >= timestamp'2018-01-01
> 00:02:00' AND ts < timestamp'2018-01-01 00:05:00'" and query the table for
> the range "ts >= timestamp'2018-01-01 00:01:30' AND ts <
> timestamp'2018-01-01 00:02:00'". I would appreciate any suggestions on how
> to do this.
> 
> Thank you,
> Thomas
> 
> 
>> On Wed, Feb 2, 2022 at 3:28 AM Stamatis Zampetakis <zabe...@gmail.com>
>> wrote:
>> 
>> Hi Thomas,
>> 
>> I haven't looked into the implementation details of the rule but I assume
>> the rewriting does not kick in because there is no easy way to relate the
>> "ts" column that is used in the query with the "floor(ts to minute)" column
>> that is materialized.
>> 
>> I assume that if your query is/was similar to the one below you would get a
>> rewriting.
>> 
>> SELECT   floor(ts to minute), count(*)
>> FROM     events
>> WHERE   floor(ts to minute) >= timestamp'2018-01-01 00:01:00' AND floor(ts
>> to minute) < timestamp'2018-01-01 00:05:00'
>> GROUP BY eventid, floor(ts TO minute)
>> 
>> So far I don't see a limitation/problem with the MaterializedViewRule but
>> rather use-case specific rules you imply they always hold.
>> 
>> If I understand correctly you say that the following are kind of
>> equivalent:
>> 
>> WHERE ts >= timestamp'2018-01-01 00:01:00' AND ts <
>> timestamp'2018-01-01 00:05:00'
>> WHERE floor(ts to minute) >= timestamp'2018-01-01 00:01:00' AND floor(ts to
>> minute) < timestamp'2018-01-01 00:05:00'
>> WHERE floor(ts to minute) >= floor(timestamp'2018-01-01 00:01:00' to
>> minute) AND floor(ts to minute) < floor(timestamp'2018-01-01 00:05:00' to
>> minute)
>> 
>> and I think they are based on the given timestamp literals.
>> 
>> Maybe instead of changing the rules performing the view rewriting what you
>> should do is introduce other use-case specific rules that can be applied to
>> the input queries and bring them to the desired form.
>> 
>> Best,
>> Stamatis
>> 
>> On Fri, Jan 28, 2022 at 10:12 PM Thomas D'Silva <tdsi...@twilio.com.invalid
>>> 
>> wrote:
>> 
>>> Xurenhe,
>>> 
>>> Thank you for the detailed response and pointers. The use case you
>>> describe for OLAP streaming queries is similar to what our team is trying
>>> to accomplish.
>>> We use calcite to query data stored in kudu [1]. We use kafka streams to
>>> maintain materialized views that contain pre-aggregated data that rolls
>>> data up to the daily or hourly granularity. One of the problems we face
>> is
>>> when a user submits a query where the time range does not align to the
>> day
>>> or hour boundary.  In this case we are unable to use the materialized
>> views
>>> and have to query the raw event which is not performant because the table
>>> contains many more rows compared to the view.
>>> The reason I wanted to modify the MaterializedViewRule is to use the
>> "Union
>>> rewriting with aggregate" ability [2] this would allow us to use both the
>>> view and the table to answer such queries in the most efficient manner. I
>>> looked at the code in SubstitutionVisitor but I don't think there are
>>> existing rules there that do union rewriting with predicates.
>>> 
>>> The following is an example of the query we are trying to optimize. For
>> the
>>> materialized view
>>> SELECT   eventid,  floor(ts to minute), count(*) as cnt
>>> FROM     events
>>> GROUP BY eventid, floor(ts TO minute)
>>> 
>>> and query
>>> SELECT floor(ts to minute), count(*)
>>> FROM     events
>>> WHERE    ts >= timestamp'2018-01-01 00:01:30' AND ts <
>> timestamp'2018-01-01
>>> 00:05:00'
>>> GROUP BY eventid, floor(ts TO minute)
>>> 
>>> We would like to generate a plan that queries the table for the time
>> range
>>> [2018-01-01 00:01:30, 2018-01-01 00:02:00) and view for the time range
>>> [2018-01-01 00:02:-0, 2018-01-01 00:05:00) similar to
>>> 
>>> EnumerableCalc(expr#0..2=[{inputs}], EXPR$0=[$t1], EXPR$1=[$t2])
>>>  EnumerableAggregate(group=[{0, 1}], EXPR$1=[$SUM0($2)])
>>>    EnumerableUnion(all=[true])
>>>      EnumerableAggregate(group=[{0, 1}], EXPR$1=[COUNT()])
>>>        EnumerableCalc(expr#0..1=[{inputs}], expr#2=[FLAG(MINUTE)],
>>> expr#3=[FLOOR($t1, $t2)], expr#4=[Sarg[[2018-01-01 00:01:30..2018-01-01
>>> 00:02:00)]], expr#5=[SEARCH($t1, $t4)], eventid=[$t0], $f1=[$t3],
>>> $condition=[$t5])
>>>          EnumerableTableScan(table=[[hr, events]])
>>>      EnumerableCalc(expr#0..2=[{inputs}], expr#3=[Sarg[[2018-01-01
>>> 00:02:00..2018-01-01 00:05:00)]], expr#4=[SEARCH($t1, $t3)],
>>> proj#0..2=[{exprs}], $condition=[$t4])
>>>        EnumerableTableScan(table=[[hr, MV0]])
>>> 
>>> Thank you,
>>> Thomas
>>> 
>>> [1]
>> https://urldefense.com/v3/__https://github.com/twilio/calcite-kudu__;!!NCc8flgU!KhDSBwwmEP1lp8FVI18hL250_jEF94FehJiaGVmrZIs3Q79JqY5_37EGNyWrpN8$
>>> [2]
>>> 
>>> 
>> https://urldefense.com/v3/__https://calcite.apache.org/docs/materialized_views.html*union-rewriting-with-aggregate__;Iw!!NCc8flgU!KhDSBwwmEP1lp8FVI18hL250_jEF94FehJiaGVmrZIs3Q79JqY5_37EGqqSl83w$
>>> 
>>>> On Fri, Jan 28, 2022 at 12:20 AM 徐仁和 <xurenhe19910...@gmail.com> wrote:
>>> 
>>>> It’s useful in OLAP.
>>>> Time ’s prediction bottom `Aggregate` could be pulled-up, because of
>>> Time’s
>>>> granularity in Aggregate’s group.
>>>> By the way, the query can be rewritten by mv, when the query's
>>> granularity
>>>> is coarser than mv’s.
>>>> eg:
>>>> -- mv:
>>>> SELECT   eventid,  floor(ts TO minute), count(*) as cnt
>>>> FROM     events
>>>> GROUP BY eventid, floor(ts TO minute)
>>>> -- query:
>>>> SELECT   floor(ts TO hour), count(*)
>>>> FROM     events
>>>> GROUP BY eventid, floor(ts TO hour)
>>>> 
>>>> In the past, I tried to use stream-sql’s expression to solve it.
>>>> This document is used to describe this problem, Here is a rough
>> document.
>>>> doc:
>>>> 
>>>> 
>>> 
>> https://urldefense.com/v3/__https://docs.google.com/document/d/1LtSgxhwvnpk2uAXFQiMSX-aQjnkJ6KLrXwXJX-VbsbE/edit*heading=h.ww6moavc8lkj__;Iw!!NCc8flgU!Lx31-Z0cEnGqJQwX-7zk1Kk3NvAaCblUwfyTg4jDUg0Y69SULj4jHJKPGC2W58A$
>>>> My team has defined more `UnifyRule`s to solve it,
>>>> because SubstitutionVisitor has high-scalability.
>>>> 
>>>> 
>>>> Xurenhe
>>>> 
>>>> Thomas D'Silva <tdsi...@twilio.com.invalid> 于2022年1月27日周四 02:54写道：
>>>> 
>>>>> Hi,
>>>>> 
>>>>> The existing MaterializedViewRule is not applied when the
>> materialized
>>>> view
>>>>> does not have a view predicate but the query contains a predicate.
>> For
>>> eg
>>>>> for the following materialized view
>>>>> 
>>>>> SELECT   eventid,  floor(ts to minute), count(*) as cnt
>>>>> FROM     events
>>>>> GROUP BY eventid, floor(ts TO minute)
>>>>> 
>>>>> If we have the following query the view is not chosen in the final
>>> plan.
>>>>> 
>>>>> SELECT   floor(ts to minute), count(*)
>>>>> FROM     events
>>>>> WHERE    ts >= timestamp'2018-01-01 00:01:00' AND ts <
>>>> timestamp'2018-01-01
>>>>> 00:05:00'
>>>>> GROUP BY eventid, floor(ts TO minute)
>>>>> 
>>>>> If I add the predicate "ts > timestamp'2018-01-01 00:01:00' AND ts <
>>>>> timestamp'2018-01-01 00:05:00'" to the materialized view then the
>>>>> materialized view is chosen in the final plan. Would it be possible
>> to
>>>>> modify the MaterializedViewRule to automatically add a view predicate
>>>> based
>>>>> on the query when the materialized view being considerent aggregates
>>>> over a
>>>>> column that contains a floor()? This predicate would also need to be
>>>>> applied to the rewritten view. I would like to file a JIRA and work
>> on
>>>> this
>>>>> improvement if this makes sense.
>>>>> 
>>>>> Thank you,
>>>>> Thomas
>>>>> 
>>>> 
>>> 
>>

Re: MaterializedViewRule is not applied for queries on aggregate views that truncate timestamp

Reply via email to