Xurenhe,

Thank you for the detailed response and pointers. The use case you
describe for OLAP streaming queries is similar to what our team is trying
to accomplish.
We use calcite to query data stored in kudu [1]. We use kafka streams to
maintain materialized views that contain pre-aggregated data that rolls
data up to the daily or hourly granularity. One of the problems we face is
when a user submits a query where the time range does not align to the day
or hour boundary.  In this case we are unable to use the materialized views
and have to query the raw event which is not performant because the table
contains many more rows compared to the view.
The reason I wanted to modify the MaterializedViewRule is to use the "Union
rewriting with aggregate" ability [2] this would allow us to use both the
view and the table to answer such queries in the most efficient manner. I
looked at the code in SubstitutionVisitor but I don't think there are
existing rules there that do union rewriting with predicates.

The following is an example of the query we are trying to optimize. For the
materialized view
SELECT   eventid,  floor(ts to minute), count(*) as cnt
FROM     events
GROUP BY eventid, floor(ts TO minute)

and query
SELECT floor(ts to minute), count(*)
FROM     events
WHERE    ts >= timestamp'2018-01-01 00:01:30' AND ts < timestamp'2018-01-01
00:05:00'
GROUP BY eventid, floor(ts TO minute)

We would like to generate a plan that queries the table for the time range
[2018-01-01 00:01:30, 2018-01-01 00:02:00) and view for the time range
[2018-01-01 00:02:-0, 2018-01-01 00:05:00) similar to

EnumerableCalc(expr#0..2=[{inputs}], EXPR$0=[$t1], EXPR$1=[$t2])
  EnumerableAggregate(group=[{0, 1}], EXPR$1=[$SUM0($2)])
    EnumerableUnion(all=[true])
      EnumerableAggregate(group=[{0, 1}], EXPR$1=[COUNT()])
        EnumerableCalc(expr#0..1=[{inputs}], expr#2=[FLAG(MINUTE)],
expr#3=[FLOOR($t1, $t2)], expr#4=[Sarg[[2018-01-01 00:01:30..2018-01-01
00:02:00)]], expr#5=[SEARCH($t1, $t4)], eventid=[$t0], $f1=[$t3],
$condition=[$t5])
          EnumerableTableScan(table=[[hr, events]])
      EnumerableCalc(expr#0..2=[{inputs}], expr#3=[Sarg[[2018-01-01
00:02:00..2018-01-01 00:05:00)]], expr#4=[SEARCH($t1, $t3)],
proj#0..2=[{exprs}], $condition=[$t4])
        EnumerableTableScan(table=[[hr, MV0]])

Thank you,
Thomas

[1] https://github.com/twilio/calcite-kudu
[2]
https://calcite.apache.org/docs/materialized_views.html#union-rewriting-with-aggregate

On Fri, Jan 28, 2022 at 12:20 AM 徐仁和 <[email protected]> wrote:

> It’s useful in OLAP.
> Time ’s prediction bottom `Aggregate` could be pulled-up, because of Time’s
> granularity in Aggregate’s group.
> By the way, the query can be rewritten by mv, when the query's granularity
> is coarser than mv’s.
> eg:
> -- mv:
> SELECT   eventid,  floor(ts TO minute), count(*) as cnt
> FROM     events
> GROUP BY eventid, floor(ts TO minute)
> -- query:
> SELECT   floor(ts TO hour), count(*)
> FROM     events
> GROUP BY eventid, floor(ts TO hour)
>
> In the past, I tried to use stream-sql’s expression to solve it.
> This document is used to describe this problem, Here is a rough document.
> doc:
>
> https://urldefense.com/v3/__https://docs.google.com/document/d/1LtSgxhwvnpk2uAXFQiMSX-aQjnkJ6KLrXwXJX-VbsbE/edit*heading=h.ww6moavc8lkj__;Iw!!NCc8flgU!Lx31-Z0cEnGqJQwX-7zk1Kk3NvAaCblUwfyTg4jDUg0Y69SULj4jHJKPGC2W58A$
> My team has defined more `UnifyRule`s to solve it,
> because SubstitutionVisitor has high-scalability.
>
>
> Xurenhe
>
> Thomas D'Silva <[email protected]> 于2022年1月27日周四 02:54写道:
>
> > Hi,
> >
> > The existing MaterializedViewRule is not applied when the materialized
> view
> > does not have a view predicate but the query contains a predicate. For eg
> > for the following materialized view
> >
> > SELECT   eventid,  floor(ts to minute), count(*) as cnt
> > FROM     events
> > GROUP BY eventid, floor(ts TO minute)
> >
> > If we have the following query the view is not chosen in the final plan.
> >
> > SELECT   floor(ts to minute), count(*)
> > FROM     events
> > WHERE    ts >= timestamp'2018-01-01 00:01:00' AND ts <
> timestamp'2018-01-01
> > 00:05:00'
> > GROUP BY eventid, floor(ts TO minute)
> >
> > If I add the predicate "ts > timestamp'2018-01-01 00:01:00' AND ts <
> > timestamp'2018-01-01 00:05:00'" to the materialized view then the
> > materialized view is chosen in the final plan. Would it be possible to
> > modify the MaterializedViewRule to automatically add a view predicate
> based
> > on the query when the materialized view being considerent aggregates
> over a
> > column that contains a floor()? This predicate would also need to be
> > applied to the rewritten view. I would like to file a JIRA and work on
> this
> > improvement if this makes sense.
> >
> > Thank you,
> > Thomas
> >
>

Reply via email to