Re: Materialized view query rewriting with joins

Michael Mior Mon, 13 Jun 2016 11:50:22 -0700

Julian,

I understand that MultiJoin cannot be implemented directly. I'm only using
it to collect information on multiple joins into a single node since it
makes further steps of the algorithm much easier. Ultimate the rewrite rule
is attempting to match a LogicalProject on top of a LogicalJoin of two
TableScans and convert it into a LogicalProject on top of a TableScan of
the materialized view. The MultiJoin is only used internally within the
rule and I never actually perform any transformation using the MultiJoin.
The only transformation I am attempting is to replace a Join with a
TableScan. This is what is failing.


--
Michael Mior
michael.m...@gmail.com

2016-06-13 14:44 GMT-04:00 Julian Hyde <jh...@apache.org>:

> Using MultiJoin is a good idea. But remember that it was designed with
> only one, very specific purpose: to gather joins to be optimized by a
> greedy algorithm.
>
> MultiJoin cannot be implemented directly. In fact the only rule that can
> digest it and convert it into regular Join nodes is LoptOptimizeJoinRule.
> That might explain why you cannot find a plan.
>
> Julian
>
>
> > On Jun 12, 2016, at 11:33 AM, Michael Mior <mm...@uwaterloo.ca> wrote:
> >
> > I'm close to getting something working for the case of two tables. I
> don't
> > think it should be too challenging to extend to multiple tables. The gist
> > of the approach is to match a Join over two TableScans and then use
> > JoinToMultiJoin rule to get a MultiJoin instance. This is useful because
> > it's much easier to deal with with join conditions across more than two
> > tables with a MultiJoin than several join instances. From there, the plan
> > is to find a materialized view with matching conditions and rewrite the
> > original to join to a LogicalProject over a TableScan over the
> materialized
> > view.
> >
> > To test this out, I created two instances of a new class called
> DummyTable
> > with infinite cost and created a materialized view over these tables. I
> > hoped this would be helpful for debugging because it forces Calcite to
> use
> > the materialized view to execute the query (since actually performing
> scans
> > over these tables has infinite cost).
> >
> > The matching and substitution part seems to be working great, but Calcite
> > still fails to find a plan for the query. Any help would be appreciated.
> > Below are some relevant components of different parts of the plan.
> (Sorry,
> > they're quite long). Thanks!
> >
> >
> >
> >
> >
> >
> > Original query matched by the new rule
> >
> > LogicalProject(following=[$5], time=[$1], tweet_id=[$0]): rowcount =
> > 1.7976931348623157E308, cumulative cost = {1.7976931348623157E308
> > rows, Infinity cpu, 0.0 io}, id = 641
> >  LogicalJoin(subset=[rel#636:Subset#22.NONE.[]], condition=[=($5, $3)],
> > joinType=[inner]): rowcount = 1.7976931348623157E308, cumulat
> > ive cost = {1.7976931348623157E308 rows, 0.0 cpu, 0.0 io}, id = 634
> >    DummyScan(subset=[rel#33:Subset#0.ENUMERABLE.[]], table=[[twissandra,
> > tweet]]): rowcount = 1.7976931348623157E308, cu
> > mulative cost = {inf}, id = 0
> >    DummyScan(subset=[rel#98:Subset#8.ENUMERABLE.[]], table=[[twissandra,
> > followers]]): rowcount = 1.7976931348623157E308
> > , cumulative cost = {inf}, id = 18
> >
> > Original query rewritten to use a MultiJoin
> >
> > LogicalProject(following=[$3], time=[$1], tweet_id=[$0]): rowcount = 1.0,
> > cumulative cost = {inf}, id = 667
> >  MultiJoin(joinFilter=[=($3, $2)], isFullOuterJoin=[false],
> > joinTypes=[[INNER, INNER]], outerJoinConditions=[[NULL, NULL]], projField
> > s=[[{0, 1}, {0}]]): rowcount = 1.0, cumulative cost = {inf}, id = 664
> >    LogicalProject(tweet_id=[$0], time=[$1], username=[$3]): rowcount =
> > 1.7976931348623157E308, cumulative cost = {inf}, id = 651
> >      DummyScan(table=[[twissandra, tweet]]): rowcount =
> > 1.7976931348623157E308, cumulative cost = {inf}, id = 0
> >    LogicalProject(following=[$1]): rowcount = 1.7976931348623157E308,
> > cumulative cost = {inf}, id = 652
> >      DummyScan(table=[[twissandra, followers]]): rowcount =
> > 1.7976931348623157E308, cumulative cost = {inf}, id = 18
> >
> > Replacement scan over the materialized view (this is what is substituted
> in
> > the original query)
> >
> > LogicalProject(following=[$1], time=[$2], tweet_id=[$0]): rowcount =
> 100.0,
> > cumulative cost = {300.0 rows, 701.0 cpu, 0.0 io}, id = 69
> > 3
> >  LogicalProject(username=[$0], time=[CAST($1):TIMESTAMP(0)],
> > tweet_id=[CAST($2):VARCHAR(1) CHARACTER SET "ISO-8859-1" COLLATE "ISO-88
> > 59-1$en_US$primary"]): rowcount = 100.0, cumulative cost = {200.0 rows,
> > 401.0 cpu, 0.0 io}, id = 44
> >    CassandraTableScan(table=[[twissandra, timeline]]): rowcount = 100.0,
> > cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io}, id = 23
> >
> > --
> > Michael Mior
> > michael.m...@gmail.com <mailto:michael.m...@gmail.com>
> >
> > 2016-06-07 1:39 GMT-04:00 Amogh Margoor <amo...@qubole.com <mailto:
> amo...@qubole.com>>:
> >
> >> Sorry, I missed Lattices as I didn't know about Lattices without
> >> aggregations. Btw there is this rule where we try to replace star table
> >> scan by Tiles of Lattices with filters:
> >>
> >>
> https://github.com/qubole/quark/blob/master/optimizer/src/main/java/com/qubole/quark/planner/FilterAggStarRule.java
> <
> https://github.com/qubole/quark/blob/master/optimizer/src/main/java/com/qubole/quark/planner/FilterAggStarRule.java
> >
> >> I thought it might be useful reference if we are trying to optimise for
> >> materialised views defined on joins and filters(  e.g., MV defined by
> >> "select * from A join B on A.id=B.aid where A.x >10 and B.y <
> '2015-01-10'"
> >> )
> >>
> >> Regards,
> >> Amogh
> >>
> >> On Tue, Jun 7, 2016 at 7:06 AM, Julian Hyde <jh...@apache.org <mailto:
> jh...@apache.org>> wrote:
> >>
> >>> Yes, I think you could make Lattices work without aggregation, as long
> as
> >>> you keep the key assumption, which is that dropping joined tables does
> >> not
> >>> change the number of rows (so, primary key and foreign key integrity,
> >>> relationships are all many-to-one from the root (fact) table, and you
> >> have
> >>> to include the root table in your query).
> >>>
> >>> I might end up using lattices with and without aggregation in the Druid
> >>> adapter. Because Druid cannot do joins, I plan to map each Druid table
> >> to a
> >>> lattice, so that you can satisfy a star query using a Druid table.
> Druid
> >> is
> >>> *mainly* for aggregation, but Druid just introduced a “select” query[1]
> >>> that returns raw rows, and the Druid adapter can execute
> non-aggregation
> >>> queries.
> >>>
> >>> Once you’ve translated a query "A join B join C” onto a virtual table
> ABC
> >>> I suppose you could then apply other materialized view techniques.
> >>>
> >>> If you were able to implement the Goldstein & Larson algorithm it would
> >> be
> >>> interesting to compose it with other materialized view substitutions in
> >>> Calcite. After all, each materialized view substitution just generates
> >> more
> >>> alternative queries to optimize, so I suspect that we could compose
> >>> algorithms.
> >>>
> >>> Julian
> >>>
> >>> [1] http://druid.io/docs/latest/querying/select-query.html <
> >>> http://druid.io/docs/latest/querying/select-query.html <
> http://druid.io/docs/latest/querying/select-query.html>>
> >>>
> >>>
> >>>> On Jun 6, 2016, at 6:06 PM, Michael Mior <mm...@uwaterloo.ca> wrote:
> >>>>
> >>>> Thanks for the pointer. Any chance there's a good example somewhere
> >> using
> >>>> lattices that would help me get something working? Also, is it
> possible
> >>> to
> >>>> have lattices with no aggregation? Unfortunately either way it doesn't
> >>>> completely suit my use case since I'm going to have to do rewriting on
> >>>> joins aside from star queries. That said if I can get basic equijoins
> >>>> working that would be enough for me.
> >>>>
> >>>> I started trying to write a unification rule but decided it would be
> >>> easier
> >>>> to do something like MaterializedViewFilterScanRule. The first problem
> >>> that
> >>>> I've run into is trying to get a Join instance in a rule where the
> left
> >>> and
> >>>> right inputs are TableScan instances. Even if I have a
> >> RelOptRuleOperand
> >>>> that looks like the following I end up with a join whose left and
> right
> >>>> inputs are RelSubsets.
> >>>>
> >>>> operand(LogicalJoin.class, operand(TableScan.class, any()),
> >>>> operand(TableScan.class, any()))
> >>>>
> >>>> Anyway, I've been looking into implementing the approach from the
> >>> following
> >>>> paper. It's very nicely written and easy to follow. It also doesn't
> >>> require
> >>>> trying different join permutations. I'm starting with several
> >> additional
> >>>> restrictions (only equijoins, no aggregations, etc.)
> >>>>
> >>>>
> >>>
> >>
> ftp://ftp.cse.buffalo.edu/users/azhang/disc/SIGMOD/pdf-files/331/202-optimizing.pdf
> >>>>
> >>>> --
> >>>> Michael Mior
> >>>> michael.m...@gmail.com
> >>>>
> >>>> 2016-06-06 19:34 GMT-04:00 Julian Hyde <jh...@apache.org>:
> >>>>
> >>>>> Short answer: yes, if you use lattices.
> >>>>>
> >>>>> Remember, Calcite has two mechanisms for matching materialized views:
> >>> the
> >>>>> "general purpose" mechanism, and lattices.
> >>>>>
> >>>>> Using general purpose mechanism you can declare a materialized view
> >>> based
> >>>>> on any query, but there needs to be a unification rule to rewrite
> your
> >>>>> query onto that view. (That is, rewrite your query using
> >>>>> semantics-preserving algebraic transformations such that one branch
> of
> >>> the
> >>>>> rewritten query is identical to the query that defines the
> >> materialized
> >>>>> view. See
> http://www.slideshare.net/julianhyde/calcite-algebraedw2015
> >> <
> >>>>> http://www.slideshare.net/julianhyde/calcite-algebraedw2015> slides
> >>> 15+.)
> >>>>> There are unification rules for scan, filter, project and some kinds
> >> of
> >>>>> aggregation, but not join.
> >>>>>
> >>>>> Knowing how difficult it was to write unification rules, and knowing
> >> how
> >>>>> common was the use case of star queries (scan - filter - join -
> >>> aggregate),
> >>>>> I invented lattices. First you define a lattice (a collection of
> >> tables
> >>>>> joined using many-to-one relationships), then you (or the system)
> >>> defines
> >>>>> materialized views on top of that lattice. If an incoming query uses
> >>> tables
> >>>>> from that lattice, joined using the same join keys, then Calcite will
> >>>>> consider substituting any materialized views in that lattice.
> >>>>>
> >>>>> Matching queries to lattice materialized views is much more efficient
> >>> than
> >>>>> matching to general purpose materialized views, two reasons. (a) The
> >>> joins
> >>>>> are removed, the materialized view and the query become just projects
> >> of
> >>>>> the lattice’s columns, and we avoid the combinatorial explosion that
> >>> occurs
> >>>>> when joins are permuted. (b) A lattice can contain dozens of
> >>> materialized
> >>>>> views but it is clear, because they are partially ordered, which is
> >> the
> >>>>> best for a given query, and they are all considered at the same time.
> >>>>>
> >>>>> Julian
> >>>>>
> >>>>>> On Jun 5, 2016, at 18:58, amo...@qubole.com wrote:
> >>>>>>
> >>>>>> AFAIK it doesn't. Not sure if some work has been done towards it.
> >>>>>> Sent from my iPhone
> >>>>>>
> >>>>>>> On 06-Jun-2016, at 7:08 AM, Michael Mior <mm...@uwaterloo.ca>
> >> wrote:
> >>>>>>>
> >>>>>>> Am I correct in understanding that Calcite doesn't currently
> rewrite
> >>>>>>> queries to use materialized views if the query which defines the
> >> view
> >>>>>>> includes a join? If this is the case, has there been any work
> >> towards
> >>>>>>> making this happen?
> >>>>>>>
> >>>>>>> --
> >>>>>>> Michael Mior
> >>>>>>> michael.m...@gmail.com
>
>

Re: Materialized view query rewriting with joins

Reply via email to