Re: Materialized view query rewriting with joins

Michael Mior Thu, 23 Jun 2016 14:27:12 -0700

I've been playing with this some more. The rule appears to perform the
correct substitution. That is, I change a LogicalProject on top of a Join
of two TableScans into a LogicalProject on top of a TableScan of the
materialized view. However, the new RelNode ends up in a different subset
within the state space of the planner so is never considered for usage. Any
idea why this could be happening? I'm just using the standard
VolcanoPlanner instance built when connecting via sqlline.


--
Michael Mior
michael.m...@gmail.com

2016-06-17 10:16 GMT-04:00 Michael Mior <mm...@uwaterloo.ca>:

> Here's the code I have so far
>
> https://gist.github.com/michaelmior/b2724b0c9fcd1f756f2e8379c99de66c
>
> I just add this rule along with MaterializedViewFilterScanRule in
> CalcitePrepareImpl. Note that there are probably some necessary checks
> missing and the code is a bit messy, but I'm hoping to get something simple
> working first. It's very possible that this will break on any nontrivial
> cases. Here's one example taken from the sample Cassandra dataset which
> *almost* seems to work. First I added the following materialization defined
>
>         {
>           table: 'userline',
>           sql: 'select \"users\".\"username\", \"tweets\".\"tweet_id\"
> from \"tweets\" join \"users\" on
> \"tweet_entity\".\"username\"=\"users\".\"username\"'
>         }
>
> The trivial case which I expect to work is that the query which defines
> the view should be rewritten to select directly from the view. After
> sifting through a lot of debug output, I can see that my new rule is indeed
> replacing the join on the two tables with a TableScan on the materialized
> view. However, the planner still fails to find a valid plan for the query.
>
> After filtering through the planner state, it seems at some point there is
> a cycle in the plan graph. However, I've been pretty careful not to
> construct such a situation so I'm not sure why this would be happening. Any
> further insight would be greatly appreciated :)
>
> Cheers,
> --
> Michael Mior
> michael.m...@gmail.com
>
> 2016-06-16 13:36 GMT-04:00 Jason Altekruse <ja...@dremio.com>:
>
>> I would be interested in taking a look at your work so far, if you can
>> throw it on a github gist or some other host that would allow for
>> commenting that would probably be best.
>>
>> Jason Altekruse
>> Software Engineer at Dremio
>> Apache Drill Committer
>>
>> On Wed, Jun 15, 2016 at 12:43 PM, Michael Mior <mm...@uwaterloo.ca>
>> wrote:
>>
>> > Any further ideas of what might be going on or suggestions on how to
>> debug
>> > this? I can put together a sample of the code I have so far if anyone is
>> > willing to spend a bit of time looking through it (~300 LOC). Thanks!
>> >
>> > --
>> > Michael Mior
>> > mm...@uwaterloo.ca
>> >
>> > 2016-06-13 14:49 GMT-04:00 Michael Mior <mm...@uwaterloo.ca>:
>> >
>> > > Julian,
>> > >
>> > > I understand that MultiJoin cannot be implemented directly. I'm only
>> > using
>> > > it to collect information on multiple joins into a single node since
>> it
>> > > makes further steps of the algorithm much easier. Ultimate the rewrite
>> > rule
>> > > is attempting to match a LogicalProject on top of a LogicalJoin of two
>> > > TableScans and convert it into a LogicalProject on top of a TableScan
>> of
>> > > the materialized view. The MultiJoin is only used internally within
>> the
>> > > rule and I never actually perform any transformation using the
>> MultiJoin.
>> > > The only transformation I am attempting is to replace a Join with a
>> > > TableScan. This is what is failing.
>> > >
>> > > --
>> > > Michael Mior
>> > > michael.m...@gmail.com
>> > >
>> > > 2016-06-13 14:44 GMT-04:00 Julian Hyde <jh...@apache.org>:
>> > >
>> > >> Using MultiJoin is a good idea. But remember that it was designed
>> with
>> > >> only one, very specific purpose: to gather joins to be optimized by a
>> > >> greedy algorithm.
>> > >>
>> > >> MultiJoin cannot be implemented directly. In fact the only rule that
>> can
>> > >> digest it and convert it into regular Join nodes is
>> > LoptOptimizeJoinRule.
>> > >> That might explain why you cannot find a plan.
>> > >>
>> > >> Julian
>> > >>
>> > >>
>> > >> > On Jun 12, 2016, at 11:33 AM, Michael Mior <mm...@uwaterloo.ca>
>> > wrote:
>> > >> >
>> > >> > I'm close to getting something working for the case of two tables.
>> I
>> > >> don't
>> > >> > think it should be too challenging to extend to multiple tables.
>> The
>> > >> gist
>> > >> > of the approach is to match a Join over two TableScans and then use
>> > >> > JoinToMultiJoin rule to get a MultiJoin instance. This is useful
>> > because
>> > >> > it's much easier to deal with with join conditions across more than
>> > two
>> > >> > tables with a MultiJoin than several join instances. From there,
>> the
>> > >> plan
>> > >> > is to find a materialized view with matching conditions and rewrite
>> > the
>> > >> > original to join to a LogicalProject over a TableScan over the
>> > >> materialized
>> > >> > view.
>> > >> >
>> > >> > To test this out, I created two instances of a new class called
>> > >> DummyTable
>> > >> > with infinite cost and created a materialized view over these
>> tables.
>> > I
>> > >> > hoped this would be helpful for debugging because it forces
>> Calcite to
>> > >> use
>> > >> > the materialized view to execute the query (since actually
>> performing
>> > >> scans
>> > >> > over these tables has infinite cost).
>> > >> >
>> > >> > The matching and substitution part seems to be working great, but
>> > >> Calcite
>> > >> > still fails to find a plan for the query. Any help would be
>> > appreciated.
>> > >> > Below are some relevant components of different parts of the plan.
>> > >> (Sorry,
>> > >> > they're quite long). Thanks!
>> > >> >
>> > >> >
>> > >> >
>> > >> >
>> > >> >
>> > >> >
>> > >> > Original query matched by the new rule
>> > >> >
>> > >> > LogicalProject(following=[$5], time=[$1], tweet_id=[$0]): rowcount
>> =
>> > >> > 1.7976931348623157E308, cumulative cost = {1.7976931348623157E308
>> > >> > rows, Infinity cpu, 0.0 io}, id = 641
>> > >> >  LogicalJoin(subset=[rel#636:Subset#22.NONE.[]], condition=[=($5,
>> > $3)],
>> > >> > joinType=[inner]): rowcount = 1.7976931348623157E308, cumulat
>> > >> > ive cost = {1.7976931348623157E308 rows, 0.0 cpu, 0.0 io}, id = 634
>> > >> >    DummyScan(subset=[rel#33:Subset#0.ENUMERABLE.[]],
>> > table=[[twissandra,
>> > >> > tweet]]): rowcount = 1.7976931348623157E308, cu
>> > >> > mulative cost = {inf}, id = 0
>> > >> >    DummyScan(subset=[rel#98:Subset#8.ENUMERABLE.[]],
>> > table=[[twissandra,
>> > >> > followers]]): rowcount = 1.7976931348623157E308
>> > >> > , cumulative cost = {inf}, id = 18
>> > >> >
>> > >> > Original query rewritten to use a MultiJoin
>> > >> >
>> > >> > LogicalProject(following=[$3], time=[$1], tweet_id=[$0]): rowcount
>> =
>> > >> 1.0,
>> > >> > cumulative cost = {inf}, id = 667
>> > >> >  MultiJoin(joinFilter=[=($3, $2)], isFullOuterJoin=[false],
>> > >> > joinTypes=[[INNER, INNER]], outerJoinConditions=[[NULL, NULL]],
>> > >> projField
>> > >> > s=[[{0, 1}, {0}]]): rowcount = 1.0, cumulative cost = {inf}, id =
>> 664
>> > >> >    LogicalProject(tweet_id=[$0], time=[$1], username=[$3]):
>> rowcount =
>> > >> > 1.7976931348623157E308, cumulative cost = {inf}, id = 651
>> > >> >      DummyScan(table=[[twissandra, tweet]]): rowcount =
>> > >> > 1.7976931348623157E308, cumulative cost = {inf}, id = 0
>> > >> >    LogicalProject(following=[$1]): rowcount =
>> 1.7976931348623157E308,
>> > >> > cumulative cost = {inf}, id = 652
>> > >> >      DummyScan(table=[[twissandra, followers]]): rowcount =
>> > >> > 1.7976931348623157E308, cumulative cost = {inf}, id = 18
>> > >> >
>> > >> > Replacement scan over the materialized view (this is what is
>> > >> substituted in
>> > >> > the original query)
>> > >> >
>> > >> > LogicalProject(following=[$1], time=[$2], tweet_id=[$0]): rowcount
>> =
>> > >> 100.0,
>> > >> > cumulative cost = {300.0 rows, 701.0 cpu, 0.0 io}, id = 69
>> > >> > 3
>> > >> >  LogicalProject(username=[$0], time=[CAST($1):TIMESTAMP(0)],
>> > >> > tweet_id=[CAST($2):VARCHAR(1) CHARACTER SET "ISO-8859-1" COLLATE
>> > "ISO-88
>> > >> > 59-1$en_US$primary"]): rowcount = 100.0, cumulative cost = {200.0
>> > rows,
>> > >> > 401.0 cpu, 0.0 io}, id = 44
>> > >> >    CassandraTableScan(table=[[twissandra, timeline]]): rowcount =
>> > 100.0,
>> > >> > cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io}, id = 23
>> > >> >
>> > >> > --
>> > >> > Michael Mior
>> > >> > michael.m...@gmail.com <mailto:michael.m...@gmail.com>
>> > >> >
>> > >> > 2016-06-07 1:39 GMT-04:00 Amogh Margoor <amo...@qubole.com
>> <mailto:
>> > >> amo...@qubole.com>>:
>> > >> >
>> > >> >> Sorry, I missed Lattices as I didn't know about Lattices without
>> > >> >> aggregations. Btw there is this rule where we try to replace star
>> > table
>> > >> >> scan by Tiles of Lattices with filters:
>> > >> >>
>> > >> >>
>> > >>
>> >
>> https://github.com/qubole/quark/blob/master/optimizer/src/main/java/com/qubole/quark/planner/FilterAggStarRule.java
>> > >> <
>> > >>
>> >
>> https://github.com/qubole/quark/blob/master/optimizer/src/main/java/com/qubole/quark/planner/FilterAggStarRule.java
>> > >> >
>> > >> >> I thought it might be useful reference if we are trying to
>> optimise
>> > for
>> > >> >> materialised views defined on joins and filters(  e.g., MV
>> defined by
>> > >> >> "select * from A join B on A.id=B.aid where A.x >10 and B.y <
>> > >> '2015-01-10'"
>> > >> >> )
>> > >> >>
>> > >> >> Regards,
>> > >> >> Amogh
>> > >> >>
>> > >> >> On Tue, Jun 7, 2016 at 7:06 AM, Julian Hyde <jh...@apache.org
>> > <mailto:
>> > >> jh...@apache.org>> wrote:
>> > >> >>
>> > >> >>> Yes, I think you could make Lattices work without aggregation, as
>> > >> long as
>> > >> >>> you keep the key assumption, which is that dropping joined tables
>> > does
>> > >> >> not
>> > >> >>> change the number of rows (so, primary key and foreign key
>> > integrity,
>> > >> >>> relationships are all many-to-one from the root (fact) table, and
>> > you
>> > >> >> have
>> > >> >>> to include the root table in your query).
>> > >> >>>
>> > >> >>> I might end up using lattices with and without aggregation in the
>> > >> Druid
>> > >> >>> adapter. Because Druid cannot do joins, I plan to map each Druid
>> > table
>> > >> >> to a
>> > >> >>> lattice, so that you can satisfy a star query using a Druid
>> table.
>> > >> Druid
>> > >> >> is
>> > >> >>> *mainly* for aggregation, but Druid just introduced a “select”
>> > >> query[1]
>> > >> >>> that returns raw rows, and the Druid adapter can execute
>> > >> non-aggregation
>> > >> >>> queries.
>> > >> >>>
>> > >> >>> Once you’ve translated a query "A join B join C” onto a virtual
>> > table
>> > >> ABC
>> > >> >>> I suppose you could then apply other materialized view
>> techniques.
>> > >> >>>
>> > >> >>> If you were able to implement the Goldstein & Larson algorithm it
>> > >> would
>> > >> >> be
>> > >> >>> interesting to compose it with other materialized view
>> substitutions
>> > >> in
>> > >> >>> Calcite. After all, each materialized view substitution just
>> > generates
>> > >> >> more
>> > >> >>> alternative queries to optimize, so I suspect that we could
>> compose
>> > >> >>> algorithms.
>> > >> >>>
>> > >> >>> Julian
>> > >> >>>
>> > >> >>> [1] http://druid.io/docs/latest/querying/select-query.html <
>> > >> >>> http://druid.io/docs/latest/querying/select-query.html <
>> > >> http://druid.io/docs/latest/querying/select-query.html>>
>> > >> >>>
>> > >> >>>
>> > >> >>>> On Jun 6, 2016, at 6:06 PM, Michael Mior <mm...@uwaterloo.ca>
>> > wrote:
>> > >> >>>>
>> > >> >>>> Thanks for the pointer. Any chance there's a good example
>> somewhere
>> > >> >> using
>> > >> >>>> lattices that would help me get something working? Also, is it
>> > >> possible
>> > >> >>> to
>> > >> >>>> have lattices with no aggregation? Unfortunately either way it
>> > >> doesn't
>> > >> >>>> completely suit my use case since I'm going to have to do
>> rewriting
>> > >> on
>> > >> >>>> joins aside from star queries. That said if I can get basic
>> > equijoins
>> > >> >>>> working that would be enough for me.
>> > >> >>>>
>> > >> >>>> I started trying to write a unification rule but decided it
>> would
>> > be
>> > >> >>> easier
>> > >> >>>> to do something like MaterializedViewFilterScanRule. The first
>> > >> problem
>> > >> >>> that
>> > >> >>>> I've run into is trying to get a Join instance in a rule where
>> the
>> > >> left
>> > >> >>> and
>> > >> >>>> right inputs are TableScan instances. Even if I have a
>> > >> >> RelOptRuleOperand
>> > >> >>>> that looks like the following I end up with a join whose left
>> and
>> > >> right
>> > >> >>>> inputs are RelSubsets.
>> > >> >>>>
>> > >> >>>> operand(LogicalJoin.class, operand(TableScan.class, any()),
>> > >> >>>> operand(TableScan.class, any()))
>> > >> >>>>
>> > >> >>>> Anyway, I've been looking into implementing the approach from
>> the
>> > >> >>> following
>> > >> >>>> paper. It's very nicely written and easy to follow. It also
>> doesn't
>> > >> >>> require
>> > >> >>>> trying different join permutations. I'm starting with several
>> > >> >> additional
>> > >> >>>> restrictions (only equijoins, no aggregations, etc.)
>> > >> >>>>
>> > >> >>>>
>> > >> >>>
>> > >> >>
>> > >>
>> >
>> ftp://ftp.cse.buffalo.edu/users/azhang/disc/SIGMOD/pdf-files/331/202-optimizing.pdf
>> > >> >>>>
>> > >> >>>> --
>> > >> >>>> Michael Mior
>> > >> >>>> michael.m...@gmail.com
>> > >> >>>>
>> > >> >>>> 2016-06-06 19:34 GMT-04:00 Julian Hyde <jh...@apache.org>:
>> > >> >>>>
>> > >> >>>>> Short answer: yes, if you use lattices.
>> > >> >>>>>
>> > >> >>>>> Remember, Calcite has two mechanisms for matching materialized
>> > >> views:
>> > >> >>> the
>> > >> >>>>> "general purpose" mechanism, and lattices.
>> > >> >>>>>
>> > >> >>>>> Using general purpose mechanism you can declare a materialized
>> > view
>> > >> >>> based
>> > >> >>>>> on any query, but there needs to be a unification rule to
>> rewrite
>> > >> your
>> > >> >>>>> query onto that view. (That is, rewrite your query using
>> > >> >>>>> semantics-preserving algebraic transformations such that one
>> > branch
>> > >> of
>> > >> >>> the
>> > >> >>>>> rewritten query is identical to the query that defines the
>> > >> >> materialized
>> > >> >>>>> view. See
>> > >> http://www.slideshare.net/julianhyde/calcite-algebraedw2015
>> > >> >> <
>> > >> >>>>> http://www.slideshare.net/julianhyde/calcite-algebraedw2015>
>> > slides
>> > >> >>> 15+.)
>> > >> >>>>> There are unification rules for scan, filter, project and some
>> > kinds
>> > >> >> of
>> > >> >>>>> aggregation, but not join.
>> > >> >>>>>
>> > >> >>>>> Knowing how difficult it was to write unification rules, and
>> > knowing
>> > >> >> how
>> > >> >>>>> common was the use case of star queries (scan - filter - join -
>> > >> >>> aggregate),
>> > >> >>>>> I invented lattices. First you define a lattice (a collection
>> of
>> > >> >> tables
>> > >> >>>>> joined using many-to-one relationships), then you (or the
>> system)
>> > >> >>> defines
>> > >> >>>>> materialized views on top of that lattice. If an incoming query
>> > uses
>> > >> >>> tables
>> > >> >>>>> from that lattice, joined using the same join keys, then
>> Calcite
>> > >> will
>> > >> >>>>> consider substituting any materialized views in that lattice.
>> > >> >>>>>
>> > >> >>>>> Matching queries to lattice materialized views is much more
>> > >> efficient
>> > >> >>> than
>> > >> >>>>> matching to general purpose materialized views, two reasons.
>> (a)
>> > The
>> > >> >>> joins
>> > >> >>>>> are removed, the materialized view and the query become just
>> > >> projects
>> > >> >> of
>> > >> >>>>> the lattice’s columns, and we avoid the combinatorial explosion
>> > that
>> > >> >>> occurs
>> > >> >>>>> when joins are permuted. (b) A lattice can contain dozens of
>> > >> >>> materialized
>> > >> >>>>> views but it is clear, because they are partially ordered,
>> which
>> > is
>> > >> >> the
>> > >> >>>>> best for a given query, and they are all considered at the same
>> > >> time.
>> > >> >>>>>
>> > >> >>>>> Julian
>> > >> >>>>>
>> > >> >>>>>> On Jun 5, 2016, at 18:58, amo...@qubole.com wrote:
>> > >> >>>>>>
>> > >> >>>>>> AFAIK it doesn't. Not sure if some work has been done towards
>> it.
>> > >> >>>>>> Sent from my iPhone
>> > >> >>>>>>
>> > >> >>>>>>> On 06-Jun-2016, at 7:08 AM, Michael Mior <mm...@uwaterloo.ca
>> >
>> > >> >> wrote:
>> > >> >>>>>>>
>> > >> >>>>>>> Am I correct in understanding that Calcite doesn't currently
>> > >> rewrite
>> > >> >>>>>>> queries to use materialized views if the query which defines
>> the
>> > >> >> view
>> > >> >>>>>>> includes a join? If this is the case, has there been any work
>> > >> >> towards
>> > >> >>>>>>> making this happen?
>> > >> >>>>>>>
>> > >> >>>>>>> --
>> > >> >>>>>>> Michael Mior
>> > >> >>>>>>> michael.m...@gmail.com
>> > >>
>> > >>
>> > >
>> >
>>
>
>

Re: Materialized view query rewriting with joins

Reply via email to