Re: Using indexes rather than table scans with Calcite

xu Sun, 31 May 2020 09:28:58 -0700

Hi Tim,

I am working on MySQL InnoDB adapter and trying to introduce this to
Calcite, currently it is only in early stage, and not approved/reviewed by
committers yet. Anyway, we are facing the same problem like what index to
use, how to push down order by operation, etc. I have developed a simple
rule based adapter to be "index aware" and being able to leverage a MySQL
InnoDB storage engine written in Java. Hope this will help you to explore
more options.


https://issues.apache.org/jira/browse/CALCITE-4034

Thanks,
Xu

Haisheng Yuan <hy...@apache.org> 于2020年5月31日周日 下午10:06写道：

> Hi Roman,
>
> Thank you for sharing your thoughts.
>
> > It can be very tricky because the rule should consider not
> > only filters, but also collations. This leads to increasing the
> > complexity of such rules.
>
> Logical transformation rules like FilterTableScan2IndexTableScanRule
> should not consider physical properties, like collation, distribution. You
> forgot that we just reached consensus in CALCITE-3972. :)
>
> Regarding the 2nd option that uses index b, it is indeed not that easy for
> Calcite 1.22.0. In latest version, it now becomes possible. After rule
> transformation, during top-down trait request, the collation is passed
> through Filter, down to physical TableScan, which accepts the trait request
> with collation on b, find there is an index on b, and return a new RelNode
> IndexScan. This process can be done in
> EnumerableTableScan#passThrough(required).
>
> > I hope, when the Cascades-style optimizer become a part of Calcite, the
> search space
> > pollution will not be a serious issue anymore.
>
> I hope so too. Top-down style can help alleviate space pollution by space
> pruning. But the space pollution caused by LogicalProject operator and its
> related rules still can't be avoided. :)
>
> Again, thanks for sharing your experience with us.
>
> Haisheng
>
> On 2020/05/31 09:58:36, Roman Kondakov <kondako...@mail.ru.INVALID>
> wrote:
> > Hi Haisheng,
> >
> > The basic rationale behind the using materialized views for secondary
> > index representation instead of special rules like mentioned
> > FilterTableScan2IndexTableScanRule is the simplicity of implementation.
> >
> > You are absolutely right that materialized views approach has an obvious
> > drawback that it should register all indexes as materialized views in
> > the optimizer's search space. But we expect our users will not overuse
> > indexes because in general having too many indexes is a bad practice: on
> > each table update we also should update it's indexes and it can cause
> > some performance degradation of the system as a whole, not only
> > optimizer. So we expect the number of indexes will be relatively small.
> >
> > Ignite uses indexes not only for index lookups, but also for exploiting
> > it's sortedness. In this case materialized view's approach can show some
> > advantages. Let's consider the example:
> >
> > SELECT * FROM foo WHERE a > 100 ORDER BY b;
> >
> > where both fields 'a' and 'b' are indexed. In this case we will have two
> > alternatives of the query execution:
> >
> > 1. Use index 'a' for index conditioned scan and then sort rows by 'b'
> > 2. Use index scan over 'b' and then apply filter over 'a' - here we can
> > avoid sorting, because index over 'b' is already sorted.
> >
> > If I understand the approach with FilterTableScan2IndexTableScanRule
> > correctly, at this step the rule should make a decision about which
> > index to use. It can be very tricky because the rule should consider not
> > only filters, but also collations. This leads to increasing the
> > complexity of such rules. With materialized views approach we just
> > register all indexes with their caollation and let the cost model do its
> > job. Our rules are very simple. The complexity is encapsulated in the
> > index scan cost estimation.
> >
> > As for your example with disjunctive predicate:
> >
> > > SELECT * FROM foo WHERE a > 100 or b < 1000;
> >
> > I think with materialized views approach we can do a complex logic as
> > well. For example, we may implement a special rule for such cases:
> > BitmapOrRule. Or, even better, we will rewrite such predicates to a
> > UNION ALL clause. A very good explanation and a comparison of both
> > approaches I've found in this Oracle's blog post [1].
> >
> > As a conclusion: choosing between materialized views approach and
> > IndexRule-based approach is a trade-off between complexity of
> > implementation and search space pollution. I hope, when the
> > Cascades-style optimizer become a part of Calcite, the search space
> > pollution will not be a serious issue anymore.
> >
> >
> > Thanks!
> >
> >
> > [1]
> >
> https://blogs.oracle.com/optimizer/optimizer-transformations:-or-expansion
> >
> > --
> > Kind Regards
> > Roman Kondakov
> >
> >
> > On 31.05.2020 07:31, Haisheng Yuan wrote:
> > > Thanks Julian and Roman for sharing the experiences for modeling
> indexes.
> > >
> > > Besides using materialized views, which is already proven by Phoenix
> and Ignite, there is another approach, as mentioned by Vladimir, define
> your own rules and indexscan operators.
> > >
> > > FilterTableScan2IndexScanRule and its variances, which match Filter
> over TableScan, create an IndexScan if the table has corresponding index on
> the filter column. You don't have to register different tablescans for all
> the indexes, in case you have nearly thousands indexes for a table, say
> 999, which is allowed in SQL Server 2016. It can also support more complex
> scenario, e.g.:
> > >
> > > SELECT * FROM foo WHERE a > 100 or b < 1000;
> > > If there is an index on column a, and another index on column b, we
> may need to utilize both indexes, through Bitmap TableScan.
> > >
> > >  Bitmap Table Scan on foo
> > >    ->  BitmapOR
> > >          ->  Bitmap Index Scan on foo_index_a
> > >                Index Condition: (a > 100)
> > >          ->  Bitmap Index Scan on foo_index_b
> > >                Index Condition: (b < 1000)
> > >
> > > But still, this approach requires some non-trivial work to do.
> > >
> > > Hi Roman, I believe you definitely have consulted both approaches for
> Apache Ignite to work with indexes, you decided to go with materialized
> views, there are some reasons and tradeoffs to consider behind your
> decision. Do you mind sharing with us?
> > >
> > > Thanks,
> > > Haisheng
> > >
> > >
> > > On 2020/05/29 17:34:27, Julian Hyde <jh...@apache.org> wrote:
> > >> Materialized views are not a hack, as Vladimir claims. Materialized
> views are a fundamental concept in relational algebra, and they are an
> elegant way - in my opinion the correct way -  to model indexes (and many
> other structures).
> > >>
> > >> In Calcite materialized views are a feature in the planner that
> allows you to declare a table as equivalent to a query. (You do not need to
> issue a CREATE MATERIALIZED VIEW statement. They are internal.) Then you
> can use algebra to substitute one for the other, and apply cost-based
> optimization to choose between plans with & without the materialized view.
> > >>
> > >>
> > >> Other people have used this approach in the past. Search this list
> and you should find discussions. Also I have a talk with Maryann Xue about
> planning for indexes in Apache Phoenix[1].
> > >>
> > >> Julian
> > >>
> > >> [1]
> https://www.slideshare.net/julianhyde/costbased-query-optimization-in-apache-phoenix-using-apache-calcite
> <
> https://www.slideshare.net/julianhyde/costbased-query-optimization-in-apache-phoenix-using-apache-calcite
> >
> > >>
> > >>> On May 29, 2020, at 5:28 AM, Roman Kondakov
> <kondako...@mail.ru.INVALID> wrote:
> > >>>
> > >>> Hi Tim,
> > >>>
> > >>> In Apache Ignite we've already faced this challenge. We solved it
> using
> > >>> materialized views and FilterableTable. Let's consider your example:
> > >>>
> > >>>> select * from users where country='UK' and some_other_column='foo';
> > >>>
> > >>> with a primary index and a sorted secondary index (B+Tree?) over the
> > >>> 'country' field.
> > >>>
> > >>> As a first step Ignite registers all indexes that might be helpful
> for
> > >>> the query plan in VolcanoPlanner as materialized views. For now we
> > >>> register all indexes in all tables that participate in the query.
> > >>> Registering all indexes might be excessive, but maybe we will apply
> some
> > >>> pruning later. So it's ok for now to register all indexes.
> > >>>
> > >>> Index materialized view is very simple: it's just a sorted table
> scan:
> > >>>
> > >>> 'SELECT * from tbl ORDER BY idx_fields'
> > >>>
> > >>> After registering indexes as materialized views, the optimizer's
> search
> > >>> space will look like this:
> > >>>
> > >>> Project([*])
> > >>>  Filter[country='UK' and some_other_column='foo']
> > >>>    Set0
> > >>>
> > >>> where Set0 consists of table and index scans:
> > >>>
> > >>> TableScan('tbl', collation=[])
> > >>> IndexScan('PK', collation=[PK_field])
> > >>> IndexScan('country_idx', collation=[country])
> > >>>
> > >>> At this step we have our index scans registered in the optimizer
> within
> > >>> the same equivalence set as a TableScan.
> > >>>
> > >>> The next step is a pushing filters down to the scans. We do it with a
> > >>> rule which is similar to 'FilterTableScanRule'. After applying this
> rule
> > >>> we have a search space that is looking like this:
> > >>>
> > >>> Project([*])
> > >>>  TableScan('tbl', collation=[], filter=[country='UK' and
> > >>> some_other_column='foo'])
> > >>>  IndexScan('PK', collation=[PK_field], filter=[country='UK' and
> > >>> some_other_column='foo'])
> > >>>  IndexScan('country_idx', collation=[country], filter=[country='UK'
> and
> > >>> some_other_column='foo'])
> > >>>
> > >>> And the final step is adjusting the cost model to make it select the
> > >>> scan with the lower cost which depends on the filter conditions
> within
> > >>> the Scan. For example, full table scan with filters
> > >>>
> > >>> TableScan('tbl', collation=[], filter=[country='UK' and
> > >>> some_other_column='foo'])
> > >>>
> > >>> will cost, say, 100. Because it have to scan all rows and then filter
> > >>> out some set of them. On the other hand the index scan that can do
> index
> > >>> lookup instead of full scan will have a less cost. For example
> > >>>
> > >>> IndexScan('country_idx', collation=[country], filter=[country='UK'
> and
> > >>> some_other_column='foo'])
> > >>>
> > >>> will have cost about ~10, or so because it has a good index condition
> > >>> `country='UK'` which can be used for index lookup that that returns
> only
> > >>> 10% of rows. And therefore this IndexScan should be chosen as the
> best
> > >>> plan by the optimizer.
> > >>>
> > >>> We've recently implemented this approach in Apache Ignite and it
> works
> > >>> well for us. You can find it in [1]. This PR has many changes that
> are
> > >>> unrelated to the main topic. So particularly you can look at
> > >>> `IgnitePlanner.materializations()' method which registers indexes as
> > >>> materialized views and `IgniteTableScan` which performs filter
> > >>> conditions assessment and index scan cost estimation.
> > >>>
> > >>>
> > >>> [1] https://github.com/apache/ignite/pull/7813
> > >>>
> > >>> --
> > >>> Kind Regards
> > >>> Roman Kondakov
> > >>>
> > >>>
> > >>> On 29.05.2020 11:44, Tim Fox wrote:
> > >>>> Hi,
> > >>>>
> > >>>> I'm building a query engine with Calcite - really enjoying working
> with
> > >>>> Calcite so far!
> > >>>>
> > >>>> When creating a plan, it seems Calcite always creates a plan where
> the
> > >>>> sources are table scans, however in my implementation the tables
> can have
> > >>>> indexes on them so a table scan is not always the right choice.
> > >>>>
> > >>>> I was wondering if there was any way of making Calcite "index
> aware" - e.g.
> > >>>> perhaps providing hints to the table scan instance that, actually,
> an index
> > >>>> scan or a primary key lookup should be used instead of actually
> scanning
> > >>>> the table. E.g. On the table meta-data if we provided information
> about any
> > >>>> indexes on the table, then Calcite could figure out what parts of
> the query
> > >>>> to push to the table scan and which to keep in the rest of the plan.
> > >>>>
> > >>>> There are two specific cases I really care about:
> > >>>>
> > >>>> 1. Queries that contain a primary key lookup:
> > >>>>
> > >>>> select * from some_table where key_column=23 AND
> some_other_column='foo';
> > >>>>
> > >>>> In the above case the 'select * from some_table where
> key_column=23' can be
> > >>>> implemented as a simple PK lookup in the source table, not
> requiring a
> > >>>> scan, thus leaving just the filter corresponding to
> > >>>> 'some_other_column='foo'' in the rest of the plan
> > >>>>
> > >>>> 2. Queries with expressions on a column which has a secondary index
> > >>>>
> > >>>> select * from users where country='UK' and some_other_column='foo';
> > >>>>
> > >>>> We have many users, and let's say 10% of them are from UK (still a
> lot). We
> > >>>> have a secondary index in the country column in the source table so
> we can
> > >>>> do an efficient index scan to retrieve the matching records.
> > >>>>
> > >>>> I found this document
> > >>>> https://calcite.apache.org/docs/materialized_views.html which
> seems like it
> > >>>> might help me in some way.
> > >>>>
> > >>>> The idea being if I can think of my indexes as materialized views
> then the
> > >>>> query can be written against those materialized views as sources
> instead of
> > >>>> the original table sources. There appears to be a rule
> > >>>> 'MaterializedViewRule' that does this already (?).
> > >>>>
> > >>>> This seems to get me a bit further, however, for this approach to
> work, it
> > >>>> seems I would have to create materialized views _dynamically_ during
> > >>>> evaluation of the query, register them, rewrite the query, execute
> it, then
> > >>>> deregister the materialized view.
> > >>>>
> > >>>> E.g. for the primary key lookup example above, for the following
> query:
> > >>>>
> > >>>> select * from some_table where key_column=23 AND
> some_other_column='foo';
> > >>>>
> > >>>> I would need to dynamically create a materialized view
> corresponding to:
> > >>>>
> > >>>> select * from some_table where key_column=23
> > >>>>
> > >>>> Then rewrite the query using MaterializedViewRule.
> > >>>>
> > >>>> In the general case, in order to figure out what materialized views
> I need
> > >>>> to dynamically create I would need to examine the query, figure out
> which
> > >>>> columns in expressions have indexes on them and from them work out
> the best
> > >>>> materialized view to create based on that information. This seems
> non
> > >>>> trivial.
> > >>>>
> > >>>> Does anyone have any suggestions or pointers for how to implement
> this kind
> > >>>> of thing? I suspect I'm not the first person to have tried to do
> this, as
> > >>>> using indexes on tables seems a pretty common thing in many systems
> (?)
> > >>>>
> > >>
> > >>
> >
>


-- 

Best regards,

Xu

Re: Using indexes rather than table scans with Calcite

Reply via email to