Hi Haisheng,

The basic rationale behind using materialized views for secondary
index representation, instead of special rules like the mentioned
FilterTableScan2IndexScanRule, is simplicity of implementation.

You are absolutely right that the materialized views approach has an
obvious drawback: it has to register all indexes as materialized views
in the optimizer's search space. But we expect our users will not
overuse indexes, because having too many indexes is generally a bad
practice: on each table update we also have to update its indexes,
which can degrade the performance of the system as a whole, not only
the optimizer. So we expect the number of indexes to be relatively small.

Ignite uses indexes not only for index lookups, but also for exploiting
their sortedness. In this case the materialized views approach can show
some advantages. Let's consider the example:

SELECT * FROM foo WHERE a > 100 ORDER BY b;

where both fields 'a' and 'b' are indexed. In this case we have two
alternative ways to execute the query (both are sketched below):

1. Use the index on 'a' for an index-conditioned scan and then sort the
rows by 'b'.
2. Use an index scan over 'b' and then apply the filter over 'a' - here
we can avoid sorting, because the index over 'b' is already sorted.
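
Roughly, these two alternatives correspond to the following plan shapes
(the notation mirrors the IndexScan examples quoted below; the index
names 'idx_a' and 'idx_b' are hypothetical):

1. Sort(collation=[b])
     IndexScan('idx_a', collation=[a], filter=[a > 100])

2. IndexScan('idx_b', collation=[b], filter=[a > 100])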

If I understand the approach with FilterTableScan2IndexScanRule
correctly, at this step the rule has to decide which index to use. This
can be very tricky because the rule should consider not only filters,
but also collations, which increases the complexity of such rules. With
the materialized views approach we just register all indexes with their
collations and let the cost model do its job. Our rules are very
simple; the complexity is encapsulated in the index scan cost
estimation.
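
For the ORDER BY example above, the choice boils down to comparing the
total costs of the two plan shapes sketched earlier, for instance (the
numbers are purely illustrative):

plan 1 (index lookup on 'a', then Sort by 'b'):     cost ~ 150
plan 2 (sorted scan over 'b', then filter on 'a'):  cost ~ 110

Here plan 2 wins, but if the filter 'a > 100' were highly selective,
plan 1 would win instead, because only a few rows would need to be
sorted. That is exactly the kind of decision that is easy to leave to
the cost model and hard to hard-code in a rule.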

As for your example with a disjunctive predicate:

> SELECT * FROM foo WHERE a > 100 or b < 1000;

I think with the materialized views approach we can handle such complex
logic as well. For example, we may implement a special rule for such
cases (a BitmapOrRule) or, even better, rewrite such predicates into a
UNION ALL. I found a very good explanation and comparison of both
approaches in this Oracle blog post [1].
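
As an illustration only (this is not something Ignite does today), the
OR-expansion of the query above could look like this, with each branch
free to use its own index; the second branch negates the predicate of
the first one, with NULLs handled explicitly, so that no row is emitted
twice:

SELECT * FROM foo WHERE a > 100
UNION ALL
SELECT * FROM foo WHERE b < 1000 AND (a <= 100 OR a IS NULL);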

In conclusion: choosing between the materialized views approach and the
IndexRule-based approach is a trade-off between implementation
complexity and search space pollution. I hope that when the
Cascades-style optimizer becomes a part of Calcite, search space
pollution will no longer be a serious issue.


Thanks!


[1]
https://blogs.oracle.com/optimizer/optimizer-transformations:-or-expansion

-- 
Kind Regards
Roman Kondakov


On 31.05.2020 07:31, Haisheng Yuan wrote:
> Thanks Julian and Roman for sharing the experiences for modeling indexes.
> 
> Besides using materialized views, which is already proven by Phoenix and 
> Ignite, there is another approach, as mentioned by Vladimir, define your own 
> rules and indexscan operators.
> 
> FilterTableScan2IndexScanRule and its variants, which match Filter over 
> TableScan, create an IndexScan if the table has a corresponding index on the 
> filter column. You don't have to register different tablescans for all the 
> indexes, in case you have nearly a thousand indexes for a table, say 999, 
> which is allowed in SQL Server 2016. It can also support more complex 
> scenarios, e.g.:
> 
> SELECT * FROM foo WHERE a > 100 or b < 1000;
> If there is an index on column a, and another index on column b, we may need 
> to utilize both indexes, through Bitmap TableScan.
> 
>  Bitmap Table Scan on foo
>    ->  BitmapOR
>          ->  Bitmap Index Scan on foo_index_a
>                Index Condition: (a > 100)
>          ->  Bitmap Index Scan on foo_index_b
>                Index Condition: (b < 1000)
> 
> But still, this approach requires some non-trivial work to do.
> 
> Hi Roman, I believe you definitely have consulted both approaches for Apache 
> Ignite to work with indexes, you decided to go with materialized views, there 
> are some reasons and tradeoffs to consider behind your decision. Do you mind 
> sharing with us?
> 
> Thanks,
> Haisheng
> 
> 
> On 2020/05/29 17:34:27, Julian Hyde <jh...@apache.org> wrote: 
>> Materialized views are not a hack, as Vladimir claims. Materialized views 
>> are a fundamental concept in relational algebra, and they are an elegant way 
>> - in my opinion the correct way -  to model indexes (and many other 
>> structures).
>>
>> In Calcite materialized views are a feature in the planner that allows you 
>> to declare a table as equivalent to a query. (You do not need to issue a 
>> CREATE MATERIALIZED VIEW statement. They are internal.) Then you can use 
>> algebra to substitute one for the other, and apply cost-based optimization 
>> to choose between plans with & without the materialized view.
>>
>>
>> Other people have used this approach in the past. Search this list and you 
>> should find discussions. Also I have a talk with Maryann Xue about planning 
>> for indexes in Apache Phoenix[1].
>>
>> Julian
>>
>> [1] 
>> https://www.slideshare.net/julianhyde/costbased-query-optimization-in-apache-phoenix-using-apache-calcite
>>  
>> <https://www.slideshare.net/julianhyde/costbased-query-optimization-in-apache-phoenix-using-apache-calcite>
>>
>>> On May 29, 2020, at 5:28 AM, Roman Kondakov <kondako...@mail.ru.INVALID> 
>>> wrote:
>>>
>>> Hi Tim,
>>>
>>> In Apache Ignite we've already faced this challenge. We solved it using
>>> materialized views and FilterableTable. Let's consider your example:
>>>
>>>> select * from users where country='UK' and some_other_column='foo';
>>>
>>> with a primary index and a sorted secondary index (B+Tree?) over the
>>> 'country' field.
>>>
>>> As a first step Ignite registers all indexes that might be helpful for
>>> the query plan in VolcanoPlanner as materialized views. For now we
>>> register all indexes in all tables that participate in the query.
>>> Registering all indexes might be excessive, but maybe we will apply some
>>> pruning later. So it's ok for now to register all indexes.
>>>
>>> Index materialized view is very simple: it's just a sorted table scan:
>>>
>>> 'SELECT * from tbl ORDER BY idx_fields'
>>>
>>> After registering indexes as materialized views, the optimizer's search
>>> space will look like this:
>>>
>>> Project([*])
>>>  Filter[country='UK' and some_other_column='foo']
>>>    Set0
>>>
>>> where Set0 consists of table and index scans:
>>>
>>> TableScan('tbl', collation=[])
>>> IndexScan('PK', collation=[PK_field])
>>> IndexScan('country_idx', collation=[country])
>>>
>>> At this step we have our index scans registered in the optimizer within
>>> the same equivalence set as a TableScan.
>>>
>>> The next step is pushing filters down to the scans. We do it with a
>>> rule which is similar to 'FilterTableScanRule'. After applying this rule
>>> we have a search space that looks like this:
>>>
>>> Project([*])
>>>  TableScan('tbl', collation=[], filter=[country='UK' and
>>> some_other_column='foo'])
>>>  IndexScan('PK', collation=[PK_field], filter=[country='UK' and
>>> some_other_column='foo'])
>>>  IndexScan('country_idx', collation=[country], filter=[country='UK' and
>>> some_other_column='foo'])
>>>
>>> And the final step is adjusting the cost model to make it select the
>>> scan with the lower cost which depends on the filter conditions within
>>> the Scan. For example, full table scan with filters
>>>
>>> TableScan('tbl', collation=[], filter=[country='UK' and
>>> some_other_column='foo'])
>>>
>>> will cost, say, 100, because it has to scan all rows and then filter
>>> out some of them. On the other hand, an index scan that can do an index
>>> lookup instead of a full scan will have a lower cost. For example
>>>
>>> IndexScan('country_idx', collation=[country], filter=[country='UK' and
>>> some_other_column='foo'])
>>>
>>> will have a cost of about 10, because it has a good index condition
>>> `country='UK'` which can be used for an index lookup that returns only
>>> 10% of the rows. Therefore this IndexScan should be chosen as the best
>>> plan by the optimizer.
>>>
>>> We've recently implemented this approach in Apache Ignite and it works
>>> well for us. You can find it in [1]. This PR has many changes that are
>>> unrelated to the main topic. So particularly you can look at
>>> `IgnitePlanner.materializations()' method which registers indexes as
>>> materialized views and `IgniteTableScan` which performs filter
>>> conditions assessment and index scan cost estimation.
>>>
>>>
>>> [1] https://github.com/apache/ignite/pull/7813
>>>
>>> -- 
>>> Kind Regards
>>> Roman Kondakov
>>>
>>>
>>> On 29.05.2020 11:44, Tim Fox wrote:
>>>> Hi,
>>>>
>>>> I'm building a query engine with Calcite - really enjoying working with
>>>> Calcite so far!
>>>>
>>>> When creating a plan, it seems Calcite always creates a plan where the
>>>> sources are table scans, however in my implementation the tables can have
>>>> indexes on them so a table scan is not always the right choice.
>>>>
>>>> I was wondering if there was any way of making Calcite "index aware" - e.g.
>>>> perhaps providing hints to the table scan instance that, actually, an index
>>>> scan or a primary key lookup should be used instead of actually scanning
>>>> the table. E.g. On the table meta-data if we provided information about any
>>>> indexes on the table, then Calcite could figure out what parts of the query
>>>> to push to the table scan and which to keep in the rest of the plan.
>>>>
>>>> There are two specific cases I really care about:
>>>>
>>>> 1. Queries that contain a primary key lookup:
>>>>
>>>> select * from some_table where key_column=23 AND some_other_column='foo';
>>>>
>>>> In the above case the 'select * from some_table where key_column=23' can be
>>>> implemented as a simple PK lookup in the source table, not requiring a
>>>> scan, thus leaving just the filter corresponding to
>>>> 'some_other_column='foo'' in the rest of the plan
>>>>
>>>> 2. Queries with expressions on a column which has a secondary index
>>>>
>>>> select * from users where country='UK' and some_other_column='foo';
>>>>
>>>> We have many users, and let's say 10% of them are from UK (still a lot). We
>>>> have a secondary index in the country column in the source table so we can
>>>> do an efficient index scan to retrieve the matching records.
>>>>
>>>> I found this document
>>>> https://calcite.apache.org/docs/materialized_views.html which seems like it
>>>> might help me in some way.
>>>>
>>>> The idea being if I can think of my indexes as materialized views then the
>>>> query can be written against those materialized views as sources instead of
>>>> the original table sources. There appears to be a rule
>>>> 'MaterializedViewRule' that does this already (?).
>>>>
>>>> This seems to get me a bit further, however, for this approach to work, it
>>>> seems I would have to create materialized views _dynamically_ during
>>>> evaluation of the query, register them, rewrite the query, execute it, then
>>>> deregister the materialized view.
>>>>
>>>> E.g. for the primary key lookup example above, for the following query:
>>>>
>>>> select * from some_table where key_column=23 AND some_other_column='foo';
>>>>
>>>> I would need to dynamically create a materialized view corresponding to:
>>>>
>>>> select * from some_table where key_column=23
>>>>
>>>> Then rewrite the query using MaterializedViewRule.
>>>>
>>>> In the general case, in order to figure out what materialized views I need
>>>> to dynamically create I would need to examine the query, figure out which
>>>> columns in expressions have indexes on them and from them work out the best
>>>> materialized view to create based on that information. This seems non
>>>> trivial.
>>>>
>>>> Does anyone have any suggestions or pointers for how to implement this kind
>>>> of thing? I suspect I'm not the first person to have tried to do this, as
>>>> using indexes on tables seems a pretty common thing in many systems (?)
>>>>
>>
>>
