Re: Using indexes rather than table scans with Calcite

Roman Kondakov Mon, 01 Jun 2020 12:00:53 -0700

Hi Xiening,

the example was synthetic. What is still not clear for me is how to
exploit index sortedness with a rule based approach. As far as I
understand with this approach we need to write complex rules (for
example [1]) that should decide which index is useful and which is not
useful. These rules should take into account both statistics and
collations, so they do some part of work that should be done by a cost
model. And it makes writing such rules quite a difficult task.


With a materialized views approach we can register all indexes as scans,
push filters to them if needed. And the cost model, not a rule, will
decide which index is better based on its cost and output collation.

So the benefits of rule based approach are not so obvious to me. I would
really appreciate if you could tell me in what cases rule-based approach
is better. I understand that its definitely better in scenarios when the
number of indexes is very high. But may be there are some other advantages?

Thank you!

[1]
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/index/rules/DbScanToIndexScanPrule.java

-- 
Kind Regards
Roman Kondakov


On 01.06.2020 21:00, Xiening Dai wrote:
> Hi Roman,
> 
> The example you mentioned is an advanced scenario. Note that there are 
> different types of index, such as clustered index, secondary index, covered 
> and non-covered index. In your case, typical OLTP/OLAP optimizer would create 
> an index-based join on top of the range table scan (or FilteredTableScan in 
> your term). And these transformation can definitely be based on rules. But 
> the difficult part is actually the statistics estimation and cost 
> calculation. You could end up with higher runtime cost with index based join 
> when join cardinality is high.
> 
> But back to the original question, if we’d like to leverage index on table 
> scan, I think simple rule would serve the purpose. In fact, we have 
> FilterTableScanPredicatePushdownRule in our system which does exactly the 
> same thing.
> 
>> On May 31, 2020, at 12:45 PM, Roman Kondakov <kondako...@mail.ru.INVALID> 
>> wrote:
>>
>> Hi Vladimir,
>>
>> thank you for sharing your point. Could you please clarify some details
>> with a rulse-based index selection? You said
>>
>>> the fundamental problem with "indexes as materialized
>>> views" approach is that you have to register them beforehand, instead of
>>> using them only when needed.
>>
>> I agree, it's kind of a problem. What is not clear for me with
>> IndexScanRule-based approach is how to decide when and which index we
>> need? I understand that is is pretty easy to do in the case like this:
>>
>> Filter
>>  Scan
>>
>> we can match the IndexScanRule on this pattern and do an index lookup
>> using filter condition. But what to do in the more complex scenarios?
>> Let's consider an example
>>
>> SELECT * FROM A JOIN B ON A.a=B.b WHERE A.c > 100
>>
>> where A.a, A.c and B.b are indexed fields. The logical plan for this
>> query might look like this:
>>
>> LogicalJoin(A.a=B.b)
>>  LogicalFilter(A.c > 100)
>>    LogicalScan(A)
>>  LogicalScan(B)
>>
>> as I understand (please correct me if I'm wrong), with the rule-based
>> approach, after allpying IndexScanRule the plan will look like this:
>>
>> LogicalJoin(A.a=B.b)
>>  PhysicalIndexScan(A.c, lower bound = 100)
>>  PhysicalTableScan(B)
>>
>> But in this case we lose the possibility of using index scans over A.a
>> and B.b and joining them with MergeJoin, which can be more efficient
>> plan in terms of the cost.
>>
>> My question is: how rule-based approach handle this scenario? Will it
>> re-apply IndexScanRule once again to produce PhysicalIndexScan(A.a) and
>> PhysicalIndexScan(B.b)? Or am I missing the crucial point of a rule-base
>> approach?
>>
>> Thank you in advance!
>>
>>
>> -- 
>> Kind Regards
>> Roman Kondakov
>>
>>
>> On 31.05.2020 21:39, Vladimir Ozerov wrote:
>>> As already mentioned, the fundamental problem with "indexes as materialized
>>> views" approach is that you have to register them beforehand, instead of
>>> using them only when needed. On the other hand, the complexity of index
>>> planning comes from cost estimation and predicate splitting.
>>> Materializations cannot help you with that anyhow. This is why I call this
>>> approach (not materialized views per se) "hacky" - you reuse several simple
>>> parts of the Calcite infrastructure at the cost of loss in the flexibility
>>> of the planner, while the most complicated parts still need to be
>>> implemented by hand.
>>>
>>> Materialized views could be a good fit e.g for partial indexes because in
>>> this case, Calcite could help you with complex subsumption mechanics. But
>>> for standard indexes, the pros/cons balance is not that obvious.
>>>
>>> вс, 31 мая 2020 г. в 19:28, xu <xuzh1...@gmail.com>:
>>>
>>>> Hi Tim,
>>>>
>>>> I am working on MySQL InnoDB adapter and trying to introduce this to
>>>> Calcite, currently it is only in early stage, and not approved/reviewed by
>>>> committers yet. Anyway, we are facing the same problem like what index to
>>>> use, how to push down order by operation, etc. I have developed a simple
>>>> rule based adapter to be "index aware" and being able to leverage a MySQL
>>>> InnoDB storage engine written in Java. Hope this will help you to explore
>>>> more options.
>>>>
>>>> https://issues.apache.org/jira/browse/CALCITE-4034
>>>>
>>>> Thanks,
>>>> Xu
>>>>
>>>> Haisheng Yuan <hy...@apache.org> 于2020年5月31日周日 下午10:06写道：
>>>>
>>>>> Hi Roman,
>>>>>
>>>>> Thank you for sharing your thoughts.
>>>>>
>>>>>> It can be very tricky because the rule should consider not
>>>>>> only filters, but also collations. This leads to increasing the
>>>>>> complexity of such rules.
>>>>>
>>>>> Logical transformation rules like FilterTableScan2IndexTableScanRule
>>>>> should not consider physical properties, like collation, distribution.
>>>> You
>>>>> forgot that we just reached consensus in CALCITE-3972. :)
>>>>>
>>>>> Regarding the 2nd option that uses index b, it is indeed not that easy
>>>> for
>>>>> Calcite 1.22.0. In latest version, it now becomes possible. After rule
>>>>> transformation, during top-down trait request, the collation is passed
>>>>> through Filter, down to physical TableScan, which accepts the trait
>>>> request
>>>>> with collation on b, find there is an index on b, and return a new
>>>> RelNode
>>>>> IndexScan. This process can be done in
>>>>> EnumerableTableScan#passThrough(required).
>>>>>
>>>>>> I hope, when the Cascades-style optimizer become a part of Calcite, the
>>>>> search space
>>>>>> pollution will not be a serious issue anymore.
>>>>>
>>>>> I hope so too. Top-down style can help alleviate space pollution by space
>>>>> pruning. But the space pollution caused by LogicalProject operator and
>>>> its
>>>>> related rules still can't be avoided. :)
>>>>>
>>>>> Again, thanks for sharing your experience with us.
>>>>>
>>>>> Haisheng
>>>>>
>>>>> On 2020/05/31 09:58:36, Roman Kondakov <kondako...@mail.ru.INVALID>
>>>>> wrote:
>>>>>> Hi Haisheng,
>>>>>>
>>>>>> The basic rationale behind the using materialized views for secondary
>>>>>> index representation instead of special rules like mentioned
>>>>>> FilterTableScan2IndexTableScanRule is the simplicity of implementation.
>>>>>>
>>>>>> You are absolutely right that materialized views approach has an
>>>> obvious
>>>>>> drawback that it should register all indexes as materialized views in
>>>>>> the optimizer's search space. But we expect our users will not overuse
>>>>>> indexes because in general having too many indexes is a bad practice:
>>>> on
>>>>>> each table update we also should update it's indexes and it can cause
>>>>>> some performance degradation of the system as a whole, not only
>>>>>> optimizer. So we expect the number of indexes will be relatively small.
>>>>>>
>>>>>> Ignite uses indexes not only for index lookups, but also for exploiting
>>>>>> it's sortedness. In this case materialized view's approach can show
>>>> some
>>>>>> advantages. Let's consider the example:
>>>>>>
>>>>>> SELECT * FROM foo WHERE a > 100 ORDER BY b;
>>>>>>
>>>>>> where both fields 'a' and 'b' are indexed. In this case we will have
>>>> two
>>>>>> alternatives of the query execution:
>>>>>>
>>>>>> 1. Use index 'a' for index conditioned scan and then sort rows by 'b'
>>>>>> 2. Use index scan over 'b' and then apply filter over 'a' - here we can
>>>>>> avoid sorting, because index over 'b' is already sorted.
>>>>>>
>>>>>> If I understand the approach with FilterTableScan2IndexTableScanRule
>>>>>> correctly, at this step the rule should make a decision about which
>>>>>> index to use. It can be very tricky because the rule should consider
>>>> not
>>>>>> only filters, but also collations. This leads to increasing the
>>>>>> complexity of such rules. With materialized views approach we just
>>>>>> register all indexes with their caollation and let the cost model do
>>>> its
>>>>>> job. Our rules are very simple. The complexity is encapsulated in the
>>>>>> index scan cost estimation.
>>>>>>
>>>>>> As for your example with disjunctive predicate:
>>>>>>
>>>>>>> SELECT * FROM foo WHERE a > 100 or b < 1000;
>>>>>>
>>>>>> I think with materialized views approach we can do a complex logic as
>>>>>> well. For example, we may implement a special rule for such cases:
>>>>>> BitmapOrRule. Or, even better, we will rewrite such predicates to a
>>>>>> UNION ALL clause. A very good explanation and a comparison of both
>>>>>> approaches I've found in this Oracle's blog post [1].
>>>>>>
>>>>>> As a conclusion: choosing between materialized views approach and
>>>>>> IndexRule-based approach is a trade-off between complexity of
>>>>>> implementation and search space pollution. I hope, when the
>>>>>> Cascades-style optimizer become a part of Calcite, the search space
>>>>>> pollution will not be a serious issue anymore.
>>>>>>
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>>> [1]
>>>>>>
>>>>>
>>>> https://blogs.oracle.com/optimizer/optimizer-transformations:-or-expansion
>>>>>>
>>>>>> --
>>>>>> Kind Regards
>>>>>> Roman Kondakov
>>>>>>
>>>>>>
>>>>>> On 31.05.2020 07:31, Haisheng Yuan wrote:
>>>>>>> Thanks Julian and Roman for sharing the experiences for modeling
>>>>> indexes.
>>>>>>>
>>>>>>> Besides using materialized views, which is already proven by Phoenix
>>>>> and Ignite, there is another approach, as mentioned by Vladimir, define
>>>>> your own rules and indexscan operators.
>>>>>>>
>>>>>>> FilterTableScan2IndexScanRule and its variances, which match Filter
>>>>> over TableScan, create an IndexScan if the table has corresponding index
>>>> on
>>>>> the filter column. You don't have to register different tablescans for
>>>> all
>>>>> the indexes, in case you have nearly thousands indexes for a table, say
>>>>> 999, which is allowed in SQL Server 2016. It can also support more
>>>> complex
>>>>> scenario, e.g.:
>>>>>>>
>>>>>>> SELECT * FROM foo WHERE a > 100 or b < 1000;
>>>>>>> If there is an index on column a, and another index on column b, we
>>>>> may need to utilize both indexes, through Bitmap TableScan.
>>>>>>>
>>>>>>> Bitmap Table Scan on foo
>>>>>>>   ->  BitmapOR
>>>>>>>         ->  Bitmap Index Scan on foo_index_a
>>>>>>>               Index Condition: (a > 100)
>>>>>>>         ->  Bitmap Index Scan on foo_index_b
>>>>>>>               Index Condition: (b < 1000)
>>>>>>>
>>>>>>> But still, this approach requires some non-trivial work to do.
>>>>>>>
>>>>>>> Hi Roman, I believe you definitely have consulted both approaches for
>>>>> Apache Ignite to work with indexes, you decided to go with materialized
>>>>> views, there are some reasons and tradeoffs to consider behind your
>>>>> decision. Do you mind sharing with us?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Haisheng
>>>>>>>
>>>>>>>
>>>>>>> On 2020/05/29 17:34:27, Julian Hyde <jh...@apache.org> wrote:
>>>>>>>> Materialized views are not a hack, as Vladimir claims. Materialized
>>>>> views are a fundamental concept in relational algebra, and they are an
>>>>> elegant way - in my opinion the correct way -  to model indexes (and many
>>>>> other structures).
>>>>>>>>
>>>>>>>> In Calcite materialized views are a feature in the planner that
>>>>> allows you to declare a table as equivalent to a query. (You do not need
>>>> to
>>>>> issue a CREATE MATERIALIZED VIEW statement. They are internal.) Then you
>>>>> can use algebra to substitute one for the other, and apply cost-based
>>>>> optimization to choose between plans with & without the materialized
>>>> view.
>>>>>>>>
>>>>>>>>
>>>>>>>> Other people have used this approach in the past. Search this list
>>>>> and you should find discussions. Also I have a talk with Maryann Xue
>>>> about
>>>>> planning for indexes in Apache Phoenix[1].
>>>>>>>>
>>>>>>>> Julian
>>>>>>>>
>>>>>>>> [1]
>>>>>
>>>> https://www.slideshare.net/julianhyde/costbased-query-optimization-in-apache-phoenix-using-apache-calcite
>>>>> <
>>>>>
>>>> https://www.slideshare.net/julianhyde/costbased-query-optimization-in-apache-phoenix-using-apache-calcite
>>>>>>
>>>>>>>>
>>>>>>>>> On May 29, 2020, at 5:28 AM, Roman Kondakov
>>>>> <kondako...@mail.ru.INVALID> wrote:
>>>>>>>>>
>>>>>>>>> Hi Tim,
>>>>>>>>>
>>>>>>>>> In Apache Ignite we've already faced this challenge. We solved it
>>>>> using
>>>>>>>>> materialized views and FilterableTable. Let's consider your
>>>> example:
>>>>>>>>>
>>>>>>>>>> select * from users where country='UK' and
>>>> some_other_column='foo';
>>>>>>>>>
>>>>>>>>> with a primary index and a sorted secondary index (B+Tree?) over
>>>> the
>>>>>>>>> 'country' field.
>>>>>>>>>
>>>>>>>>> As a first step Ignite registers all indexes that might be helpful
>>>>> for
>>>>>>>>> the query plan in VolcanoPlanner as materialized views. For now we
>>>>>>>>> register all indexes in all tables that participate in the query.
>>>>>>>>> Registering all indexes might be excessive, but maybe we will apply
>>>>> some
>>>>>>>>> pruning later. So it's ok for now to register all indexes.
>>>>>>>>>
>>>>>>>>> Index materialized view is very simple: it's just a sorted table
>>>>> scan:
>>>>>>>>>
>>>>>>>>> 'SELECT * from tbl ORDER BY idx_fields'
>>>>>>>>>
>>>>>>>>> After registering indexes as materialized views, the optimizer's
>>>>> search
>>>>>>>>> space will look like this:
>>>>>>>>>
>>>>>>>>> Project([*])
>>>>>>>>> Filter[country='UK' and some_other_column='foo']
>>>>>>>>>   Set0
>>>>>>>>>
>>>>>>>>> where Set0 consists of table and index scans:
>>>>>>>>>
>>>>>>>>> TableScan('tbl', collation=[])
>>>>>>>>> IndexScan('PK', collation=[PK_field])
>>>>>>>>> IndexScan('country_idx', collation=[country])
>>>>>>>>>
>>>>>>>>> At this step we have our index scans registered in the optimizer
>>>>> within
>>>>>>>>> the same equivalence set as a TableScan.
>>>>>>>>>
>>>>>>>>> The next step is a pushing filters down to the scans. We do it
>>>> with a
>>>>>>>>> rule which is similar to 'FilterTableScanRule'. After applying this
>>>>> rule
>>>>>>>>> we have a search space that is looking like this:
>>>>>>>>>
>>>>>>>>> Project([*])
>>>>>>>>> TableScan('tbl', collation=[], filter=[country='UK' and
>>>>>>>>> some_other_column='foo'])
>>>>>>>>> IndexScan('PK', collation=[PK_field], filter=[country='UK' and
>>>>>>>>> some_other_column='foo'])
>>>>>>>>> IndexScan('country_idx', collation=[country], filter=[country='UK'
>>>>> and
>>>>>>>>> some_other_column='foo'])
>>>>>>>>>
>>>>>>>>> And the final step is adjusting the cost model to make it select
>>>> the
>>>>>>>>> scan with the lower cost which depends on the filter conditions
>>>>> within
>>>>>>>>> the Scan. For example, full table scan with filters
>>>>>>>>>
>>>>>>>>> TableScan('tbl', collation=[], filter=[country='UK' and
>>>>>>>>> some_other_column='foo'])
>>>>>>>>>
>>>>>>>>> will cost, say, 100. Because it have to scan all rows and then
>>>> filter
>>>>>>>>> out some set of them. On the other hand the index scan that can do
>>>>> index
>>>>>>>>> lookup instead of full scan will have a less cost. For example
>>>>>>>>>
>>>>>>>>> IndexScan('country_idx', collation=[country], filter=[country='UK'
>>>>> and
>>>>>>>>> some_other_column='foo'])
>>>>>>>>>
>>>>>>>>> will have cost about ~10, or so because it has a good index
>>>> condition
>>>>>>>>> `country='UK'` which can be used for index lookup that that returns
>>>>> only
>>>>>>>>> 10% of rows. And therefore this IndexScan should be chosen as the
>>>>> best
>>>>>>>>> plan by the optimizer.
>>>>>>>>>
>>>>>>>>> We've recently implemented this approach in Apache Ignite and it
>>>>> works
>>>>>>>>> well for us. You can find it in [1]. This PR has many changes that
>>>>> are
>>>>>>>>> unrelated to the main topic. So particularly you can look at
>>>>>>>>> `IgnitePlanner.materializations()' method which registers indexes
>>>> as
>>>>>>>>> materialized views and `IgniteTableScan` which performs filter
>>>>>>>>> conditions assessment and index scan cost estimation.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [1] https://github.com/apache/ignite/pull/7813
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Kind Regards
>>>>>>>>> Roman Kondakov
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 29.05.2020 11:44, Tim Fox wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I'm building a query engine with Calcite - really enjoying working
>>>>> with
>>>>>>>>>> Calcite so far!
>>>>>>>>>>
>>>>>>>>>> When creating a plan, it seems Calcite always creates a plan where
>>>>> the
>>>>>>>>>> sources are table scans, however in my implementation the tables
>>>>> can have
>>>>>>>>>> indexes on them so a table scan is not always the right choice.
>>>>>>>>>>
>>>>>>>>>> I was wondering if there was any way of making Calcite "index
>>>>> aware" - e.g.
>>>>>>>>>> perhaps providing hints to the table scan instance that, actually,
>>>>> an index
>>>>>>>>>> scan or a primary key lookup should be used instead of actually
>>>>> scanning
>>>>>>>>>> the table. E.g. On the table meta-data if we provided information
>>>>> about any
>>>>>>>>>> indexes on the table, then Calcite could figure out what parts of
>>>>> the query
>>>>>>>>>> to push to the table scan and which to keep in the rest of the
>>>> plan.
>>>>>>>>>>
>>>>>>>>>> There are two specific cases I really care about:
>>>>>>>>>>
>>>>>>>>>> 1. Queries that contain a primary key lookup:
>>>>>>>>>>
>>>>>>>>>> select * from some_table where key_column=23 AND
>>>>> some_other_column='foo';
>>>>>>>>>>
>>>>>>>>>> In the above case the 'select * from some_table where
>>>>> key_column=23' can be
>>>>>>>>>> implemented as a simple PK lookup in the source table, not
>>>>> requiring a
>>>>>>>>>> scan, thus leaving just the filter corresponding to
>>>>>>>>>> 'some_other_column='foo'' in the rest of the plan
>>>>>>>>>>
>>>>>>>>>> 2. Queries with expressions on a column which has a secondary
>>>> index
>>>>>>>>>>
>>>>>>>>>> select * from users where country='UK' and
>>>> some_other_column='foo';
>>>>>>>>>>
>>>>>>>>>> We have many users, and let's say 10% of them are from UK (still a
>>>>> lot). We
>>>>>>>>>> have a secondary index in the country column in the source table
>>>> so
>>>>> we can
>>>>>>>>>> do an efficient index scan to retrieve the matching records.
>>>>>>>>>>
>>>>>>>>>> I found this document
>>>>>>>>>> https://calcite.apache.org/docs/materialized_views.html which
>>>>> seems like it
>>>>>>>>>> might help me in some way.
>>>>>>>>>>
>>>>>>>>>> The idea being if I can think of my indexes as materialized views
>>>>> then the
>>>>>>>>>> query can be written against those materialized views as sources
>>>>> instead of
>>>>>>>>>> the original table sources. There appears to be a rule
>>>>>>>>>> 'MaterializedViewRule' that does this already (?).
>>>>>>>>>>
>>>>>>>>>> This seems to get me a bit further, however, for this approach to
>>>>> work, it
>>>>>>>>>> seems I would have to create materialized views _dynamically_
>>>> during
>>>>>>>>>> evaluation of the query, register them, rewrite the query, execute
>>>>> it, then
>>>>>>>>>> deregister the materialized view.
>>>>>>>>>>
>>>>>>>>>> E.g. for the primary key lookup example above, for the following
>>>>> query:
>>>>>>>>>>
>>>>>>>>>> select * from some_table where key_column=23 AND
>>>>> some_other_column='foo';
>>>>>>>>>>
>>>>>>>>>> I would need to dynamically create a materialized view
>>>>> corresponding to:
>>>>>>>>>>
>>>>>>>>>> select * from some_table where key_column=23
>>>>>>>>>>
>>>>>>>>>> Then rewrite the query using MaterializedViewRule.
>>>>>>>>>>
>>>>>>>>>> In the general case, in order to figure out what materialized
>>>> views
>>>>> I need
>>>>>>>>>> to dynamically create I would need to examine the query, figure
>>>> out
>>>>> which
>>>>>>>>>> columns in expressions have indexes on them and from them work out
>>>>> the best
>>>>>>>>>> materialized view to create based on that information. This seems
>>>>> non
>>>>>>>>>> trivial.
>>>>>>>>>>
>>>>>>>>>> Does anyone have any suggestions or pointers for how to implement
>>>>> this kind
>>>>>>>>>> of thing? I suspect I'm not the first person to have tried to do
>>>>> this, as
>>>>>>>>>> using indexes on tables seems a pretty common thing in many
>>>> systems
>>>>> (?)
>>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Best regards,
>>>>
>>>> Xu
>>>>
>>>
>

Re: Using indexes rather than table scans with Calcite

Reply via email to