Hi Tim, I am working on MySQL InnoDB adapter and trying to introduce this to Calcite, currently it is only in early stage, and not approved/reviewed by committers yet. Anyway, we are facing the same problem like what index to use, how to push down order by operation, etc. I have developed a simple rule based adapter to be "index aware" and being able to leverage a MySQL InnoDB storage engine written in Java. Hope this will help you to explore more options.
https://issues.apache.org/jira/browse/CALCITE-4034 Thanks, Xu Haisheng Yuan <hy...@apache.org> 于2020年5月31日周日 下午10:06写道: > Hi Roman, > > Thank you for sharing your thoughts. > > > It can be very tricky because the rule should consider not > > only filters, but also collations. This leads to increasing the > > complexity of such rules. > > Logical transformation rules like FilterTableScan2IndexTableScanRule > should not consider physical properties, like collation, distribution. You > forgot that we just reached consensus in CALCITE-3972. :) > > Regarding the 2nd option that uses index b, it is indeed not that easy for > Calcite 1.22.0. In latest version, it now becomes possible. After rule > transformation, during top-down trait request, the collation is passed > through Filter, down to physical TableScan, which accepts the trait request > with collation on b, find there is an index on b, and return a new RelNode > IndexScan. This process can be done in > EnumerableTableScan#passThrough(required). > > > I hope, when the Cascades-style optimizer become a part of Calcite, the > search space > > pollution will not be a serious issue anymore. > > I hope so too. Top-down style can help alleviate space pollution by space > pruning. But the space pollution caused by LogicalProject operator and its > related rules still can't be avoided. :) > > Again, thanks for sharing your experience with us. > > Haisheng > > On 2020/05/31 09:58:36, Roman Kondakov <kondako...@mail.ru.INVALID> > wrote: > > Hi Haisheng, > > > > The basic rationale behind the using materialized views for secondary > > index representation instead of special rules like mentioned > > FilterTableScan2IndexTableScanRule is the simplicity of implementation. > > > > You are absolutely right that materialized views approach has an obvious > > drawback that it should register all indexes as materialized views in > > the optimizer's search space. But we expect our users will not overuse > > indexes because in general having too many indexes is a bad practice: on > > each table update we also should update it's indexes and it can cause > > some performance degradation of the system as a whole, not only > > optimizer. So we expect the number of indexes will be relatively small. > > > > Ignite uses indexes not only for index lookups, but also for exploiting > > it's sortedness. In this case materialized view's approach can show some > > advantages. Let's consider the example: > > > > SELECT * FROM foo WHERE a > 100 ORDER BY b; > > > > where both fields 'a' and 'b' are indexed. In this case we will have two > > alternatives of the query execution: > > > > 1. Use index 'a' for index conditioned scan and then sort rows by 'b' > > 2. Use index scan over 'b' and then apply filter over 'a' - here we can > > avoid sorting, because index over 'b' is already sorted. > > > > If I understand the approach with FilterTableScan2IndexTableScanRule > > correctly, at this step the rule should make a decision about which > > index to use. It can be very tricky because the rule should consider not > > only filters, but also collations. This leads to increasing the > > complexity of such rules. With materialized views approach we just > > register all indexes with their caollation and let the cost model do its > > job. Our rules are very simple. The complexity is encapsulated in the > > index scan cost estimation. > > > > As for your example with disjunctive predicate: > > > > > SELECT * FROM foo WHERE a > 100 or b < 1000; > > > > I think with materialized views approach we can do a complex logic as > > well. For example, we may implement a special rule for such cases: > > BitmapOrRule. Or, even better, we will rewrite such predicates to a > > UNION ALL clause. A very good explanation and a comparison of both > > approaches I've found in this Oracle's blog post [1]. > > > > As a conclusion: choosing between materialized views approach and > > IndexRule-based approach is a trade-off between complexity of > > implementation and search space pollution. I hope, when the > > Cascades-style optimizer become a part of Calcite, the search space > > pollution will not be a serious issue anymore. > > > > > > Thanks! > > > > > > [1] > > > https://blogs.oracle.com/optimizer/optimizer-transformations:-or-expansion > > > > -- > > Kind Regards > > Roman Kondakov > > > > > > On 31.05.2020 07:31, Haisheng Yuan wrote: > > > Thanks Julian and Roman for sharing the experiences for modeling > indexes. > > > > > > Besides using materialized views, which is already proven by Phoenix > and Ignite, there is another approach, as mentioned by Vladimir, define > your own rules and indexscan operators. > > > > > > FilterTableScan2IndexScanRule and its variances, which match Filter > over TableScan, create an IndexScan if the table has corresponding index on > the filter column. You don't have to register different tablescans for all > the indexes, in case you have nearly thousands indexes for a table, say > 999, which is allowed in SQL Server 2016. It can also support more complex > scenario, e.g.: > > > > > > SELECT * FROM foo WHERE a > 100 or b < 1000; > > > If there is an index on column a, and another index on column b, we > may need to utilize both indexes, through Bitmap TableScan. > > > > > > Bitmap Table Scan on foo > > > -> BitmapOR > > > -> Bitmap Index Scan on foo_index_a > > > Index Condition: (a > 100) > > > -> Bitmap Index Scan on foo_index_b > > > Index Condition: (b < 1000) > > > > > > But still, this approach requires some non-trivial work to do. > > > > > > Hi Roman, I believe you definitely have consulted both approaches for > Apache Ignite to work with indexes, you decided to go with materialized > views, there are some reasons and tradeoffs to consider behind your > decision. Do you mind sharing with us? > > > > > > Thanks, > > > Haisheng > > > > > > > > > On 2020/05/29 17:34:27, Julian Hyde <jh...@apache.org> wrote: > > >> Materialized views are not a hack, as Vladimir claims. Materialized > views are a fundamental concept in relational algebra, and they are an > elegant way - in my opinion the correct way - to model indexes (and many > other structures). > > >> > > >> In Calcite materialized views are a feature in the planner that > allows you to declare a table as equivalent to a query. (You do not need to > issue a CREATE MATERIALIZED VIEW statement. They are internal.) Then you > can use algebra to substitute one for the other, and apply cost-based > optimization to choose between plans with & without the materialized view. > > >> > > >> > > >> Other people have used this approach in the past. Search this list > and you should find discussions. Also I have a talk with Maryann Xue about > planning for indexes in Apache Phoenix[1]. > > >> > > >> Julian > > >> > > >> [1] > https://www.slideshare.net/julianhyde/costbased-query-optimization-in-apache-phoenix-using-apache-calcite > < > https://www.slideshare.net/julianhyde/costbased-query-optimization-in-apache-phoenix-using-apache-calcite > > > > >> > > >>> On May 29, 2020, at 5:28 AM, Roman Kondakov > <kondako...@mail.ru.INVALID> wrote: > > >>> > > >>> Hi Tim, > > >>> > > >>> In Apache Ignite we've already faced this challenge. We solved it > using > > >>> materialized views and FilterableTable. Let's consider your example: > > >>> > > >>>> select * from users where country='UK' and some_other_column='foo'; > > >>> > > >>> with a primary index and a sorted secondary index (B+Tree?) over the > > >>> 'country' field. > > >>> > > >>> As a first step Ignite registers all indexes that might be helpful > for > > >>> the query plan in VolcanoPlanner as materialized views. For now we > > >>> register all indexes in all tables that participate in the query. > > >>> Registering all indexes might be excessive, but maybe we will apply > some > > >>> pruning later. So it's ok for now to register all indexes. > > >>> > > >>> Index materialized view is very simple: it's just a sorted table > scan: > > >>> > > >>> 'SELECT * from tbl ORDER BY idx_fields' > > >>> > > >>> After registering indexes as materialized views, the optimizer's > search > > >>> space will look like this: > > >>> > > >>> Project([*]) > > >>> Filter[country='UK' and some_other_column='foo'] > > >>> Set0 > > >>> > > >>> where Set0 consists of table and index scans: > > >>> > > >>> TableScan('tbl', collation=[]) > > >>> IndexScan('PK', collation=[PK_field]) > > >>> IndexScan('country_idx', collation=[country]) > > >>> > > >>> At this step we have our index scans registered in the optimizer > within > > >>> the same equivalence set as a TableScan. > > >>> > > >>> The next step is a pushing filters down to the scans. We do it with a > > >>> rule which is similar to 'FilterTableScanRule'. After applying this > rule > > >>> we have a search space that is looking like this: > > >>> > > >>> Project([*]) > > >>> TableScan('tbl', collation=[], filter=[country='UK' and > > >>> some_other_column='foo']) > > >>> IndexScan('PK', collation=[PK_field], filter=[country='UK' and > > >>> some_other_column='foo']) > > >>> IndexScan('country_idx', collation=[country], filter=[country='UK' > and > > >>> some_other_column='foo']) > > >>> > > >>> And the final step is adjusting the cost model to make it select the > > >>> scan with the lower cost which depends on the filter conditions > within > > >>> the Scan. For example, full table scan with filters > > >>> > > >>> TableScan('tbl', collation=[], filter=[country='UK' and > > >>> some_other_column='foo']) > > >>> > > >>> will cost, say, 100. Because it have to scan all rows and then filter > > >>> out some set of them. On the other hand the index scan that can do > index > > >>> lookup instead of full scan will have a less cost. For example > > >>> > > >>> IndexScan('country_idx', collation=[country], filter=[country='UK' > and > > >>> some_other_column='foo']) > > >>> > > >>> will have cost about ~10, or so because it has a good index condition > > >>> `country='UK'` which can be used for index lookup that that returns > only > > >>> 10% of rows. And therefore this IndexScan should be chosen as the > best > > >>> plan by the optimizer. > > >>> > > >>> We've recently implemented this approach in Apache Ignite and it > works > > >>> well for us. You can find it in [1]. This PR has many changes that > are > > >>> unrelated to the main topic. So particularly you can look at > > >>> `IgnitePlanner.materializations()' method which registers indexes as > > >>> materialized views and `IgniteTableScan` which performs filter > > >>> conditions assessment and index scan cost estimation. > > >>> > > >>> > > >>> [1] https://github.com/apache/ignite/pull/7813 > > >>> > > >>> -- > > >>> Kind Regards > > >>> Roman Kondakov > > >>> > > >>> > > >>> On 29.05.2020 11:44, Tim Fox wrote: > > >>>> Hi, > > >>>> > > >>>> I'm building a query engine with Calcite - really enjoying working > with > > >>>> Calcite so far! > > >>>> > > >>>> When creating a plan, it seems Calcite always creates a plan where > the > > >>>> sources are table scans, however in my implementation the tables > can have > > >>>> indexes on them so a table scan is not always the right choice. > > >>>> > > >>>> I was wondering if there was any way of making Calcite "index > aware" - e.g. > > >>>> perhaps providing hints to the table scan instance that, actually, > an index > > >>>> scan or a primary key lookup should be used instead of actually > scanning > > >>>> the table. E.g. On the table meta-data if we provided information > about any > > >>>> indexes on the table, then Calcite could figure out what parts of > the query > > >>>> to push to the table scan and which to keep in the rest of the plan. > > >>>> > > >>>> There are two specific cases I really care about: > > >>>> > > >>>> 1. Queries that contain a primary key lookup: > > >>>> > > >>>> select * from some_table where key_column=23 AND > some_other_column='foo'; > > >>>> > > >>>> In the above case the 'select * from some_table where > key_column=23' can be > > >>>> implemented as a simple PK lookup in the source table, not > requiring a > > >>>> scan, thus leaving just the filter corresponding to > > >>>> 'some_other_column='foo'' in the rest of the plan > > >>>> > > >>>> 2. Queries with expressions on a column which has a secondary index > > >>>> > > >>>> select * from users where country='UK' and some_other_column='foo'; > > >>>> > > >>>> We have many users, and let's say 10% of them are from UK (still a > lot). We > > >>>> have a secondary index in the country column in the source table so > we can > > >>>> do an efficient index scan to retrieve the matching records. > > >>>> > > >>>> I found this document > > >>>> https://calcite.apache.org/docs/materialized_views.html which > seems like it > > >>>> might help me in some way. > > >>>> > > >>>> The idea being if I can think of my indexes as materialized views > then the > > >>>> query can be written against those materialized views as sources > instead of > > >>>> the original table sources. There appears to be a rule > > >>>> 'MaterializedViewRule' that does this already (?). > > >>>> > > >>>> This seems to get me a bit further, however, for this approach to > work, it > > >>>> seems I would have to create materialized views _dynamically_ during > > >>>> evaluation of the query, register them, rewrite the query, execute > it, then > > >>>> deregister the materialized view. > > >>>> > > >>>> E.g. for the primary key lookup example above, for the following > query: > > >>>> > > >>>> select * from some_table where key_column=23 AND > some_other_column='foo'; > > >>>> > > >>>> I would need to dynamically create a materialized view > corresponding to: > > >>>> > > >>>> select * from some_table where key_column=23 > > >>>> > > >>>> Then rewrite the query using MaterializedViewRule. > > >>>> > > >>>> In the general case, in order to figure out what materialized views > I need > > >>>> to dynamically create I would need to examine the query, figure out > which > > >>>> columns in expressions have indexes on them and from them work out > the best > > >>>> materialized view to create based on that information. This seems > non > > >>>> trivial. > > >>>> > > >>>> Does anyone have any suggestions or pointers for how to implement > this kind > > >>>> of thing? I suspect I'm not the first person to have tried to do > this, as > > >>>> using indexes on tables seems a pretty common thing in many systems > (?) > > >>>> > > >> > > >> > > > -- Best regards, Xu