Vladimir, Haisheng, thank you for sharing your thoughts. And special thanks to Haisheng for awesome examples.
I found Julian's reasoning about representing bitmap indexes as joins very deep and interesting. As I understand this idea right, for table A(a,b,c) indexes over columns, say, 'a' and 'b' can be represented as relations sorted by 'a' and 'b': idx_a('rowId', 'a') [collation=[a]] idx_b('rowId', 'b') [collation=[b]] Table A itself may be represented as a relation scan_A('rowId', 'a', 'b', 'c') [collation=[]] using this approach, bitmap indexes may be represented as set operations (union, intersect) over idx_a and idx_b followed by joining with scan_A using 'rowId'. Index-only scans (aka covered indexes), can be modeled as scans over idx_a or idx_b without consequent join with the main table. @Julian, did I get your idea right? If one try to apply this approach directly, then the search space will be quickly polluted by those index relations. But if the Lattice-like framework from the Julian's mail is implemented, then it will be a great feature, I guess. @Haisheng, BTW we are working on the indexed nested loop join right now and it looks like it can be implemented without special Join2IndexApplyRule: we'll use correlation variable from join condition as filter condition. If there is no index there - it will be a simple correlated nested loop join (or Apply). If index is present, the cost will be lower because index lookup will speed up this join, and this index scan will win. > > There is another useful feature, that is IndexApply. > > The rule of Join2IndexApply will transform a logical join to logical index > apply, then transform to physical indexed nested loop join. > select * from foo join bar on foo.a = bar.f; > There is an index on foo.a, but no index on bar.f. Thank you! -- Kind Regards Roman Kondakov On 02.06.2020 19:55, Julian Hyde wrote: > Vladimir, > > I feel the same way. MVs are more powerful and general, and with some effort > they could be just as efficient as other approaches. > > One problem that needs to be solved is the “registration problem”. If you > have a lot of MVs they all have to be registered in the planner’s search > space, otherwise the planner will not find them. > > I would like to see a benchmark for this. Say we have 1,000 MVs and we plan > 10,000 queries that potentially use them. The goal is that the planning cost > Is comparable to only having 1 MV. We could achieve the goal by re-using the > MVs across planner instances (thereby amortizing the cost of creating them) > and “index” the planner’s search space. > > The Lattice structure essentially does this for families for join-aggregate > MVs (summary tables). We need something similar for project-sort MVs > (indexes). > > Julian On 02.06.2020 10:05, Vladimir Ozerov wrote: > Hi Roman, > > To me, there is no principal difference between materialized views and > rule-based approaches - they are both "rules". In the materialized views > approach the rule is to create all alternative access paths > unconditionally, this rule is always fired first during the optimization > process. A rule-based approach gives you flexibility. You may do exactly > the same as with materialized views - just enumerate all access paths and > add them to search space. But at the same time, you may develop more > complicated things, such as pruning, bitmap index joins, etc. This is not > possible with materialized views. > > In this sense, the difference in complexity between materialized views > approach and rule-based approach with similar features (no pruning, etc) is > close to zero - in both cases you just iterate over existing indexes and > add them to search space. Rule-based approach becomes more complex when you > add more features to it. > > вт, 2 июн. 2020 г. в 00:37, Haisheng Yuan <hy...@apache.org>: > >> Hi Roman, >> >> There are some potential advantages. >> >> Let's still use this as example. >> SELECT * FROM foo WHERE a > 100 or b < 1000; >> Both column a and b have B-Tree indexes, note that it doesn't have to be >> bitmap index. >> >> Bitmap Table Scan on foo >> -> BitmapOr >> -> Bitmap Index Scan on foo_index_a >> Index Cond: (a > 100) >> -> Bitmap Index Scan on foo_index_b >> Index Cond: (b < 1000) >> >> filter a: -------------------- >> filter b: ----------------------- >> >> The bitmap scan just needs to read the tuples of satisfying the condition >> once, because the BitmapOr will eliminate duplicate tuple id. Constructing >> to UNION ALL will read the intersecting part twice, and it will increase >> the search space a little bit even when there is no available index. >> >> If we change the condition from OR to AND: >> SELECT * FROM foo WHERE a > 100 and b < 1000; >> >> Will MV-based approach have to choose which index to use? >> >> I would prefer the following plan: >> Bitmap Table Scan on foo >> -> BitmapAnd >> -> Bitmap Index Scan on foo_index_a >> Index Cond: (a > 100) >> -> Bitmap Index Scan on foo_index_b >> Index Cond: (b < 1000) >> >> filter a: -------------------- >> filter b: ----------------------- >> >> It would be extremely helpful when the intersecting range is very small, >> because it just need to read the intersected data page only to fetch tuples. >> >> This can applies to arbitrary combinations of AND/OR conditions. >> >> This is an attractive feature for data warehousing customers. >> >> Another example: >> SELECT max(a) from foo; >> >> The query can be transformed to >> SELECT a from foo order by a desc limit 1; >> >> If there is a index on column 'a' but with ASC order. >> The collation trait request will pass to TableScan, if there is an index >> on the collation key. We just need to transform to an IndexScan with >> reverse index scan direction, which is already possible in current Calcite. >> >> Similarly, the user query may be simple but the index on a is ASC order: >> SELECT * from foo order by a desc; >> >> MV-based approach would have to register all the tablescan again with >> reversed index scan direction. >> >> There is another useful feature, that is IndexApply. >> >> The rule of Join2IndexApply will transform a logical join to logical index >> apply, then transform to physical indexed nested loop join. >> select * from foo join bar on foo.a = bar.f; >> There is an index on foo.a, but no index on bar.f. >> >> Nested Loop Inner Join >> Join Filter: true >> -> TableScan on bar >> -> Index Scan using foo_index_a on foo >> Index Cond: foo.a = bar.f >> >> MV-based approach would have to sort table bar if it want to use the index. >> >> But considering Ignite is a in-memory computing platform, MV-based >> approach would be just good enough to serve its purpose. >> >> Looking at Drill's DbScanToIndexScanPrule.java, the way Apache Drill uses >> to tackle with index is not wrong, but I would not recommend to follow >> Drill's example to do it. >> >> Thanks, >> Haisheng >> >> On 2020/06/01 19:15:32, Julian Hyde <jh...@apache.org> wrote: >>> I'm pleased there are a variety of approaches. People should use >>> whichever works for them and their use case. >>> >>> The so-called "rule-based" approach is definitely useful for >>> OLTP-style queries where you are accessing a few rows and you need to >>> plan quickly. >>> >>> Haisheng mentioned bitmap indexes a while back. Optimizing queries to >>> use bitmap indexes is interesting because they can be combined in more >>> ways than regular indexes. In this case, and many others, it is worth >>> thinking about the problem at a high level. For example, combining two >>> bitmap indexes can be modeled as a join, where the join keys are the >>> record ids, and the record ids are sorted within each index value. >>> Thinking at the high level can find plans that the rule-based approach >>> will never find. >>> >>> Indexes-as-MVs, indexes-as-tables, and >>> index-filtered-table-scan-as-join are other examples of the high-level >>> approach. >>> >>> Julian >>> >>> >>> >>> On Mon, Jun 1, 2020 at 12:00 PM Roman Kondakov >>> <kondako...@mail.ru.invalid> wrote: >>>> >>>> Hi Xiening, >>>> >>>> the example was synthetic. What is still not clear for me is how to >>>> exploit index sortedness with a rule based approach. As far as I >>>> understand with this approach we need to write complex rules (for >>>> example [1]) that should decide which index is useful and which is not >>>> useful. These rules should take into account both statistics and >>>> collations, so they do some part of work that should be done by a cost >>>> model. And it makes writing such rules quite a difficult task. >>>> >>>> With a materialized views approach we can register all indexes as >> scans, >>>> push filters to them if needed. And the cost model, not a rule, will >>>> decide which index is better based on its cost and output collation. >>>> >>>> So the benefits of rule based approach are not so obvious to me. I >> would >>>> really appreciate if you could tell me in what cases rule-based >> approach >>>> is better. I understand that its definitely better in scenarios when >> the >>>> number of indexes is very high. But may be there are some other >> advantages? >>>> >>>> Thank you! >>>> >>>> [1] >>>> >> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/index/rules/DbScanToIndexScanPrule.java >>>> >>>> -- >>>> Kind Regards >>>> Roman Kondakov >>>> >>>> >>>> On 01.06.2020 21:00, Xiening Dai wrote: >>>>> Hi Roman, >>>>> >>>>> The example you mentioned is an advanced scenario. Note that there >> are different types of index, such as clustered index, secondary index, >> covered and non-covered index. In your case, typical OLTP/OLAP optimizer >> would create an index-based join on top of the range table scan (or >> FilteredTableScan in your term). And these transformation can definitely be >> based on rules. But the difficult part is actually the statistics >> estimation and cost calculation. You could end up with higher runtime cost >> with index based join when join cardinality is high. >>>>> >>>>> But back to the original question, if we’d like to leverage index on >> table scan, I think simple rule would serve the purpose. In fact, we have >> FilterTableScanPredicatePushdownRule in our system which does exactly the >> same thing. >>>>> >>>>>> On May 31, 2020, at 12:45 PM, Roman Kondakov >> <kondako...@mail.ru.INVALID> wrote: >>>>>> >>>>>> Hi Vladimir, >>>>>> >>>>>> thank you for sharing your point. Could you please clarify some >> details >>>>>> with a rulse-based index selection? You said >>>>>> >>>>>>> the fundamental problem with "indexes as materialized >>>>>>> views" approach is that you have to register them beforehand, >> instead of >>>>>>> using them only when needed. >>>>>> >>>>>> I agree, it's kind of a problem. What is not clear for me with >>>>>> IndexScanRule-based approach is how to decide when and which index >> we >>>>>> need? I understand that is is pretty easy to do in the case like >> this: >>>>>> >>>>>> Filter >>>>>> Scan >>>>>> >>>>>> we can match the IndexScanRule on this pattern and do an index >> lookup >>>>>> using filter condition. But what to do in the more complex >> scenarios? >>>>>> Let's consider an example >>>>>> >>>>>> SELECT * FROM A JOIN B ON A.a=B.b WHERE A.c > 100 >>>>>> >>>>>> where A.a, A.c and B.b are indexed fields. The logical plan for this >>>>>> query might look like this: >>>>>> >>>>>> LogicalJoin(A.a=B.b) >>>>>> LogicalFilter(A.c > 100) >>>>>> LogicalScan(A) >>>>>> LogicalScan(B) >>>>>> >>>>>> as I understand (please correct me if I'm wrong), with the >> rule-based >>>>>> approach, after allpying IndexScanRule the plan will look like this: >>>>>> >>>>>> LogicalJoin(A.a=B.b) >>>>>> PhysicalIndexScan(A.c, lower bound = 100) >>>>>> PhysicalTableScan(B) >>>>>> >>>>>> But in this case we lose the possibility of using index scans over >> A.a >>>>>> and B.b and joining them with MergeJoin, which can be more efficient >>>>>> plan in terms of the cost. >>>>>> >>>>>> My question is: how rule-based approach handle this scenario? Will >> it >>>>>> re-apply IndexScanRule once again to produce PhysicalIndexScan(A.a) >> and >>>>>> PhysicalIndexScan(B.b)? Or am I missing the crucial point of a >> rule-base >>>>>> approach? >>>>>> >>>>>> Thank you in advance! >>>>>> >>>>>> >>>>>> -- >>>>>> Kind Regards >>>>>> Roman Kondakov >>>>>> >>>>>> >>>>>> On 31.05.2020 21:39, Vladimir Ozerov wrote: >>>>>>> As already mentioned, the fundamental problem with "indexes as >> materialized >>>>>>> views" approach is that you have to register them beforehand, >> instead of >>>>>>> using them only when needed. On the other hand, the complexity of >> index >>>>>>> planning comes from cost estimation and predicate splitting. >>>>>>> Materializations cannot help you with that anyhow. This is why I >> call this >>>>>>> approach (not materialized views per se) "hacky" - you reuse >> several simple >>>>>>> parts of the Calcite infrastructure at the cost of loss in the >> flexibility >>>>>>> of the planner, while the most complicated parts still need to be >>>>>>> implemented by hand. >>>>>>> >>>>>>> Materialized views could be a good fit e.g for partial indexes >> because in >>>>>>> this case, Calcite could help you with complex subsumption >> mechanics. But >>>>>>> for standard indexes, the pros/cons balance is not that obvious. >>>>>>> >>>>>>> вс, 31 мая 2020 г. в 19:28, xu <xuzh1...@gmail.com>: >>>>>>> >>>>>>>> Hi Tim, >>>>>>>> >>>>>>>> I am working on MySQL InnoDB adapter and trying to introduce this >> to >>>>>>>> Calcite, currently it is only in early stage, and not >> approved/reviewed by >>>>>>>> committers yet. Anyway, we are facing the same problem like what >> index to >>>>>>>> use, how to push down order by operation, etc. I have developed a >> simple >>>>>>>> rule based adapter to be "index aware" and being able to leverage >> a MySQL >>>>>>>> InnoDB storage engine written in Java. Hope this will help you to >> explore >>>>>>>> more options. >>>>>>>> >>>>>>>> https://issues.apache.org/jira/browse/CALCITE-4034 >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Xu >>>>>>>> >>>>>>>> Haisheng Yuan <hy...@apache.org> 于2020年5月31日周日 下午10:06写道: >>>>>>>> >>>>>>>>> Hi Roman, >>>>>>>>> >>>>>>>>> Thank you for sharing your thoughts. >>>>>>>>> >>>>>>>>>> It can be very tricky because the rule should consider not >>>>>>>>>> only filters, but also collations. This leads to increasing the >>>>>>>>>> complexity of such rules. >>>>>>>>> >>>>>>>>> Logical transformation rules like >> FilterTableScan2IndexTableScanRule >>>>>>>>> should not consider physical properties, like collation, >> distribution. >>>>>>>> You >>>>>>>>> forgot that we just reached consensus in CALCITE-3972. :) >>>>>>>>> >>>>>>>>> Regarding the 2nd option that uses index b, it is indeed not >> that easy >>>>>>>> for >>>>>>>>> Calcite 1.22.0. In latest version, it now becomes possible. >> After rule >>>>>>>>> transformation, during top-down trait request, the collation is >> passed >>>>>>>>> through Filter, down to physical TableScan, which accepts the >> trait >>>>>>>> request >>>>>>>>> with collation on b, find there is an index on b, and return a >> new >>>>>>>> RelNode >>>>>>>>> IndexScan. This process can be done in >>>>>>>>> EnumerableTableScan#passThrough(required). >>>>>>>>> >>>>>>>>>> I hope, when the Cascades-style optimizer become a part of >> Calcite, the >>>>>>>>> search space >>>>>>>>>> pollution will not be a serious issue anymore. >>>>>>>>> >>>>>>>>> I hope so too. Top-down style can help alleviate space pollution >> by space >>>>>>>>> pruning. But the space pollution caused by LogicalProject >> operator and >>>>>>>> its >>>>>>>>> related rules still can't be avoided. :) >>>>>>>>> >>>>>>>>> Again, thanks for sharing your experience with us. >>>>>>>>> >>>>>>>>> Haisheng >>>>>>>>> >>>>>>>>> On 2020/05/31 09:58:36, Roman Kondakov >> <kondako...@mail.ru.INVALID> >>>>>>>>> wrote: >>>>>>>>>> Hi Haisheng, >>>>>>>>>> >>>>>>>>>> The basic rationale behind the using materialized views for >> secondary >>>>>>>>>> index representation instead of special rules like mentioned >>>>>>>>>> FilterTableScan2IndexTableScanRule is the simplicity of >> implementation. >>>>>>>>>> >>>>>>>>>> You are absolutely right that materialized views approach has an >>>>>>>> obvious >>>>>>>>>> drawback that it should register all indexes as materialized >> views in >>>>>>>>>> the optimizer's search space. But we expect our users will not >> overuse >>>>>>>>>> indexes because in general having too many indexes is a bad >> practice: >>>>>>>> on >>>>>>>>>> each table update we also should update it's indexes and it can >> cause >>>>>>>>>> some performance degradation of the system as a whole, not only >>>>>>>>>> optimizer. So we expect the number of indexes will be >> relatively small. >>>>>>>>>> >>>>>>>>>> Ignite uses indexes not only for index lookups, but also for >> exploiting >>>>>>>>>> it's sortedness. In this case materialized view's approach can >> show >>>>>>>> some >>>>>>>>>> advantages. Let's consider the example: >>>>>>>>>> >>>>>>>>>> SELECT * FROM foo WHERE a > 100 ORDER BY b; >>>>>>>>>> >>>>>>>>>> where both fields 'a' and 'b' are indexed. In this case we will >> have >>>>>>>> two >>>>>>>>>> alternatives of the query execution: >>>>>>>>>> >>>>>>>>>> 1. Use index 'a' for index conditioned scan and then sort rows >> by 'b' >>>>>>>>>> 2. Use index scan over 'b' and then apply filter over 'a' - >> here we can >>>>>>>>>> avoid sorting, because index over 'b' is already sorted. >>>>>>>>>> >>>>>>>>>> If I understand the approach with >> FilterTableScan2IndexTableScanRule >>>>>>>>>> correctly, at this step the rule should make a decision about >> which >>>>>>>>>> index to use. It can be very tricky because the rule should >> consider >>>>>>>> not >>>>>>>>>> only filters, but also collations. This leads to increasing the >>>>>>>>>> complexity of such rules. With materialized views approach we >> just >>>>>>>>>> register all indexes with their caollation and let the cost >> model do >>>>>>>> its >>>>>>>>>> job. Our rules are very simple. The complexity is encapsulated >> in the >>>>>>>>>> index scan cost estimation. >>>>>>>>>> >>>>>>>>>> As for your example with disjunctive predicate: >>>>>>>>>> >>>>>>>>>>> SELECT * FROM foo WHERE a > 100 or b < 1000; >>>>>>>>>> >>>>>>>>>> I think with materialized views approach we can do a complex >> logic as >>>>>>>>>> well. For example, we may implement a special rule for such >> cases: >>>>>>>>>> BitmapOrRule. Or, even better, we will rewrite such predicates >> to a >>>>>>>>>> UNION ALL clause. A very good explanation and a comparison of >> both >>>>>>>>>> approaches I've found in this Oracle's blog post [1]. >>>>>>>>>> >>>>>>>>>> As a conclusion: choosing between materialized views approach >> and >>>>>>>>>> IndexRule-based approach is a trade-off between complexity of >>>>>>>>>> implementation and search space pollution. I hope, when the >>>>>>>>>> Cascades-style optimizer become a part of Calcite, the search >> space >>>>>>>>>> pollution will not be a serious issue anymore. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Thanks! >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> [1] >>>>>>>>>> >>>>>>>>> >>>>>>>> >> https://blogs.oracle.com/optimizer/optimizer-transformations:-or-expansion >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Kind Regards >>>>>>>>>> Roman Kondakov >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 31.05.2020 07:31, Haisheng Yuan wrote: >>>>>>>>>>> Thanks Julian and Roman for sharing the experiences for >> modeling >>>>>>>>> indexes. >>>>>>>>>>> >>>>>>>>>>> Besides using materialized views, which is already proven by >> Phoenix >>>>>>>>> and Ignite, there is another approach, as mentioned by Vladimir, >> define >>>>>>>>> your own rules and indexscan operators. >>>>>>>>>>> >>>>>>>>>>> FilterTableScan2IndexScanRule and its variances, which match >> Filter >>>>>>>>> over TableScan, create an IndexScan if the table has >> corresponding index >>>>>>>> on >>>>>>>>> the filter column. You don't have to register different >> tablescans for >>>>>>>> all >>>>>>>>> the indexes, in case you have nearly thousands indexes for a >> table, say >>>>>>>>> 999, which is allowed in SQL Server 2016. It can also support >> more >>>>>>>> complex >>>>>>>>> scenario, e.g.: >>>>>>>>>>> >>>>>>>>>>> SELECT * FROM foo WHERE a > 100 or b < 1000; >>>>>>>>>>> If there is an index on column a, and another index on column >> b, we >>>>>>>>> may need to utilize both indexes, through Bitmap TableScan. >>>>>>>>>>> >>>>>>>>>>> Bitmap Table Scan on foo >>>>>>>>>>> -> BitmapOR >>>>>>>>>>> -> Bitmap Index Scan on foo_index_a >>>>>>>>>>> Index Condition: (a > 100) >>>>>>>>>>> -> Bitmap Index Scan on foo_index_b >>>>>>>>>>> Index Condition: (b < 1000) >>>>>>>>>>> >>>>>>>>>>> But still, this approach requires some non-trivial work to do. >>>>>>>>>>> >>>>>>>>>>> Hi Roman, I believe you definitely have consulted both >> approaches for >>>>>>>>> Apache Ignite to work with indexes, you decided to go with >> materialized >>>>>>>>> views, there are some reasons and tradeoffs to consider behind >> your >>>>>>>>> decision. Do you mind sharing with us? >>>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>> Haisheng >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On 2020/05/29 17:34:27, Julian Hyde <jh...@apache.org> wrote: >>>>>>>>>>>> Materialized views are not a hack, as Vladimir claims. >> Materialized >>>>>>>>> views are a fundamental concept in relational algebra, and they >> are an >>>>>>>>> elegant way - in my opinion the correct way - to model indexes >> (and many >>>>>>>>> other structures). >>>>>>>>>>>> >>>>>>>>>>>> In Calcite materialized views are a feature in the planner >> that >>>>>>>>> allows you to declare a table as equivalent to a query. (You do >> not need >>>>>>>> to >>>>>>>>> issue a CREATE MATERIALIZED VIEW statement. They are internal.) >> Then you >>>>>>>>> can use algebra to substitute one for the other, and apply >> cost-based >>>>>>>>> optimization to choose between plans with & without the >> materialized >>>>>>>> view. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Other people have used this approach in the past. Search this >> list >>>>>>>>> and you should find discussions. Also I have a talk with Maryann >> Xue >>>>>>>> about >>>>>>>>> planning for indexes in Apache Phoenix[1]. >>>>>>>>>>>> >>>>>>>>>>>> Julian >>>>>>>>>>>> >>>>>>>>>>>> [1] >>>>>>>>> >>>>>>>> >> https://www.slideshare.net/julianhyde/costbased-query-optimization-in-apache-phoenix-using-apache-calcite >>>>>>>>> < >>>>>>>>> >>>>>>>> >> https://www.slideshare.net/julianhyde/costbased-query-optimization-in-apache-phoenix-using-apache-calcite >>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> On May 29, 2020, at 5:28 AM, Roman Kondakov >>>>>>>>> <kondako...@mail.ru.INVALID> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> Hi Tim, >>>>>>>>>>>>> >>>>>>>>>>>>> In Apache Ignite we've already faced this challenge. We >> solved it >>>>>>>>> using >>>>>>>>>>>>> materialized views and FilterableTable. Let's consider your >>>>>>>> example: >>>>>>>>>>>>> >>>>>>>>>>>>>> select * from users where country='UK' and >>>>>>>> some_other_column='foo'; >>>>>>>>>>>>> >>>>>>>>>>>>> with a primary index and a sorted secondary index (B+Tree?) >> over >>>>>>>> the >>>>>>>>>>>>> 'country' field. >>>>>>>>>>>>> >>>>>>>>>>>>> As a first step Ignite registers all indexes that might be >> helpful >>>>>>>>> for >>>>>>>>>>>>> the query plan in VolcanoPlanner as materialized views. For >> now we >>>>>>>>>>>>> register all indexes in all tables that participate in the >> query. >>>>>>>>>>>>> Registering all indexes might be excessive, but maybe we >> will apply >>>>>>>>> some >>>>>>>>>>>>> pruning later. So it's ok for now to register all indexes. >>>>>>>>>>>>> >>>>>>>>>>>>> Index materialized view is very simple: it's just a sorted >> table >>>>>>>>> scan: >>>>>>>>>>>>> >>>>>>>>>>>>> 'SELECT * from tbl ORDER BY idx_fields' >>>>>>>>>>>>> >>>>>>>>>>>>> After registering indexes as materialized views, the >> optimizer's >>>>>>>>> search >>>>>>>>>>>>> space will look like this: >>>>>>>>>>>>> >>>>>>>>>>>>> Project([*]) >>>>>>>>>>>>> Filter[country='UK' and some_other_column='foo'] >>>>>>>>>>>>> Set0 >>>>>>>>>>>>> >>>>>>>>>>>>> where Set0 consists of table and index scans: >>>>>>>>>>>>> >>>>>>>>>>>>> TableScan('tbl', collation=[]) >>>>>>>>>>>>> IndexScan('PK', collation=[PK_field]) >>>>>>>>>>>>> IndexScan('country_idx', collation=[country]) >>>>>>>>>>>>> >>>>>>>>>>>>> At this step we have our index scans registered in the >> optimizer >>>>>>>>> within >>>>>>>>>>>>> the same equivalence set as a TableScan. >>>>>>>>>>>>> >>>>>>>>>>>>> The next step is a pushing filters down to the scans. We do >> it >>>>>>>> with a >>>>>>>>>>>>> rule which is similar to 'FilterTableScanRule'. After >> applying this >>>>>>>>> rule >>>>>>>>>>>>> we have a search space that is looking like this: >>>>>>>>>>>>> >>>>>>>>>>>>> Project([*]) >>>>>>>>>>>>> TableScan('tbl', collation=[], filter=[country='UK' and >>>>>>>>>>>>> some_other_column='foo']) >>>>>>>>>>>>> IndexScan('PK', collation=[PK_field], filter=[country='UK' >> and >>>>>>>>>>>>> some_other_column='foo']) >>>>>>>>>>>>> IndexScan('country_idx', collation=[country], >> filter=[country='UK' >>>>>>>>> and >>>>>>>>>>>>> some_other_column='foo']) >>>>>>>>>>>>> >>>>>>>>>>>>> And the final step is adjusting the cost model to make it >> select >>>>>>>> the >>>>>>>>>>>>> scan with the lower cost which depends on the filter >> conditions >>>>>>>>> within >>>>>>>>>>>>> the Scan. For example, full table scan with filters >>>>>>>>>>>>> >>>>>>>>>>>>> TableScan('tbl', collation=[], filter=[country='UK' and >>>>>>>>>>>>> some_other_column='foo']) >>>>>>>>>>>>> >>>>>>>>>>>>> will cost, say, 100. Because it have to scan all rows and >> then >>>>>>>> filter >>>>>>>>>>>>> out some set of them. On the other hand the index scan that >> can do >>>>>>>>> index >>>>>>>>>>>>> lookup instead of full scan will have a less cost. For >> example >>>>>>>>>>>>> >>>>>>>>>>>>> IndexScan('country_idx', collation=[country], >> filter=[country='UK' >>>>>>>>> and >>>>>>>>>>>>> some_other_column='foo']) >>>>>>>>>>>>> >>>>>>>>>>>>> will have cost about ~10, or so because it has a good index >>>>>>>> condition >>>>>>>>>>>>> `country='UK'` which can be used for index lookup that that >> returns >>>>>>>>> only >>>>>>>>>>>>> 10% of rows. And therefore this IndexScan should be chosen >> as the >>>>>>>>> best >>>>>>>>>>>>> plan by the optimizer. >>>>>>>>>>>>> >>>>>>>>>>>>> We've recently implemented this approach in Apache Ignite >> and it >>>>>>>>> works >>>>>>>>>>>>> well for us. You can find it in [1]. This PR has many >> changes that >>>>>>>>> are >>>>>>>>>>>>> unrelated to the main topic. So particularly you can look at >>>>>>>>>>>>> `IgnitePlanner.materializations()' method which registers >> indexes >>>>>>>> as >>>>>>>>>>>>> materialized views and `IgniteTableScan` which performs >> filter >>>>>>>>>>>>> conditions assessment and index scan cost estimation. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> [1] https://github.com/apache/ignite/pull/7813 >>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> Kind Regards >>>>>>>>>>>>> Roman Kondakov >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On 29.05.2020 11:44, Tim Fox wrote: >>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>> >>>>>>>>>>>>>> I'm building a query engine with Calcite - really enjoying >> working >>>>>>>>> with >>>>>>>>>>>>>> Calcite so far! >>>>>>>>>>>>>> >>>>>>>>>>>>>> When creating a plan, it seems Calcite always creates a >> plan where >>>>>>>>> the >>>>>>>>>>>>>> sources are table scans, however in my implementation the >> tables >>>>>>>>> can have >>>>>>>>>>>>>> indexes on them so a table scan is not always the right >> choice. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I was wondering if there was any way of making Calcite >> "index >>>>>>>>> aware" - e.g. >>>>>>>>>>>>>> perhaps providing hints to the table scan instance that, >> actually, >>>>>>>>> an index >>>>>>>>>>>>>> scan or a primary key lookup should be used instead of >> actually >>>>>>>>> scanning >>>>>>>>>>>>>> the table. E.g. On the table meta-data if we provided >> information >>>>>>>>> about any >>>>>>>>>>>>>> indexes on the table, then Calcite could figure out what >> parts of >>>>>>>>> the query >>>>>>>>>>>>>> to push to the table scan and which to keep in the rest of >> the >>>>>>>> plan. >>>>>>>>>>>>>> >>>>>>>>>>>>>> There are two specific cases I really care about: >>>>>>>>>>>>>> >>>>>>>>>>>>>> 1. Queries that contain a primary key lookup: >>>>>>>>>>>>>> >>>>>>>>>>>>>> select * from some_table where key_column=23 AND >>>>>>>>> some_other_column='foo'; >>>>>>>>>>>>>> >>>>>>>>>>>>>> In the above case the 'select * from some_table where >>>>>>>>> key_column=23' can be >>>>>>>>>>>>>> implemented as a simple PK lookup in the source table, not >>>>>>>>> requiring a >>>>>>>>>>>>>> scan, thus leaving just the filter corresponding to >>>>>>>>>>>>>> 'some_other_column='foo'' in the rest of the plan >>>>>>>>>>>>>> >>>>>>>>>>>>>> 2. Queries with expressions on a column which has a >> secondary >>>>>>>> index >>>>>>>>>>>>>> >>>>>>>>>>>>>> select * from users where country='UK' and >>>>>>>> some_other_column='foo'; >>>>>>>>>>>>>> >>>>>>>>>>>>>> We have many users, and let's say 10% of them are from UK >> (still a >>>>>>>>> lot). We >>>>>>>>>>>>>> have a secondary index in the country column in the source >> table >>>>>>>> so >>>>>>>>> we can >>>>>>>>>>>>>> do an efficient index scan to retrieve the matching records. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I found this document >>>>>>>>>>>>>> https://calcite.apache.org/docs/materialized_views.html >> which >>>>>>>>> seems like it >>>>>>>>>>>>>> might help me in some way. >>>>>>>>>>>>>> >>>>>>>>>>>>>> The idea being if I can think of my indexes as materialized >> views >>>>>>>>> then the >>>>>>>>>>>>>> query can be written against those materialized views as >> sources >>>>>>>>> instead of >>>>>>>>>>>>>> the original table sources. There appears to be a rule >>>>>>>>>>>>>> 'MaterializedViewRule' that does this already (?). >>>>>>>>>>>>>> >>>>>>>>>>>>>> This seems to get me a bit further, however, for this >> approach to >>>>>>>>> work, it >>>>>>>>>>>>>> seems I would have to create materialized views >> _dynamically_ >>>>>>>> during >>>>>>>>>>>>>> evaluation of the query, register them, rewrite the query, >> execute >>>>>>>>> it, then >>>>>>>>>>>>>> deregister the materialized view. >>>>>>>>>>>>>> >>>>>>>>>>>>>> E.g. for the primary key lookup example above, for the >> following >>>>>>>>> query: >>>>>>>>>>>>>> >>>>>>>>>>>>>> select * from some_table where key_column=23 AND >>>>>>>>> some_other_column='foo'; >>>>>>>>>>>>>> >>>>>>>>>>>>>> I would need to dynamically create a materialized view >>>>>>>>> corresponding to: >>>>>>>>>>>>>> >>>>>>>>>>>>>> select * from some_table where key_column=23 >>>>>>>>>>>>>> >>>>>>>>>>>>>> Then rewrite the query using MaterializedViewRule. >>>>>>>>>>>>>> >>>>>>>>>>>>>> In the general case, in order to figure out what >> materialized >>>>>>>> views >>>>>>>>> I need >>>>>>>>>>>>>> to dynamically create I would need to examine the query, >> figure >>>>>>>> out >>>>>>>>> which >>>>>>>>>>>>>> columns in expressions have indexes on them and from them >> work out >>>>>>>>> the best >>>>>>>>>>>>>> materialized view to create based on that information. This >> seems >>>>>>>>> non >>>>>>>>>>>>>> trivial. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Does anyone have any suggestions or pointers for how to >> implement >>>>>>>>> this kind >>>>>>>>>>>>>> of thing? I suspect I'm not the first person to have tried >> to do >>>>>>>>> this, as >>>>>>>>>>>>>> using indexes on tables seems a pretty common thing in many >>>>>>>> systems >>>>>>>>> (?) >>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> >>>>>>>> Best regards, >>>>>>>> >>>>>>>> Xu >>>>>>>> >>>>>>> >>>>> >>> >> >