Re: Using indexes rather than table scans with Calcite

2020-06-03 Thread Roman Kondakov
Hi Haisheng,

> If I didn't get it wrong, joining with scan_A means it has to read all the 
> tuples from table A, this will make index meaningless, because the purpose of 
> using index in that example is to avoid reading all the tuples from table A.

I meant 'rowId' as a physical address (ctid in PG or rowId in Oracle).
We can combine those rowIds in some way (by analogy with bitmap index)
and then use output rowids to fetch only those rows which we actually
need. Logically it is equivalent to join and can be implemented as a
nested loop join: we can iterate over the resulted rowids and fetch
corresponding rows from the row store.

> Perhaps you have different definition of Join, but the Join operator in 
> Calcite doesn't have correlation variable, only Correlate operator has, which 
> is Calcite's version of Apply. You still need the rule to transform Join to 
> Correlate, aka, Apply. In Calcite, the rule is JoinToCorrelateRule. However I 
> don't recommend using JoinToCorrelateRule, because it will transform Join to 
> Correlate unconditionally, in case of multi-way joins with join reordering 
> enabled, it will generate a lot of useless alternatives, unless you want to 
> support multi-level correlation, like:

We are going to use Correlate, I just skipped this detail in the
previous mail. You are right that it can inflate the search space. May
be we'll need to think of some pruning techniques later if it become a
problem. As I can see, this approach has some advantages because it
seems more flexible to me. In addition to the ability you mentioned to
use multi-level correlated join, it provides some kind of freedom
because we are not restricted to the particular pattern of
Join2IndexApply rule. In other words: we need to match Join2IndexApply
rule on some concrete pattern like

Join
  RelNode
  TableScan

But what if there is a filter or project between join and scan? We can
add specific patterns for this rule

Join
  RelNode
  Project
TableScan

Join
  RelNode
  Filter
TableScan

but still it's not enough to cover all possible cases. What if there are
some other RelNodes like aggregate there? We cannot foresee all possible
cases. With Correlation approach we just try to push down the filter
with correlation varible as much as possible. If it is possible to this
filter to reach the IndexScan, it becomes a very efficient index join
with low cost. Otherwise optimizer throws this alternative away. It can
be helpful in some queries like

SELECT * FROM
deps d
JOIN
(SELECT depId, COOUNT(*) FROM emps GROUP BY depId) AS e
ON d.depId = e.depId
WHERE d.name = "AAA"

if table deps is relatively small (affter filtering) and emps is big it
will be more efficient to do correlated nested loop join:

CorrelatedJoin(d.depId = e.depId)
  Filter(d.name = "AAA")
TableScan(deps)
  Agg(depId, COOUNT(*))
   IndexScan(emps, lookup: e.depId = d.depId[correlatedVar])

because in this case we don't have to do a full aggregation over the
'emps' table before doing a join:

HashJoin(d.depId = e.depId)
  Filter(d.name = "AAA")
TableScan(deps)
  Agg(depId, COOUNT(*)) <- full aggregation
   TableScan(emps)

So, tradeoff between rule-based approach and MV approach is classic:
space search inflation vs more plan alternatives.

Thank you!

-- 
Kind Regards
Roman Kondakov


On 03.06.2020 05:42, Haisheng Yuan wrote:
>> using this approach, bitmap indexes may be represented as set operations
>> (union, intersect) over idx_a and idx_b followed by joining with scan_A
>> using 'rowId'.
> 
> If I didn't get it wrong, joining with scan_A means it has to read all the 
> tuples from table A, this will make index meaningless, because the purpose of 
> using index in that example is to avoid reading all the tuples from table A.
> 
>> and it looks like it can be implemented without special
>> Join2IndexApplyRule: we'll use correlation variable from join condition
>> as filter condition.
> 
> Perhaps you have different definition of Join, but the Join operator in 
> Calcite doesn't have correlation variable, only Correlate operator has, which 
> is Calcite's version of Apply. You still need the rule to transform Join to 
> Correlate, aka, Apply. In Calcite, the rule is JoinToCorrelateRule. However I 
> don't recommend using JoinToCorrelateRule, because it will transform Join to 
> Correlate unconditionally, in case of multi-way joins with join reordering 
> enabled, it will generate a lot of useless alternatives, unless you want to 
> support multi-level correlation, like:
> 
> NestedLoopOuterJoin
> -> Table Scan on A
> ->  HashOuterJoin
>  Join Condition: B.Y = C.W
> -> Table Scan on B
> -> Index Scan using C_Z_IDX on C
>  Index Condition: C.Z = A.X
> 
> or
> 
> NestedLoopOuterJoin
> -> Table Scan on SmallTable1 A
> -> NestedLoopOuterJoin
> -> Table Scan on SmallTable2 B
> -> Index Scan using XYIndex on LargeTable C
> 

Re: Using indexes rather than table scans with Calcite

2020-06-02 Thread Haisheng Yuan
> using this approach, bitmap indexes may be represented as set operations
> (union, intersect) over idx_a and idx_b followed by joining with scan_A
> using 'rowId'.

If I didn't get it wrong, joining with scan_A means it has to read all the 
tuples from table A, this will make index meaningless, because the purpose of 
using index in that example is to avoid reading all the tuples from table A.

> and it looks like it can be implemented without special
> Join2IndexApplyRule: we'll use correlation variable from join condition
> as filter condition.

Perhaps you have different definition of Join, but the Join operator in Calcite 
doesn't have correlation variable, only Correlate operator has, which is 
Calcite's version of Apply. You still need the rule to transform Join to 
Correlate, aka, Apply. In Calcite, the rule is JoinToCorrelateRule. However I 
don't recommend using JoinToCorrelateRule, because it will transform Join to 
Correlate unconditionally, in case of multi-way joins with join reordering 
enabled, it will generate a lot of useless alternatives, unless you want to 
support multi-level correlation, like:

NestedLoopOuterJoin
-> Table Scan on A
->  HashOuterJoin
 Join Condition: B.Y = C.W
-> Table Scan on B
-> Index Scan using C_Z_IDX on C
 Index Condition: C.Z = A.X

or

NestedLoopOuterJoin
-> Table Scan on SmallTable1 A
-> NestedLoopOuterJoin
-> Table Scan on SmallTable2 B
-> Index Scan using XYIndex on LargeTable C
Index Condition: C.X = A.AID and C.Y = B.BID

Otherwise, the join to correlate rule that matches a join with right child as a 
tablescan (or its variants) makes more sense. Transforming a Join to Correlate 
is meaningful only when there is at least 1 index can be used in the right 
relation, so it would be perfect if you can check if there is available index 
in the right relation or not, if you care about the search space.

On 2020/06/02 20:30:41, Roman Kondakov  wrote: 
> Vladimir, Haisheng,
> 
> thank you for sharing your thoughts. And special thanks to Haisheng for
> awesome examples.
> 
> I found Julian's reasoning about representing bitmap indexes as joins
> very deep and interesting. As I understand this idea right, for table
> A(a,b,c) indexes over columns, say, 'a' and 'b' can be represented as
> relations sorted by 'a' and 'b':
> 
> idx_a('rowId', 'a') [collation=[a]]
> idx_b('rowId', 'b') [collation=[b]]
> 
> Table A itself may be represented as a relation
> 
> scan_A('rowId', 'a', 'b', 'c') [collation=[]]
> 
> using this approach, bitmap indexes may be represented as set operations
> (union, intersect) over idx_a and idx_b followed by joining with scan_A
> using 'rowId'.
> 
> Index-only scans (aka covered indexes), can be modeled as scans over
> idx_a or idx_b without consequent join with the main table.
> 
> @Julian, did I get your idea right?
> 
> If one try to apply this approach directly, then the search space will
> be quickly polluted by those index relations. But if the Lattice-like
> framework from the Julian's mail is implemented, then it will be a great
> feature, I guess.
> 
> @Haisheng, BTW we are working on the indexed nested loop join right now
> and it looks like it can be implemented without special
> Join2IndexApplyRule: we'll use correlation variable from join condition
> as filter condition. If there is no index there - it will be a simple
> correlated nested loop join (or Apply). If index is present, the cost
> will be lower because index lookup will speed up this join, and this
> index scan will win.
> 
> > 
> > There is another useful feature, that is IndexApply.
> > 
> > The rule of Join2IndexApply will transform a logical join to logical index 
> > apply, then transform to physical indexed nested loop join.
> > select * from foo join bar on foo.a = bar.f;
> > There is an index on foo.a, but no index on bar.f.
> 
> 
> Thank you!
> 
> 
> -- 
> Kind Regards
> Roman Kondakov
> 
> 
> On 02.06.2020 19:55, Julian Hyde wrote:
> 
> > Vladimir,
> > 
> > I feel the same way. MVs are more powerful and general, and with some 
> > effort they could be just as efficient as other approaches. 
> > 
> > One problem that needs to be solved is the “registration problem”. If you 
> > have a lot of MVs they all have to be registered in the planner’s search 
> > space, otherwise the planner will not find them.
> > 
> > I would like to see a benchmark for this. Say we have 1,000 MVs and we plan 
> > 10,000 queries that potentially use them. The goal is that the planning 
> > cost Is comparable to only having 1 MV. We could achieve the goal by 
> > re-using the MVs across planner instances (thereby amortizing the cost of 
> > creating them) and “index” the planner’s search space.
> > 
> > The Lattice structure essentially does this for families for join-aggregate 
> > MVs (summary tables). We need something similar for project-sort MVs 
> > (indexes). 
> > 

Re: Using indexes rather than table scans with Calcite

2020-06-02 Thread Roman Kondakov
Vladimir, Haisheng,

thank you for sharing your thoughts. And special thanks to Haisheng for
awesome examples.

I found Julian's reasoning about representing bitmap indexes as joins
very deep and interesting. As I understand this idea right, for table
A(a,b,c) indexes over columns, say, 'a' and 'b' can be represented as
relations sorted by 'a' and 'b':

idx_a('rowId', 'a') [collation=[a]]
idx_b('rowId', 'b') [collation=[b]]

Table A itself may be represented as a relation

scan_A('rowId', 'a', 'b', 'c') [collation=[]]

using this approach, bitmap indexes may be represented as set operations
(union, intersect) over idx_a and idx_b followed by joining with scan_A
using 'rowId'.

Index-only scans (aka covered indexes), can be modeled as scans over
idx_a or idx_b without consequent join with the main table.

@Julian, did I get your idea right?

If one try to apply this approach directly, then the search space will
be quickly polluted by those index relations. But if the Lattice-like
framework from the Julian's mail is implemented, then it will be a great
feature, I guess.

@Haisheng, BTW we are working on the indexed nested loop join right now
and it looks like it can be implemented without special
Join2IndexApplyRule: we'll use correlation variable from join condition
as filter condition. If there is no index there - it will be a simple
correlated nested loop join (or Apply). If index is present, the cost
will be lower because index lookup will speed up this join, and this
index scan will win.

> 
> There is another useful feature, that is IndexApply.
> 
> The rule of Join2IndexApply will transform a logical join to logical index 
> apply, then transform to physical indexed nested loop join.
> select * from foo join bar on foo.a = bar.f;
> There is an index on foo.a, but no index on bar.f.


Thank you!


-- 
Kind Regards
Roman Kondakov


On 02.06.2020 19:55, Julian Hyde wrote:

> Vladimir,
> 
> I feel the same way. MVs are more powerful and general, and with some effort 
> they could be just as efficient as other approaches. 
> 
> One problem that needs to be solved is the “registration problem”. If you 
> have a lot of MVs they all have to be registered in the planner’s search 
> space, otherwise the planner will not find them.
> 
> I would like to see a benchmark for this. Say we have 1,000 MVs and we plan 
> 10,000 queries that potentially use them. The goal is that the planning cost 
> Is comparable to only having 1 MV. We could achieve the goal by re-using the 
> MVs across planner instances (thereby amortizing the cost of creating them) 
> and “index” the planner’s search space.
> 
> The Lattice structure essentially does this for families for join-aggregate 
> MVs (summary tables). We need something similar for project-sort MVs 
> (indexes). 
> 
> Julian






On 02.06.2020 10:05, Vladimir Ozerov wrote:
> Hi Roman,
> 
> To me, there is no principal difference between materialized views and
> rule-based approaches - they are both "rules". In the materialized views
> approach the rule is to create all alternative access paths
> unconditionally, this rule is always fired first during the optimization
> process. A rule-based approach gives you flexibility. You may do exactly
> the same as with materialized views - just enumerate all access paths and
> add them to search space. But at the same time, you may develop more
> complicated things, such as pruning, bitmap index joins, etc. This is not
> possible with materialized views.
> 
> In this sense, the difference in complexity between materialized views
> approach and rule-based approach with similar features (no pruning, etc) is
> close to zero - in both cases you just iterate over existing indexes and
> add them to search space. Rule-based approach becomes more complex when you
> add more features to it.
> 
> вт, 2 июн. 2020 г. в 00:37, Haisheng Yuan :
> 
>> Hi Roman,
>>
>> There are some potential advantages.
>>
>> Let's still use this as example.
>> SELECT * FROM foo WHERE a > 100 or b < 1000;
>> Both column a and b have B-Tree indexes, note that it doesn't have to be
>> bitmap index.
>>
>>  Bitmap Table Scan on foo
>>->  BitmapOr
>>  ->  Bitmap Index Scan on foo_index_a
>>Index Cond: (a > 100)
>>  ->  Bitmap Index Scan on foo_index_b
>>Index Cond: (b < 1000)
>>
>> filter a:  
>> filter b:---
>>
>> The bitmap scan just needs to read the tuples of satisfying the condition
>> once, because the BitmapOr will eliminate duplicate tuple id. Constructing
>> to UNION ALL will read the intersecting part twice, and it will increase
>> the search space a little bit even when there is no available index.
>>
>> If we change the condition from OR to AND:
>> SELECT * FROM foo WHERE a > 100 and b < 1000;
>>
>> Will MV-based approach have to choose which index to use?
>>
>> I would prefer the following plan:
>>  Bitmap Table Scan on foo
>>->  BitmapAnd
>>  ->  

Re: Using indexes rather than table scans with Calcite

2020-06-02 Thread Julian Hyde
Vladimir,

I feel the same way. MVs are more powerful and general, and with some effort 
they could be just as efficient as other approaches. 

One problem that needs to be solved is the “registration problem”. If you have 
a lot of MVs they all have to be registered in the planner’s search space, 
otherwise the planner will not find them.

I would like to see a benchmark for this. Say we have 1,000 MVs and we plan 
10,000 queries that potentially use them. The goal is that the planning cost Is 
comparable to only having 1 MV. We could achieve the goal by re-using the MVs 
across planner instances (thereby amortizing the cost of creating them) and 
“index” the planner’s search space.

The Lattice structure essentially does this for families for join-aggregate MVs 
(summary tables). We need something similar for project-sort MVs (indexes). 

Julian

> On Jun 2, 2020, at 12:05 AM, Vladimir Ozerov  wrote:
> 


Re: Using indexes rather than table scans with Calcite

2020-06-02 Thread Vladimir Ozerov
Hi Roman,

To me, there is no principal difference between materialized views and
rule-based approaches - they are both "rules". In the materialized views
approach the rule is to create all alternative access paths
unconditionally, this rule is always fired first during the optimization
process. A rule-based approach gives you flexibility. You may do exactly
the same as with materialized views - just enumerate all access paths and
add them to search space. But at the same time, you may develop more
complicated things, such as pruning, bitmap index joins, etc. This is not
possible with materialized views.

In this sense, the difference in complexity between materialized views
approach and rule-based approach with similar features (no pruning, etc) is
close to zero - in both cases you just iterate over existing indexes and
add them to search space. Rule-based approach becomes more complex when you
add more features to it.

вт, 2 июн. 2020 г. в 00:37, Haisheng Yuan :

> Hi Roman,
>
> There are some potential advantages.
>
> Let's still use this as example.
> SELECT * FROM foo WHERE a > 100 or b < 1000;
> Both column a and b have B-Tree indexes, note that it doesn't have to be
> bitmap index.
>
>  Bitmap Table Scan on foo
>->  BitmapOr
>  ->  Bitmap Index Scan on foo_index_a
>Index Cond: (a > 100)
>  ->  Bitmap Index Scan on foo_index_b
>Index Cond: (b < 1000)
>
> filter a:  
> filter b:---
>
> The bitmap scan just needs to read the tuples of satisfying the condition
> once, because the BitmapOr will eliminate duplicate tuple id. Constructing
> to UNION ALL will read the intersecting part twice, and it will increase
> the search space a little bit even when there is no available index.
>
> If we change the condition from OR to AND:
> SELECT * FROM foo WHERE a > 100 and b < 1000;
>
> Will MV-based approach have to choose which index to use?
>
> I would prefer the following plan:
>  Bitmap Table Scan on foo
>->  BitmapAnd
>  ->  Bitmap Index Scan on foo_index_a
>Index Cond: (a > 100)
>  ->  Bitmap Index Scan on foo_index_b
>Index Cond: (b < 1000)
>
> filter a:  
> filter b: ---
>
> It would be extremely helpful when the intersecting range is very small,
> because it just need to read the intersected data page only to fetch tuples.
>
> This can applies to arbitrary combinations of AND/OR conditions.
>
> This is an attractive feature for data warehousing customers.
>
> Another example:
> SELECT max(a) from foo;
>
> The query can be transformed to
> SELECT a from foo order by a desc limit 1;
>
> If there is a index on column 'a' but with ASC order.
> The collation trait request will pass to TableScan, if there is an index
> on the collation key. We just need to transform to an IndexScan with
> reverse index scan direction, which is already possible in current Calcite.
>
> Similarly, the user query may be simple but the index on a is ASC order:
> SELECT * from foo order by a desc;
>
> MV-based approach would have to register all the tablescan again with
> reversed index scan direction.
>
> There is another useful feature, that is IndexApply.
>
> The rule of Join2IndexApply will transform a logical join to logical index
> apply, then transform to physical indexed nested loop join.
> select * from foo join bar on foo.a = bar.f;
> There is an index on foo.a, but no index on bar.f.
>
> Nested Loop Inner Join
>   Join Filter: true
>   -> TableScan on bar
>   -> Index Scan using foo_index_a on foo
>Index Cond: foo.a = bar.f
>
> MV-based approach would have to sort table bar if it want to use the index.
>
> But considering Ignite is a in-memory computing platform, MV-based
> approach would be just good enough to serve its purpose.
>
> Looking at Drill's DbScanToIndexScanPrule.java, the way Apache Drill uses
> to tackle with index is not wrong, but I would not recommend to follow
> Drill's example to do it.
>
> Thanks,
> Haisheng
>
> On 2020/06/01 19:15:32, Julian Hyde  wrote:
> > I'm pleased there are a variety of approaches. People should use
> > whichever works for them and their use case.
> >
> > The so-called "rule-based" approach is definitely useful for
> > OLTP-style queries where you are accessing a few rows and you need to
> > plan quickly.
> >
> > Haisheng mentioned bitmap indexes a while back. Optimizing queries to
> > use bitmap indexes is interesting because they can be combined in more
> > ways than regular indexes. In this case, and many others, it is worth
> > thinking about the problem at a high level. For example, combining two
> > bitmap indexes can be modeled as a join, where the join keys are the
> > record ids, and the record ids are sorted within each index value.
> > Thinking at the high level can find plans that the rule-based approach
> > will never find.
> >
> > Indexes-as-MVs, 

Re: Using indexes rather than table scans with Calcite

2020-06-01 Thread Haisheng Yuan
Hi Roman,

There are some potential advantages.

Let's still use this as example.
SELECT * FROM foo WHERE a > 100 or b < 1000;
Both column a and b have B-Tree indexes, note that it doesn't have to be bitmap 
index.

 Bitmap Table Scan on foo
   ->  BitmapOr
 ->  Bitmap Index Scan on foo_index_a
   Index Cond: (a > 100)
 ->  Bitmap Index Scan on foo_index_b
   Index Cond: (b < 1000)

filter a:  
filter b:---

The bitmap scan just needs to read the tuples of satisfying the condition once, 
because the BitmapOr will eliminate duplicate tuple id. Constructing to UNION 
ALL will read the intersecting part twice, and it will increase the search 
space a little bit even when there is no available index.

If we change the condition from OR to AND:
SELECT * FROM foo WHERE a > 100 and b < 1000;

Will MV-based approach have to choose which index to use?

I would prefer the following plan:
 Bitmap Table Scan on foo
   ->  BitmapAnd
 ->  Bitmap Index Scan on foo_index_a
   Index Cond: (a > 100)
 ->  Bitmap Index Scan on foo_index_b
   Index Cond: (b < 1000)

filter a:  
filter b: ---

It would be extremely helpful when the intersecting range is very small, 
because it just need to read the intersected data page only to fetch tuples.

This can applies to arbitrary combinations of AND/OR conditions.

This is an attractive feature for data warehousing customers.

Another example:
SELECT max(a) from foo;

The query can be transformed to
SELECT a from foo order by a desc limit 1;

If there is a index on column 'a' but with ASC order.
The collation trait request will pass to TableScan, if there is an index on the 
collation key. We just need to transform to an IndexScan with reverse index 
scan direction, which is already possible in current Calcite.

Similarly, the user query may be simple but the index on a is ASC order:
SELECT * from foo order by a desc;

MV-based approach would have to register all the tablescan again with reversed 
index scan direction.

There is another useful feature, that is IndexApply.

The rule of Join2IndexApply will transform a logical join to logical index 
apply, then transform to physical indexed nested loop join.
select * from foo join bar on foo.a = bar.f;
There is an index on foo.a, but no index on bar.f.

Nested Loop Inner Join
  Join Filter: true
  -> TableScan on bar
  -> Index Scan using foo_index_a on foo
   Index Cond: foo.a = bar.f

MV-based approach would have to sort table bar if it want to use the index.

But considering Ignite is a in-memory computing platform, MV-based approach 
would be just good enough to serve its purpose.

Looking at Drill's DbScanToIndexScanPrule.java, the way Apache Drill uses to 
tackle with index is not wrong, but I would not recommend to follow Drill's 
example to do it.

Thanks,
Haisheng

On 2020/06/01 19:15:32, Julian Hyde  wrote: 
> I'm pleased there are a variety of approaches. People should use
> whichever works for them and their use case.
> 
> The so-called "rule-based" approach is definitely useful for
> OLTP-style queries where you are accessing a few rows and you need to
> plan quickly.
> 
> Haisheng mentioned bitmap indexes a while back. Optimizing queries to
> use bitmap indexes is interesting because they can be combined in more
> ways than regular indexes. In this case, and many others, it is worth
> thinking about the problem at a high level. For example, combining two
> bitmap indexes can be modeled as a join, where the join keys are the
> record ids, and the record ids are sorted within each index value.
> Thinking at the high level can find plans that the rule-based approach
> will never find.
> 
> Indexes-as-MVs, indexes-as-tables, and
> index-filtered-table-scan-as-join are other examples of the high-level
> approach.
> 
> Julian
> 
> 
> 
> On Mon, Jun 1, 2020 at 12:00 PM Roman Kondakov
>  wrote:
> >
> > Hi Xiening,
> >
> > the example was synthetic. What is still not clear for me is how to
> > exploit index sortedness with a rule based approach. As far as I
> > understand with this approach we need to write complex rules (for
> > example [1]) that should decide which index is useful and which is not
> > useful. These rules should take into account both statistics and
> > collations, so they do some part of work that should be done by a cost
> > model. And it makes writing such rules quite a difficult task.
> >
> > With a materialized views approach we can register all indexes as scans,
> > push filters to them if needed. And the cost model, not a rule, will
> > decide which index is better based on its cost and output collation.
> >
> > So the benefits of rule based approach are not so obvious to me. I would
> > really appreciate if you could tell me in what cases rule-based approach
> > is better. I understand that its definitely better in 

Re: Using indexes rather than table scans with Calcite

2020-06-01 Thread Julian Hyde
I'm pleased there are a variety of approaches. People should use
whichever works for them and their use case.

The so-called "rule-based" approach is definitely useful for
OLTP-style queries where you are accessing a few rows and you need to
plan quickly.

Haisheng mentioned bitmap indexes a while back. Optimizing queries to
use bitmap indexes is interesting because they can be combined in more
ways than regular indexes. In this case, and many others, it is worth
thinking about the problem at a high level. For example, combining two
bitmap indexes can be modeled as a join, where the join keys are the
record ids, and the record ids are sorted within each index value.
Thinking at the high level can find plans that the rule-based approach
will never find.

Indexes-as-MVs, indexes-as-tables, and
index-filtered-table-scan-as-join are other examples of the high-level
approach.

Julian



On Mon, Jun 1, 2020 at 12:00 PM Roman Kondakov
 wrote:
>
> Hi Xiening,
>
> the example was synthetic. What is still not clear for me is how to
> exploit index sortedness with a rule based approach. As far as I
> understand with this approach we need to write complex rules (for
> example [1]) that should decide which index is useful and which is not
> useful. These rules should take into account both statistics and
> collations, so they do some part of work that should be done by a cost
> model. And it makes writing such rules quite a difficult task.
>
> With a materialized views approach we can register all indexes as scans,
> push filters to them if needed. And the cost model, not a rule, will
> decide which index is better based on its cost and output collation.
>
> So the benefits of rule based approach are not so obvious to me. I would
> really appreciate if you could tell me in what cases rule-based approach
> is better. I understand that its definitely better in scenarios when the
> number of indexes is very high. But may be there are some other advantages?
>
> Thank you!
>
> [1]
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/index/rules/DbScanToIndexScanPrule.java
>
> --
> Kind Regards
> Roman Kondakov
>
>
> On 01.06.2020 21:00, Xiening Dai wrote:
> > Hi Roman,
> >
> > The example you mentioned is an advanced scenario. Note that there are 
> > different types of index, such as clustered index, secondary index, covered 
> > and non-covered index. In your case, typical OLTP/OLAP optimizer would 
> > create an index-based join on top of the range table scan (or 
> > FilteredTableScan in your term). And these transformation can definitely be 
> > based on rules. But the difficult part is actually the statistics 
> > estimation and cost calculation. You could end up with higher runtime cost 
> > with index based join when join cardinality is high.
> >
> > But back to the original question, if we’d like to leverage index on table 
> > scan, I think simple rule would serve the purpose. In fact, we have 
> > FilterTableScanPredicatePushdownRule in our system which does exactly the 
> > same thing.
> >
> >> On May 31, 2020, at 12:45 PM, Roman Kondakov  
> >> wrote:
> >>
> >> Hi Vladimir,
> >>
> >> thank you for sharing your point. Could you please clarify some details
> >> with a rulse-based index selection? You said
> >>
> >>> the fundamental problem with "indexes as materialized
> >>> views" approach is that you have to register them beforehand, instead of
> >>> using them only when needed.
> >>
> >> I agree, it's kind of a problem. What is not clear for me with
> >> IndexScanRule-based approach is how to decide when and which index we
> >> need? I understand that is is pretty easy to do in the case like this:
> >>
> >> Filter
> >>  Scan
> >>
> >> we can match the IndexScanRule on this pattern and do an index lookup
> >> using filter condition. But what to do in the more complex scenarios?
> >> Let's consider an example
> >>
> >> SELECT * FROM A JOIN B ON A.a=B.b WHERE A.c > 100
> >>
> >> where A.a, A.c and B.b are indexed fields. The logical plan for this
> >> query might look like this:
> >>
> >> LogicalJoin(A.a=B.b)
> >>  LogicalFilter(A.c > 100)
> >>LogicalScan(A)
> >>  LogicalScan(B)
> >>
> >> as I understand (please correct me if I'm wrong), with the rule-based
> >> approach, after allpying IndexScanRule the plan will look like this:
> >>
> >> LogicalJoin(A.a=B.b)
> >>  PhysicalIndexScan(A.c, lower bound = 100)
> >>  PhysicalTableScan(B)
> >>
> >> But in this case we lose the possibility of using index scans over A.a
> >> and B.b and joining them with MergeJoin, which can be more efficient
> >> plan in terms of the cost.
> >>
> >> My question is: how rule-based approach handle this scenario? Will it
> >> re-apply IndexScanRule once again to produce PhysicalIndexScan(A.a) and
> >> PhysicalIndexScan(B.b)? Or am I missing the crucial point of a rule-base
> >> approach?
> >>
> >> Thank you in advance!
> >>
> >>
> >> --
> >> Kind Regards
> >> Roman Kondakov
> >>
> 

Re: Using indexes rather than table scans with Calcite

2020-06-01 Thread Roman Kondakov
Hi Xiening,

the example was synthetic. What is still not clear for me is how to
exploit index sortedness with a rule based approach. As far as I
understand with this approach we need to write complex rules (for
example [1]) that should decide which index is useful and which is not
useful. These rules should take into account both statistics and
collations, so they do some part of work that should be done by a cost
model. And it makes writing such rules quite a difficult task.

With a materialized views approach we can register all indexes as scans,
push filters to them if needed. And the cost model, not a rule, will
decide which index is better based on its cost and output collation.

So the benefits of rule based approach are not so obvious to me. I would
really appreciate if you could tell me in what cases rule-based approach
is better. I understand that its definitely better in scenarios when the
number of indexes is very high. But may be there are some other advantages?

Thank you!

[1]
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/index/rules/DbScanToIndexScanPrule.java

-- 
Kind Regards
Roman Kondakov


On 01.06.2020 21:00, Xiening Dai wrote:
> Hi Roman,
> 
> The example you mentioned is an advanced scenario. Note that there are 
> different types of index, such as clustered index, secondary index, covered 
> and non-covered index. In your case, typical OLTP/OLAP optimizer would create 
> an index-based join on top of the range table scan (or FilteredTableScan in 
> your term). And these transformation can definitely be based on rules. But 
> the difficult part is actually the statistics estimation and cost 
> calculation. You could end up with higher runtime cost with index based join 
> when join cardinality is high.
> 
> But back to the original question, if we’d like to leverage index on table 
> scan, I think simple rule would serve the purpose. In fact, we have 
> FilterTableScanPredicatePushdownRule in our system which does exactly the 
> same thing.
> 
>> On May 31, 2020, at 12:45 PM, Roman Kondakov  
>> wrote:
>>
>> Hi Vladimir,
>>
>> thank you for sharing your point. Could you please clarify some details
>> with a rulse-based index selection? You said
>>
>>> the fundamental problem with "indexes as materialized
>>> views" approach is that you have to register them beforehand, instead of
>>> using them only when needed.
>>
>> I agree, it's kind of a problem. What is not clear for me with
>> IndexScanRule-based approach is how to decide when and which index we
>> need? I understand that is is pretty easy to do in the case like this:
>>
>> Filter
>>  Scan
>>
>> we can match the IndexScanRule on this pattern and do an index lookup
>> using filter condition. But what to do in the more complex scenarios?
>> Let's consider an example
>>
>> SELECT * FROM A JOIN B ON A.a=B.b WHERE A.c > 100
>>
>> where A.a, A.c and B.b are indexed fields. The logical plan for this
>> query might look like this:
>>
>> LogicalJoin(A.a=B.b)
>>  LogicalFilter(A.c > 100)
>>LogicalScan(A)
>>  LogicalScan(B)
>>
>> as I understand (please correct me if I'm wrong), with the rule-based
>> approach, after allpying IndexScanRule the plan will look like this:
>>
>> LogicalJoin(A.a=B.b)
>>  PhysicalIndexScan(A.c, lower bound = 100)
>>  PhysicalTableScan(B)
>>
>> But in this case we lose the possibility of using index scans over A.a
>> and B.b and joining them with MergeJoin, which can be more efficient
>> plan in terms of the cost.
>>
>> My question is: how rule-based approach handle this scenario? Will it
>> re-apply IndexScanRule once again to produce PhysicalIndexScan(A.a) and
>> PhysicalIndexScan(B.b)? Or am I missing the crucial point of a rule-base
>> approach?
>>
>> Thank you in advance!
>>
>>
>> -- 
>> Kind Regards
>> Roman Kondakov
>>
>>
>> On 31.05.2020 21:39, Vladimir Ozerov wrote:
>>> As already mentioned, the fundamental problem with "indexes as materialized
>>> views" approach is that you have to register them beforehand, instead of
>>> using them only when needed. On the other hand, the complexity of index
>>> planning comes from cost estimation and predicate splitting.
>>> Materializations cannot help you with that anyhow. This is why I call this
>>> approach (not materialized views per se) "hacky" - you reuse several simple
>>> parts of the Calcite infrastructure at the cost of loss in the flexibility
>>> of the planner, while the most complicated parts still need to be
>>> implemented by hand.
>>>
>>> Materialized views could be a good fit e.g for partial indexes because in
>>> this case, Calcite could help you with complex subsumption mechanics. But
>>> for standard indexes, the pros/cons balance is not that obvious.
>>>
>>> вс, 31 мая 2020 г. в 19:28, xu :
>>>
 Hi Tim,

 I am working on MySQL InnoDB adapter and trying to introduce this to
 Calcite, currently it is only in early stage, and not approved/reviewed by
 committers 

Re: Using indexes rather than table scans with Calcite

2020-06-01 Thread Xiening Dai
Hi Roman,

The example you mentioned is an advanced scenario. Note that there are 
different types of index, such as clustered index, secondary index, covered and 
non-covered index. In your case, typical OLTP/OLAP optimizer would create an 
index-based join on top of the range table scan (or FilteredTableScan in your 
term). And these transformation can definitely be based on rules. But the 
difficult part is actually the statistics estimation and cost calculation. You 
could end up with higher runtime cost with index based join when join 
cardinality is high.

But back to the original question, if we’d like to leverage index on table 
scan, I think simple rule would serve the purpose. In fact, we have 
FilterTableScanPredicatePushdownRule in our system which does exactly the same 
thing.

> On May 31, 2020, at 12:45 PM, Roman Kondakov  
> wrote:
> 
> Hi Vladimir,
> 
> thank you for sharing your point. Could you please clarify some details
> with a rulse-based index selection? You said
> 
>> the fundamental problem with "indexes as materialized
>> views" approach is that you have to register them beforehand, instead of
>> using them only when needed.
> 
> I agree, it's kind of a problem. What is not clear for me with
> IndexScanRule-based approach is how to decide when and which index we
> need? I understand that is is pretty easy to do in the case like this:
> 
> Filter
>  Scan
> 
> we can match the IndexScanRule on this pattern and do an index lookup
> using filter condition. But what to do in the more complex scenarios?
> Let's consider an example
> 
> SELECT * FROM A JOIN B ON A.a=B.b WHERE A.c > 100
> 
> where A.a, A.c and B.b are indexed fields. The logical plan for this
> query might look like this:
> 
> LogicalJoin(A.a=B.b)
>  LogicalFilter(A.c > 100)
>LogicalScan(A)
>  LogicalScan(B)
> 
> as I understand (please correct me if I'm wrong), with the rule-based
> approach, after allpying IndexScanRule the plan will look like this:
> 
> LogicalJoin(A.a=B.b)
>  PhysicalIndexScan(A.c, lower bound = 100)
>  PhysicalTableScan(B)
> 
> But in this case we lose the possibility of using index scans over A.a
> and B.b and joining them with MergeJoin, which can be more efficient
> plan in terms of the cost.
> 
> My question is: how rule-based approach handle this scenario? Will it
> re-apply IndexScanRule once again to produce PhysicalIndexScan(A.a) and
> PhysicalIndexScan(B.b)? Or am I missing the crucial point of a rule-base
> approach?
> 
> Thank you in advance!
> 
> 
> -- 
> Kind Regards
> Roman Kondakov
> 
> 
> On 31.05.2020 21:39, Vladimir Ozerov wrote:
>> As already mentioned, the fundamental problem with "indexes as materialized
>> views" approach is that you have to register them beforehand, instead of
>> using them only when needed. On the other hand, the complexity of index
>> planning comes from cost estimation and predicate splitting.
>> Materializations cannot help you with that anyhow. This is why I call this
>> approach (not materialized views per se) "hacky" - you reuse several simple
>> parts of the Calcite infrastructure at the cost of loss in the flexibility
>> of the planner, while the most complicated parts still need to be
>> implemented by hand.
>> 
>> Materialized views could be a good fit e.g for partial indexes because in
>> this case, Calcite could help you with complex subsumption mechanics. But
>> for standard indexes, the pros/cons balance is not that obvious.
>> 
>> вс, 31 мая 2020 г. в 19:28, xu :
>> 
>>> Hi Tim,
>>> 
>>> I am working on MySQL InnoDB adapter and trying to introduce this to
>>> Calcite, currently it is only in early stage, and not approved/reviewed by
>>> committers yet. Anyway, we are facing the same problem like what index to
>>> use, how to push down order by operation, etc. I have developed a simple
>>> rule based adapter to be "index aware" and being able to leverage a MySQL
>>> InnoDB storage engine written in Java. Hope this will help you to explore
>>> more options.
>>> 
>>> https://issues.apache.org/jira/browse/CALCITE-4034
>>> 
>>> Thanks,
>>> Xu
>>> 
>>> Haisheng Yuan  于2020年5月31日周日 下午10:06写道:
>>> 
 Hi Roman,
 
 Thank you for sharing your thoughts.
 
> It can be very tricky because the rule should consider not
> only filters, but also collations. This leads to increasing the
> complexity of such rules.
 
 Logical transformation rules like FilterTableScan2IndexTableScanRule
 should not consider physical properties, like collation, distribution.
>>> You
 forgot that we just reached consensus in CALCITE-3972. :)
 
 Regarding the 2nd option that uses index b, it is indeed not that easy
>>> for
 Calcite 1.22.0. In latest version, it now becomes possible. After rule
 transformation, during top-down trait request, the collation is passed
 through Filter, down to physical TableScan, which accepts the trait
>>> request
 with collation on b, find there is an index on b, and return a new
>>> 

Re: Using indexes rather than table scans with Calcite

2020-05-31 Thread Vladimir Ozerov
Hi Roman,

This heavily depends on the architecture of the planner. In Hazelcast we
have separate logical and physical phases. The goal of logical phase is
normalization of a relational tree. In this case your example is converted
to:

LogicalJoin
  LogicalConstrainedScan(A, c>100)
  LogicalScan(B)

Next, during physical planning, different implementations are considered.
For example, if for table A there are many indexes - sorted(a), hash(b),
sorted(c) - then It is possible to prune unnecessary access methods. E.g.,
hash(b) is not considered because it doesn’t add any interesting physical
property, and is costlier than other methods. At the same time, sorted(a)
is considered still, even though it has higher cost than sorted(c), because
it provides an interesting collation.

That is the key difference to materialized views - indexes are considered
as needed.

вс, 31 мая 2020 г. в 22:46, Roman Kondakov :

> Hi Vladimir,
>
> thank you for sharing your point. Could you please clarify some details
> with a rulse-based index selection? You said
>
> > the fundamental problem with "indexes as materialized
> > views" approach is that you have to register them beforehand, instead of
> > using them only when needed.
>
> I agree, it's kind of a problem. What is not clear for me with
> IndexScanRule-based approach is how to decide when and which index we
> need? I understand that is is pretty easy to do in the case like this:
>
> Filter
>   Scan
>
> we can match the IndexScanRule on this pattern and do an index lookup
> using filter condition. But what to do in the more complex scenarios?
> Let's consider an example
>
> SELECT * FROM A JOIN B ON A.a=B.b WHERE A.c > 100
>
> where A.a, A.c and B.b are indexed fields. The logical plan for this
> query might look like this:
>
> LogicalJoin(A.a=B.b)
>   LogicalFilter(A.c > 100)
> LogicalScan(A)
>   LogicalScan(B)
>
> as I understand (please correct me if I'm wrong), with the rule-based
> approach, after allpying IndexScanRule the plan will look like this:
>
> LogicalJoin(A.a=B.b)
>   PhysicalIndexScan(A.c, lower bound = 100)
>   PhysicalTableScan(B)
>
> But in this case we lose the possibility of using index scans over A.a
> and B.b and joining them with MergeJoin, which can be more efficient
> plan in terms of the cost.
>
> My question is: how rule-based approach handle this scenario? Will it
> re-apply IndexScanRule once again to produce PhysicalIndexScan(A.a) and
> PhysicalIndexScan(B.b)? Or am I missing the crucial point of a rule-base
> approach?
>
> Thank you in advance!
>
>
> --
> Kind Regards
> Roman Kondakov
>
>
> On 31.05.2020 21:39, Vladimir Ozerov wrote:
> > As already mentioned, the fundamental problem with "indexes as
> materialized
> > views" approach is that you have to register them beforehand, instead of
> > using them only when needed. On the other hand, the complexity of index
> > planning comes from cost estimation and predicate splitting.
> > Materializations cannot help you with that anyhow. This is why I call
> this
> > approach (not materialized views per se) "hacky" - you reuse several
> simple
> > parts of the Calcite infrastructure at the cost of loss in the
> flexibility
> > of the planner, while the most complicated parts still need to be
> > implemented by hand.
> >
> > Materialized views could be a good fit e.g for partial indexes because in
> > this case, Calcite could help you with complex subsumption mechanics. But
> > for standard indexes, the pros/cons balance is not that obvious.
> >
> > вс, 31 мая 2020 г. в 19:28, xu :
> >
> >> Hi Tim,
> >>
> >> I am working on MySQL InnoDB adapter and trying to introduce this to
> >> Calcite, currently it is only in early stage, and not approved/reviewed
> by
> >> committers yet. Anyway, we are facing the same problem like what index
> to
> >> use, how to push down order by operation, etc. I have developed a simple
> >> rule based adapter to be "index aware" and being able to leverage a
> MySQL
> >> InnoDB storage engine written in Java. Hope this will help you to
> explore
> >> more options.
> >>
> >> https://issues.apache.org/jira/browse/CALCITE-4034
> >>
> >> Thanks,
> >> Xu
> >>
> >> Haisheng Yuan  于2020年5月31日周日 下午10:06写道:
> >>
> >>> Hi Roman,
> >>>
> >>> Thank you for sharing your thoughts.
> >>>
>  It can be very tricky because the rule should consider not
>  only filters, but also collations. This leads to increasing the
>  complexity of such rules.
> >>>
> >>> Logical transformation rules like FilterTableScan2IndexTableScanRule
> >>> should not consider physical properties, like collation, distribution.
> >> You
> >>> forgot that we just reached consensus in CALCITE-3972. :)
> >>>
> >>> Regarding the 2nd option that uses index b, it is indeed not that easy
> >> for
> >>> Calcite 1.22.0. In latest version, it now becomes possible. After rule
> >>> transformation, during top-down trait request, the collation is passed
> >>> through Filter, down to physical TableScan, which 

Re: Using indexes rather than table scans with Calcite

2020-05-31 Thread Roman Kondakov
Hi Vladimir,

thank you for sharing your point. Could you please clarify some details
with a rulse-based index selection? You said

> the fundamental problem with "indexes as materialized
> views" approach is that you have to register them beforehand, instead of
> using them only when needed.

I agree, it's kind of a problem. What is not clear for me with
IndexScanRule-based approach is how to decide when and which index we
need? I understand that is is pretty easy to do in the case like this:

Filter
  Scan

we can match the IndexScanRule on this pattern and do an index lookup
using filter condition. But what to do in the more complex scenarios?
Let's consider an example

SELECT * FROM A JOIN B ON A.a=B.b WHERE A.c > 100

where A.a, A.c and B.b are indexed fields. The logical plan for this
query might look like this:

LogicalJoin(A.a=B.b)
  LogicalFilter(A.c > 100)
LogicalScan(A)
  LogicalScan(B)

as I understand (please correct me if I'm wrong), with the rule-based
approach, after allpying IndexScanRule the plan will look like this:

LogicalJoin(A.a=B.b)
  PhysicalIndexScan(A.c, lower bound = 100)
  PhysicalTableScan(B)

But in this case we lose the possibility of using index scans over A.a
and B.b and joining them with MergeJoin, which can be more efficient
plan in terms of the cost.

My question is: how rule-based approach handle this scenario? Will it
re-apply IndexScanRule once again to produce PhysicalIndexScan(A.a) and
PhysicalIndexScan(B.b)? Or am I missing the crucial point of a rule-base
approach?

Thank you in advance!


-- 
Kind Regards
Roman Kondakov


On 31.05.2020 21:39, Vladimir Ozerov wrote:
> As already mentioned, the fundamental problem with "indexes as materialized
> views" approach is that you have to register them beforehand, instead of
> using them only when needed. On the other hand, the complexity of index
> planning comes from cost estimation and predicate splitting.
> Materializations cannot help you with that anyhow. This is why I call this
> approach (not materialized views per se) "hacky" - you reuse several simple
> parts of the Calcite infrastructure at the cost of loss in the flexibility
> of the planner, while the most complicated parts still need to be
> implemented by hand.
> 
> Materialized views could be a good fit e.g for partial indexes because in
> this case, Calcite could help you with complex subsumption mechanics. But
> for standard indexes, the pros/cons balance is not that obvious.
> 
> вс, 31 мая 2020 г. в 19:28, xu :
> 
>> Hi Tim,
>>
>> I am working on MySQL InnoDB adapter and trying to introduce this to
>> Calcite, currently it is only in early stage, and not approved/reviewed by
>> committers yet. Anyway, we are facing the same problem like what index to
>> use, how to push down order by operation, etc. I have developed a simple
>> rule based adapter to be "index aware" and being able to leverage a MySQL
>> InnoDB storage engine written in Java. Hope this will help you to explore
>> more options.
>>
>> https://issues.apache.org/jira/browse/CALCITE-4034
>>
>> Thanks,
>> Xu
>>
>> Haisheng Yuan  于2020年5月31日周日 下午10:06写道:
>>
>>> Hi Roman,
>>>
>>> Thank you for sharing your thoughts.
>>>
 It can be very tricky because the rule should consider not
 only filters, but also collations. This leads to increasing the
 complexity of such rules.
>>>
>>> Logical transformation rules like FilterTableScan2IndexTableScanRule
>>> should not consider physical properties, like collation, distribution.
>> You
>>> forgot that we just reached consensus in CALCITE-3972. :)
>>>
>>> Regarding the 2nd option that uses index b, it is indeed not that easy
>> for
>>> Calcite 1.22.0. In latest version, it now becomes possible. After rule
>>> transformation, during top-down trait request, the collation is passed
>>> through Filter, down to physical TableScan, which accepts the trait
>> request
>>> with collation on b, find there is an index on b, and return a new
>> RelNode
>>> IndexScan. This process can be done in
>>> EnumerableTableScan#passThrough(required).
>>>
 I hope, when the Cascades-style optimizer become a part of Calcite, the
>>> search space
 pollution will not be a serious issue anymore.
>>>
>>> I hope so too. Top-down style can help alleviate space pollution by space
>>> pruning. But the space pollution caused by LogicalProject operator and
>> its
>>> related rules still can't be avoided. :)
>>>
>>> Again, thanks for sharing your experience with us.
>>>
>>> Haisheng
>>>
>>> On 2020/05/31 09:58:36, Roman Kondakov 
>>> wrote:
 Hi Haisheng,

 The basic rationale behind the using materialized views for secondary
 index representation instead of special rules like mentioned
 FilterTableScan2IndexTableScanRule is the simplicity of implementation.

 You are absolutely right that materialized views approach has an
>> obvious
 drawback that it should register all indexes as materialized views in
 the optimizer's search 

Re: Using indexes rather than table scans with Calcite

2020-05-31 Thread Vladimir Ozerov
As already mentioned, the fundamental problem with "indexes as materialized
views" approach is that you have to register them beforehand, instead of
using them only when needed. On the other hand, the complexity of index
planning comes from cost estimation and predicate splitting.
Materializations cannot help you with that anyhow. This is why I call this
approach (not materialized views per se) "hacky" - you reuse several simple
parts of the Calcite infrastructure at the cost of loss in the flexibility
of the planner, while the most complicated parts still need to be
implemented by hand.

Materialized views could be a good fit e.g for partial indexes because in
this case, Calcite could help you with complex subsumption mechanics. But
for standard indexes, the pros/cons balance is not that obvious.

вс, 31 мая 2020 г. в 19:28, xu :

> Hi Tim,
>
> I am working on MySQL InnoDB adapter and trying to introduce this to
> Calcite, currently it is only in early stage, and not approved/reviewed by
> committers yet. Anyway, we are facing the same problem like what index to
> use, how to push down order by operation, etc. I have developed a simple
> rule based adapter to be "index aware" and being able to leverage a MySQL
> InnoDB storage engine written in Java. Hope this will help you to explore
> more options.
>
> https://issues.apache.org/jira/browse/CALCITE-4034
>
> Thanks,
> Xu
>
> Haisheng Yuan  于2020年5月31日周日 下午10:06写道:
>
> > Hi Roman,
> >
> > Thank you for sharing your thoughts.
> >
> > > It can be very tricky because the rule should consider not
> > > only filters, but also collations. This leads to increasing the
> > > complexity of such rules.
> >
> > Logical transformation rules like FilterTableScan2IndexTableScanRule
> > should not consider physical properties, like collation, distribution.
> You
> > forgot that we just reached consensus in CALCITE-3972. :)
> >
> > Regarding the 2nd option that uses index b, it is indeed not that easy
> for
> > Calcite 1.22.0. In latest version, it now becomes possible. After rule
> > transformation, during top-down trait request, the collation is passed
> > through Filter, down to physical TableScan, which accepts the trait
> request
> > with collation on b, find there is an index on b, and return a new
> RelNode
> > IndexScan. This process can be done in
> > EnumerableTableScan#passThrough(required).
> >
> > > I hope, when the Cascades-style optimizer become a part of Calcite, the
> > search space
> > > pollution will not be a serious issue anymore.
> >
> > I hope so too. Top-down style can help alleviate space pollution by space
> > pruning. But the space pollution caused by LogicalProject operator and
> its
> > related rules still can't be avoided. :)
> >
> > Again, thanks for sharing your experience with us.
> >
> > Haisheng
> >
> > On 2020/05/31 09:58:36, Roman Kondakov 
> > wrote:
> > > Hi Haisheng,
> > >
> > > The basic rationale behind the using materialized views for secondary
> > > index representation instead of special rules like mentioned
> > > FilterTableScan2IndexTableScanRule is the simplicity of implementation.
> > >
> > > You are absolutely right that materialized views approach has an
> obvious
> > > drawback that it should register all indexes as materialized views in
> > > the optimizer's search space. But we expect our users will not overuse
> > > indexes because in general having too many indexes is a bad practice:
> on
> > > each table update we also should update it's indexes and it can cause
> > > some performance degradation of the system as a whole, not only
> > > optimizer. So we expect the number of indexes will be relatively small.
> > >
> > > Ignite uses indexes not only for index lookups, but also for exploiting
> > > it's sortedness. In this case materialized view's approach can show
> some
> > > advantages. Let's consider the example:
> > >
> > > SELECT * FROM foo WHERE a > 100 ORDER BY b;
> > >
> > > where both fields 'a' and 'b' are indexed. In this case we will have
> two
> > > alternatives of the query execution:
> > >
> > > 1. Use index 'a' for index conditioned scan and then sort rows by 'b'
> > > 2. Use index scan over 'b' and then apply filter over 'a' - here we can
> > > avoid sorting, because index over 'b' is already sorted.
> > >
> > > If I understand the approach with FilterTableScan2IndexTableScanRule
> > > correctly, at this step the rule should make a decision about which
> > > index to use. It can be very tricky because the rule should consider
> not
> > > only filters, but also collations. This leads to increasing the
> > > complexity of such rules. With materialized views approach we just
> > > register all indexes with their caollation and let the cost model do
> its
> > > job. Our rules are very simple. The complexity is encapsulated in the
> > > index scan cost estimation.
> > >
> > > As for your example with disjunctive predicate:
> > >
> > > > SELECT * FROM foo WHERE a > 100 or b < 1000;
> > >
> > > I think with 

Re: Using indexes rather than table scans with Calcite

2020-05-31 Thread xu
Hi Tim,

I am working on MySQL InnoDB adapter and trying to introduce this to
Calcite, currently it is only in early stage, and not approved/reviewed by
committers yet. Anyway, we are facing the same problem like what index to
use, how to push down order by operation, etc. I have developed a simple
rule based adapter to be "index aware" and being able to leverage a MySQL
InnoDB storage engine written in Java. Hope this will help you to explore
more options.

https://issues.apache.org/jira/browse/CALCITE-4034

Thanks,
Xu

Haisheng Yuan  于2020年5月31日周日 下午10:06写道:

> Hi Roman,
>
> Thank you for sharing your thoughts.
>
> > It can be very tricky because the rule should consider not
> > only filters, but also collations. This leads to increasing the
> > complexity of such rules.
>
> Logical transformation rules like FilterTableScan2IndexTableScanRule
> should not consider physical properties, like collation, distribution. You
> forgot that we just reached consensus in CALCITE-3972. :)
>
> Regarding the 2nd option that uses index b, it is indeed not that easy for
> Calcite 1.22.0. In latest version, it now becomes possible. After rule
> transformation, during top-down trait request, the collation is passed
> through Filter, down to physical TableScan, which accepts the trait request
> with collation on b, find there is an index on b, and return a new RelNode
> IndexScan. This process can be done in
> EnumerableTableScan#passThrough(required).
>
> > I hope, when the Cascades-style optimizer become a part of Calcite, the
> search space
> > pollution will not be a serious issue anymore.
>
> I hope so too. Top-down style can help alleviate space pollution by space
> pruning. But the space pollution caused by LogicalProject operator and its
> related rules still can't be avoided. :)
>
> Again, thanks for sharing your experience with us.
>
> Haisheng
>
> On 2020/05/31 09:58:36, Roman Kondakov 
> wrote:
> > Hi Haisheng,
> >
> > The basic rationale behind the using materialized views for secondary
> > index representation instead of special rules like mentioned
> > FilterTableScan2IndexTableScanRule is the simplicity of implementation.
> >
> > You are absolutely right that materialized views approach has an obvious
> > drawback that it should register all indexes as materialized views in
> > the optimizer's search space. But we expect our users will not overuse
> > indexes because in general having too many indexes is a bad practice: on
> > each table update we also should update it's indexes and it can cause
> > some performance degradation of the system as a whole, not only
> > optimizer. So we expect the number of indexes will be relatively small.
> >
> > Ignite uses indexes not only for index lookups, but also for exploiting
> > it's sortedness. In this case materialized view's approach can show some
> > advantages. Let's consider the example:
> >
> > SELECT * FROM foo WHERE a > 100 ORDER BY b;
> >
> > where both fields 'a' and 'b' are indexed. In this case we will have two
> > alternatives of the query execution:
> >
> > 1. Use index 'a' for index conditioned scan and then sort rows by 'b'
> > 2. Use index scan over 'b' and then apply filter over 'a' - here we can
> > avoid sorting, because index over 'b' is already sorted.
> >
> > If I understand the approach with FilterTableScan2IndexTableScanRule
> > correctly, at this step the rule should make a decision about which
> > index to use. It can be very tricky because the rule should consider not
> > only filters, but also collations. This leads to increasing the
> > complexity of such rules. With materialized views approach we just
> > register all indexes with their caollation and let the cost model do its
> > job. Our rules are very simple. The complexity is encapsulated in the
> > index scan cost estimation.
> >
> > As for your example with disjunctive predicate:
> >
> > > SELECT * FROM foo WHERE a > 100 or b < 1000;
> >
> > I think with materialized views approach we can do a complex logic as
> > well. For example, we may implement a special rule for such cases:
> > BitmapOrRule. Or, even better, we will rewrite such predicates to a
> > UNION ALL clause. A very good explanation and a comparison of both
> > approaches I've found in this Oracle's blog post [1].
> >
> > As a conclusion: choosing between materialized views approach and
> > IndexRule-based approach is a trade-off between complexity of
> > implementation and search space pollution. I hope, when the
> > Cascades-style optimizer become a part of Calcite, the search space
> > pollution will not be a serious issue anymore.
> >
> >
> > Thanks!
> >
> >
> > [1]
> >
> https://blogs.oracle.com/optimizer/optimizer-transformations:-or-expansion
> >
> > --
> > Kind Regards
> > Roman Kondakov
> >
> >
> > On 31.05.2020 07:31, Haisheng Yuan wrote:
> > > Thanks Julian and Roman for sharing the experiences for modeling
> indexes.
> > >
> > > Besides using materialized views, which is already proven by Phoenix
> and 

Re: Using indexes rather than table scans with Calcite

2020-05-31 Thread Haisheng Yuan
Hi Roman,

Thank you for sharing your thoughts.

> It can be very tricky because the rule should consider not
> only filters, but also collations. This leads to increasing the
> complexity of such rules.

Logical transformation rules like FilterTableScan2IndexTableScanRule should not 
consider physical properties, like collation, distribution. You forgot that we 
just reached consensus in CALCITE-3972. :)

Regarding the 2nd option that uses index b, it is indeed not that easy for 
Calcite 1.22.0. In latest version, it now becomes possible. After rule 
transformation, during top-down trait request, the collation is passed through 
Filter, down to physical TableScan, which accepts the trait request with 
collation on b, find there is an index on b, and return a new RelNode 
IndexScan. This process can be done in 
EnumerableTableScan#passThrough(required).

> I hope, when the Cascades-style optimizer become a part of Calcite, the 
> search space
> pollution will not be a serious issue anymore.

I hope so too. Top-down style can help alleviate space pollution by space 
pruning. But the space pollution caused by LogicalProject operator and its 
related rules still can't be avoided. :)

Again, thanks for sharing your experience with us.

Haisheng

On 2020/05/31 09:58:36, Roman Kondakov  wrote: 
> Hi Haisheng,
> 
> The basic rationale behind the using materialized views for secondary
> index representation instead of special rules like mentioned
> FilterTableScan2IndexTableScanRule is the simplicity of implementation.
> 
> You are absolutely right that materialized views approach has an obvious
> drawback that it should register all indexes as materialized views in
> the optimizer's search space. But we expect our users will not overuse
> indexes because in general having too many indexes is a bad practice: on
> each table update we also should update it's indexes and it can cause
> some performance degradation of the system as a whole, not only
> optimizer. So we expect the number of indexes will be relatively small.
> 
> Ignite uses indexes not only for index lookups, but also for exploiting
> it's sortedness. In this case materialized view's approach can show some
> advantages. Let's consider the example:
> 
> SELECT * FROM foo WHERE a > 100 ORDER BY b;
> 
> where both fields 'a' and 'b' are indexed. In this case we will have two
> alternatives of the query execution:
> 
> 1. Use index 'a' for index conditioned scan and then sort rows by 'b'
> 2. Use index scan over 'b' and then apply filter over 'a' - here we can
> avoid sorting, because index over 'b' is already sorted.
> 
> If I understand the approach with FilterTableScan2IndexTableScanRule
> correctly, at this step the rule should make a decision about which
> index to use. It can be very tricky because the rule should consider not
> only filters, but also collations. This leads to increasing the
> complexity of such rules. With materialized views approach we just
> register all indexes with their caollation and let the cost model do its
> job. Our rules are very simple. The complexity is encapsulated in the
> index scan cost estimation.
> 
> As for your example with disjunctive predicate:
> 
> > SELECT * FROM foo WHERE a > 100 or b < 1000;
> 
> I think with materialized views approach we can do a complex logic as
> well. For example, we may implement a special rule for such cases:
> BitmapOrRule. Or, even better, we will rewrite such predicates to a
> UNION ALL clause. A very good explanation and a comparison of both
> approaches I've found in this Oracle's blog post [1].
> 
> As a conclusion: choosing between materialized views approach and
> IndexRule-based approach is a trade-off between complexity of
> implementation and search space pollution. I hope, when the
> Cascades-style optimizer become a part of Calcite, the search space
> pollution will not be a serious issue anymore.
> 
> 
> Thanks!
> 
> 
> [1]
> https://blogs.oracle.com/optimizer/optimizer-transformations:-or-expansion
> 
> -- 
> Kind Regards
> Roman Kondakov
> 
> 
> On 31.05.2020 07:31, Haisheng Yuan wrote:
> > Thanks Julian and Roman for sharing the experiences for modeling indexes.
> > 
> > Besides using materialized views, which is already proven by Phoenix and 
> > Ignite, there is another approach, as mentioned by Vladimir, define your 
> > own rules and indexscan operators.
> > 
> > FilterTableScan2IndexScanRule and its variances, which match Filter over 
> > TableScan, create an IndexScan if the table has corresponding index on the 
> > filter column. You don't have to register different tablescans for all the 
> > indexes, in case you have nearly thousands indexes for a table, say 999, 
> > which is allowed in SQL Server 2016. It can also support more complex 
> > scenario, e.g.:
> > 
> > SELECT * FROM foo WHERE a > 100 or b < 1000;
> > If there is an index on column a, and another index on column b, we may 
> > need to utilize both indexes, through Bitmap TableScan.
> > 
> >  

Re: Using indexes rather than table scans with Calcite

2020-05-31 Thread Roman Kondakov
Hi Haisheng,

The basic rationale behind the using materialized views for secondary
index representation instead of special rules like mentioned
FilterTableScan2IndexTableScanRule is the simplicity of implementation.

You are absolutely right that materialized views approach has an obvious
drawback that it should register all indexes as materialized views in
the optimizer's search space. But we expect our users will not overuse
indexes because in general having too many indexes is a bad practice: on
each table update we also should update it's indexes and it can cause
some performance degradation of the system as a whole, not only
optimizer. So we expect the number of indexes will be relatively small.

Ignite uses indexes not only for index lookups, but also for exploiting
it's sortedness. In this case materialized view's approach can show some
advantages. Let's consider the example:

SELECT * FROM foo WHERE a > 100 ORDER BY b;

where both fields 'a' and 'b' are indexed. In this case we will have two
alternatives of the query execution:

1. Use index 'a' for index conditioned scan and then sort rows by 'b'
2. Use index scan over 'b' and then apply filter over 'a' - here we can
avoid sorting, because index over 'b' is already sorted.

If I understand the approach with FilterTableScan2IndexTableScanRule
correctly, at this step the rule should make a decision about which
index to use. It can be very tricky because the rule should consider not
only filters, but also collations. This leads to increasing the
complexity of such rules. With materialized views approach we just
register all indexes with their caollation and let the cost model do its
job. Our rules are very simple. The complexity is encapsulated in the
index scan cost estimation.

As for your example with disjunctive predicate:

> SELECT * FROM foo WHERE a > 100 or b < 1000;

I think with materialized views approach we can do a complex logic as
well. For example, we may implement a special rule for such cases:
BitmapOrRule. Or, even better, we will rewrite such predicates to a
UNION ALL clause. A very good explanation and a comparison of both
approaches I've found in this Oracle's blog post [1].

As a conclusion: choosing between materialized views approach and
IndexRule-based approach is a trade-off between complexity of
implementation and search space pollution. I hope, when the
Cascades-style optimizer become a part of Calcite, the search space
pollution will not be a serious issue anymore.


Thanks!


[1]
https://blogs.oracle.com/optimizer/optimizer-transformations:-or-expansion

-- 
Kind Regards
Roman Kondakov


On 31.05.2020 07:31, Haisheng Yuan wrote:
> Thanks Julian and Roman for sharing the experiences for modeling indexes.
> 
> Besides using materialized views, which is already proven by Phoenix and 
> Ignite, there is another approach, as mentioned by Vladimir, define your own 
> rules and indexscan operators.
> 
> FilterTableScan2IndexScanRule and its variances, which match Filter over 
> TableScan, create an IndexScan if the table has corresponding index on the 
> filter column. You don't have to register different tablescans for all the 
> indexes, in case you have nearly thousands indexes for a table, say 999, 
> which is allowed in SQL Server 2016. It can also support more complex 
> scenario, e.g.:
> 
> SELECT * FROM foo WHERE a > 100 or b < 1000;
> If there is an index on column a, and another index on column b, we may need 
> to utilize both indexes, through Bitmap TableScan.
> 
>  Bitmap Table Scan on foo
>->  BitmapOR
>  ->  Bitmap Index Scan on foo_index_a
>Index Condition: (a > 100)
>  ->  Bitmap Index Scan on foo_index_b
>Index Condition: (b < 1000)
> 
> But still, this approach requires some non-trivial work to do.
> 
> Hi Roman, I believe you definitely have consulted both approaches for Apache 
> Ignite to work with indexes, you decided to go with materialized views, there 
> are some reasons and tradeoffs to consider behind your decision. Do you mind 
> sharing with us?
> 
> Thanks,
> Haisheng
> 
> 
> On 2020/05/29 17:34:27, Julian Hyde  wrote: 
>> Materialized views are not a hack, as Vladimir claims. Materialized views 
>> are a fundamental concept in relational algebra, and they are an elegant way 
>> - in my opinion the correct way -  to model indexes (and many other 
>> structures).
>>
>> In Calcite materialized views are a feature in the planner that allows you 
>> to declare a table as equivalent to a query. (You do not need to issue a 
>> CREATE MATERIALIZED VIEW statement. They are internal.) Then you can use 
>> algebra to substitute one for the other, and apply cost-based optimization 
>> to choose between plans with & without the materialized view.
>>
>>
>> Other people have used this approach in the past. Search this list and you 
>> should find discussions. Also I have a talk with Maryann Xue about planning 
>> for indexes in Apache Phoenix[1].
>>
>> Julian

Re: Using indexes rather than table scans with Calcite

2020-05-30 Thread Haisheng Yuan
Thanks Julian and Roman for sharing the experiences for modeling indexes.

Besides using materialized views, which is already proven by Phoenix and 
Ignite, there is another approach, as mentioned by Vladimir, define your own 
rules and indexscan operators.

FilterTableScan2IndexScanRule and its variances, which match Filter over 
TableScan, create an IndexScan if the table has corresponding index on the 
filter column. You don't have to register different tablescans for all the 
indexes, in case you have nearly thousands indexes for a table, say 999, which 
is allowed in SQL Server 2016. It can also support more complex scenario, e.g.:

SELECT * FROM foo WHERE a > 100 or b < 1000;
If there is an index on column a, and another index on column b, we may need to 
utilize both indexes, through Bitmap TableScan.

 Bitmap Table Scan on foo
   ->  BitmapOR
 ->  Bitmap Index Scan on foo_index_a
   Index Condition: (a > 100)
 ->  Bitmap Index Scan on foo_index_b
   Index Condition: (b < 1000)

But still, this approach requires some non-trivial work to do.

Hi Roman, I believe you definitely have consulted both approaches for Apache 
Ignite to work with indexes, you decided to go with materialized views, there 
are some reasons and tradeoffs to consider behind your decision. Do you mind 
sharing with us?

Thanks,
Haisheng


On 2020/05/29 17:34:27, Julian Hyde  wrote: 
> Materialized views are not a hack, as Vladimir claims. Materialized views are 
> a fundamental concept in relational algebra, and they are an elegant way - in 
> my opinion the correct way -  to model indexes (and many other structures).
> 
> In Calcite materialized views are a feature in the planner that allows you to 
> declare a table as equivalent to a query. (You do not need to issue a CREATE 
> MATERIALIZED VIEW statement. They are internal.) Then you can use algebra to 
> substitute one for the other, and apply cost-based optimization to choose 
> between plans with & without the materialized view.
> 
> 
> Other people have used this approach in the past. Search this list and you 
> should find discussions. Also I have a talk with Maryann Xue about planning 
> for indexes in Apache Phoenix[1].
> 
> Julian
> 
> [1] 
> https://www.slideshare.net/julianhyde/costbased-query-optimization-in-apache-phoenix-using-apache-calcite
>  
> 
> 
> > On May 29, 2020, at 5:28 AM, Roman Kondakov  
> > wrote:
> > 
> > Hi Tim,
> > 
> > In Apache Ignite we've already faced this challenge. We solved it using
> > materialized views and FilterableTable. Let's consider your example:
> > 
> >> select * from users where country='UK' and some_other_column='foo';
> > 
> > with a primary index and a sorted secondary index (B+Tree?) over the
> > 'country' field.
> > 
> > As a first step Ignite registers all indexes that might be helpful for
> > the query plan in VolcanoPlanner as materialized views. For now we
> > register all indexes in all tables that participate in the query.
> > Registering all indexes might be excessive, but maybe we will apply some
> > pruning later. So it's ok for now to register all indexes.
> > 
> > Index materialized view is very simple: it's just a sorted table scan:
> > 
> > 'SELECT * from tbl ORDER BY idx_fields'
> > 
> > After registering indexes as materialized views, the optimizer's search
> > space will look like this:
> > 
> > Project([*])
> >  Filter[country='UK' and some_other_column='foo']
> >Set0
> > 
> > where Set0 consists of table and index scans:
> > 
> > TableScan('tbl', collation=[])
> > IndexScan('PK', collation=[PK_field])
> > IndexScan('country_idx', collation=[country])
> > 
> > At this step we have our index scans registered in the optimizer within
> > the same equivalence set as a TableScan.
> > 
> > The next step is a pushing filters down to the scans. We do it with a
> > rule which is similar to 'FilterTableScanRule'. After applying this rule
> > we have a search space that is looking like this:
> > 
> > Project([*])
> >  TableScan('tbl', collation=[], filter=[country='UK' and
> > some_other_column='foo'])
> >  IndexScan('PK', collation=[PK_field], filter=[country='UK' and
> > some_other_column='foo'])
> >  IndexScan('country_idx', collation=[country], filter=[country='UK' and
> > some_other_column='foo'])
> > 
> > And the final step is adjusting the cost model to make it select the
> > scan with the lower cost which depends on the filter conditions within
> > the Scan. For example, full table scan with filters
> > 
> > TableScan('tbl', collation=[], filter=[country='UK' and
> > some_other_column='foo'])
> > 
> > will cost, say, 100. Because it have to scan all rows and then filter
> > out some set of them. On the other hand the index scan that can do index
> > lookup instead of full scan will have a less cost. For example
> > 
> > IndexScan('country_idx', 

Re: Using indexes rather than table scans with Calcite

2020-05-29 Thread Danny Chan
Calcite does support table hint now, it's syntax is Oracle style[1], i saw that 
many engines support a INDEX hint to force a index scan on table[2] [3], maybe 
you can have a try also.

[1] https://calcite.apache.org/docs/reference.html#sql-hints
[2] https://docs.oracle.com/cd/B13789_01/server.101/b10752/hintsref.htm#5156
[3] 
https://docs.microsoft.com/en-us/sql/t-sql/queries/hints-transact-sql-table?view=sql-server-ver15

Best,
Danny Chan
在 2020年5月29日 +0800 PM4:44,Tim Fox ,写道:
> Hi,
>
> I'm building a query engine with Calcite - really enjoying working with
> Calcite so far!
>
> When creating a plan, it seems Calcite always creates a plan where the
> sources are table scans, however in my implementation the tables can have
> indexes on them so a table scan is not always the right choice.
>
> I was wondering if there was any way of making Calcite "index aware" - e.g.
> perhaps providing hints to the table scan instance that, actually, an index
> scan or a primary key lookup should be used instead of actually scanning
> the table. E.g. On the table meta-data if we provided information about any
> indexes on the table, then Calcite could figure out what parts of the query
> to push to the table scan and which to keep in the rest of the plan.
>
> There are two specific cases I really care about:
>
> 1. Queries that contain a primary key lookup:
>
> select * from some_table where key_column=23 AND some_other_column='foo';
>
> In the above case the 'select * from some_table where key_column=23' can be
> implemented as a simple PK lookup in the source table, not requiring a
> scan, thus leaving just the filter corresponding to
> 'some_other_column='foo'' in the rest of the plan
>
> 2. Queries with expressions on a column which has a secondary index
>
> select * from users where country='UK' and some_other_column='foo';
>
> We have many users, and let's say 10% of them are from UK (still a lot). We
> have a secondary index in the country column in the source table so we can
> do an efficient index scan to retrieve the matching records.
>
> I found this document
> https://calcite.apache.org/docs/materialized_views.html which seems like it
> might help me in some way.
>
> The idea being if I can think of my indexes as materialized views then the
> query can be written against those materialized views as sources instead of
> the original table sources. There appears to be a rule
> 'MaterializedViewRule' that does this already (?).
>
> This seems to get me a bit further, however, for this approach to work, it
> seems I would have to create materialized views _dynamically_ during
> evaluation of the query, register them, rewrite the query, execute it, then
> deregister the materialized view.
>
> E.g. for the primary key lookup example above, for the following query:
>
> select * from some_table where key_column=23 AND some_other_column='foo';
>
> I would need to dynamically create a materialized view corresponding to:
>
> select * from some_table where key_column=23
>
> Then rewrite the query using MaterializedViewRule.
>
> In the general case, in order to figure out what materialized views I need
> to dynamically create I would need to examine the query, figure out which
> columns in expressions have indexes on them and from them work out the best
> materialized view to create based on that information. This seems non
> trivial.
>
> Does anyone have any suggestions or pointers for how to implement this kind
> of thing? I suspect I'm not the first person to have tried to do this, as
> using indexes on tables seems a pretty common thing in many systems (?)


Re: Using indexes rather than table scans with Calcite

2020-05-29 Thread Julian Hyde
Materialized views are not a hack, as Vladimir claims. Materialized views are a 
fundamental concept in relational algebra, and they are an elegant way - in my 
opinion the correct way -  to model indexes (and many other structures).

In Calcite materialized views are a feature in the planner that allows you to 
declare a table as equivalent to a query. (You do not need to issue a CREATE 
MATERIALIZED VIEW statement. They are internal.) Then you can use algebra to 
substitute one for the other, and apply cost-based optimization to choose 
between plans with & without the materialized view.


Other people have used this approach in the past. Search this list and you 
should find discussions. Also I have a talk with Maryann Xue about planning for 
indexes in Apache Phoenix[1].

Julian

[1] 
https://www.slideshare.net/julianhyde/costbased-query-optimization-in-apache-phoenix-using-apache-calcite
 


> On May 29, 2020, at 5:28 AM, Roman Kondakov  
> wrote:
> 
> Hi Tim,
> 
> In Apache Ignite we've already faced this challenge. We solved it using
> materialized views and FilterableTable. Let's consider your example:
> 
>> select * from users where country='UK' and some_other_column='foo';
> 
> with a primary index and a sorted secondary index (B+Tree?) over the
> 'country' field.
> 
> As a first step Ignite registers all indexes that might be helpful for
> the query plan in VolcanoPlanner as materialized views. For now we
> register all indexes in all tables that participate in the query.
> Registering all indexes might be excessive, but maybe we will apply some
> pruning later. So it's ok for now to register all indexes.
> 
> Index materialized view is very simple: it's just a sorted table scan:
> 
> 'SELECT * from tbl ORDER BY idx_fields'
> 
> After registering indexes as materialized views, the optimizer's search
> space will look like this:
> 
> Project([*])
>  Filter[country='UK' and some_other_column='foo']
>Set0
> 
> where Set0 consists of table and index scans:
> 
> TableScan('tbl', collation=[])
> IndexScan('PK', collation=[PK_field])
> IndexScan('country_idx', collation=[country])
> 
> At this step we have our index scans registered in the optimizer within
> the same equivalence set as a TableScan.
> 
> The next step is a pushing filters down to the scans. We do it with a
> rule which is similar to 'FilterTableScanRule'. After applying this rule
> we have a search space that is looking like this:
> 
> Project([*])
>  TableScan('tbl', collation=[], filter=[country='UK' and
> some_other_column='foo'])
>  IndexScan('PK', collation=[PK_field], filter=[country='UK' and
> some_other_column='foo'])
>  IndexScan('country_idx', collation=[country], filter=[country='UK' and
> some_other_column='foo'])
> 
> And the final step is adjusting the cost model to make it select the
> scan with the lower cost which depends on the filter conditions within
> the Scan. For example, full table scan with filters
> 
> TableScan('tbl', collation=[], filter=[country='UK' and
> some_other_column='foo'])
> 
> will cost, say, 100. Because it have to scan all rows and then filter
> out some set of them. On the other hand the index scan that can do index
> lookup instead of full scan will have a less cost. For example
> 
> IndexScan('country_idx', collation=[country], filter=[country='UK' and
> some_other_column='foo'])
> 
> will have cost about ~10, or so because it has a good index condition
> `country='UK'` which can be used for index lookup that that returns only
> 10% of rows. And therefore this IndexScan should be chosen as the best
> plan by the optimizer.
> 
> We've recently implemented this approach in Apache Ignite and it works
> well for us. You can find it in [1]. This PR has many changes that are
> unrelated to the main topic. So particularly you can look at
> `IgnitePlanner.materializations()' method which registers indexes as
> materialized views and `IgniteTableScan` which performs filter
> conditions assessment and index scan cost estimation.
> 
> 
> [1] https://github.com/apache/ignite/pull/7813
> 
> -- 
> Kind Regards
> Roman Kondakov
> 
> 
> On 29.05.2020 11:44, Tim Fox wrote:
>> Hi,
>> 
>> I'm building a query engine with Calcite - really enjoying working with
>> Calcite so far!
>> 
>> When creating a plan, it seems Calcite always creates a plan where the
>> sources are table scans, however in my implementation the tables can have
>> indexes on them so a table scan is not always the right choice.
>> 
>> I was wondering if there was any way of making Calcite "index aware" - e.g.
>> perhaps providing hints to the table scan instance that, actually, an index
>> scan or a primary key lookup should be used instead of actually scanning
>> the table. E.g. On the table meta-data if we provided information about any
>> indexes on the table, then Calcite could figure out what parts of the query
>> to push to the 

Re: Using indexes rather than table scans with Calcite

2020-05-29 Thread Roman Kondakov
Hi Tim,

In Apache Ignite we've already faced this challenge. We solved it using
materialized views and FilterableTable. Let's consider your example:

> select * from users where country='UK' and some_other_column='foo';

with a primary index and a sorted secondary index (B+Tree?) over the
'country' field.

As a first step Ignite registers all indexes that might be helpful for
the query plan in VolcanoPlanner as materialized views. For now we
register all indexes in all tables that participate in the query.
Registering all indexes might be excessive, but maybe we will apply some
pruning later. So it's ok for now to register all indexes.

Index materialized view is very simple: it's just a sorted table scan:

'SELECT * from tbl ORDER BY idx_fields'

After registering indexes as materialized views, the optimizer's search
space will look like this:

Project([*])
  Filter[country='UK' and some_other_column='foo']
Set0

where Set0 consists of table and index scans:

TableScan('tbl', collation=[])
IndexScan('PK', collation=[PK_field])
IndexScan('country_idx', collation=[country])

At this step we have our index scans registered in the optimizer within
the same equivalence set as a TableScan.

The next step is a pushing filters down to the scans. We do it with a
rule which is similar to 'FilterTableScanRule'. After applying this rule
we have a search space that is looking like this:

Project([*])
  TableScan('tbl', collation=[], filter=[country='UK' and
some_other_column='foo'])
  IndexScan('PK', collation=[PK_field], filter=[country='UK' and
some_other_column='foo'])
  IndexScan('country_idx', collation=[country], filter=[country='UK' and
some_other_column='foo'])

And the final step is adjusting the cost model to make it select the
scan with the lower cost which depends on the filter conditions within
the Scan. For example, full table scan with filters

TableScan('tbl', collation=[], filter=[country='UK' and
some_other_column='foo'])

will cost, say, 100. Because it have to scan all rows and then filter
out some set of them. On the other hand the index scan that can do index
lookup instead of full scan will have a less cost. For example

IndexScan('country_idx', collation=[country], filter=[country='UK' and
some_other_column='foo'])

will have cost about ~10, or so because it has a good index condition
`country='UK'` which can be used for index lookup that that returns only
10% of rows. And therefore this IndexScan should be chosen as the best
plan by the optimizer.

We've recently implemented this approach in Apache Ignite and it works
well for us. You can find it in [1]. This PR has many changes that are
unrelated to the main topic. So particularly you can look at
`IgnitePlanner.materializations()' method which registers indexes as
materialized views and `IgniteTableScan` which performs filter
conditions assessment and index scan cost estimation.


[1] https://github.com/apache/ignite/pull/7813

-- 
Kind Regards
Roman Kondakov


On 29.05.2020 11:44, Tim Fox wrote:
> Hi,
> 
> I'm building a query engine with Calcite - really enjoying working with
> Calcite so far!
> 
> When creating a plan, it seems Calcite always creates a plan where the
> sources are table scans, however in my implementation the tables can have
> indexes on them so a table scan is not always the right choice.
> 
> I was wondering if there was any way of making Calcite "index aware" - e.g.
> perhaps providing hints to the table scan instance that, actually, an index
> scan or a primary key lookup should be used instead of actually scanning
> the table. E.g. On the table meta-data if we provided information about any
> indexes on the table, then Calcite could figure out what parts of the query
> to push to the table scan and which to keep in the rest of the plan.
> 
> There are two specific cases I really care about:
> 
> 1. Queries that contain a primary key lookup:
> 
> select * from some_table where key_column=23 AND some_other_column='foo';
> 
> In the above case the 'select * from some_table where key_column=23' can be
> implemented as a simple PK lookup in the source table, not requiring a
> scan, thus leaving just the filter corresponding to
> 'some_other_column='foo'' in the rest of the plan
> 
> 2. Queries with expressions on a column which has a secondary index
> 
> select * from users where country='UK' and some_other_column='foo';
> 
> We have many users, and let's say 10% of them are from UK (still a lot). We
> have a secondary index in the country column in the source table so we can
> do an efficient index scan to retrieve the matching records.
> 
> I found this document
> https://calcite.apache.org/docs/materialized_views.html which seems like it
> might help me in some way.
> 
> The idea being if I can think of my indexes as materialized views then the
> query can be written against those materialized views as sources instead of
> the original table sources. There appears to be a rule
> 'MaterializedViewRule' that does 

Re: Using indexes rather than table scans with Calcite

2020-05-29 Thread Vladimir Ozerov
Hi,

Products that use Apache Calcite typically implement index handling on
their own, since Apache Calcite has only limited support for physical
optimization. You may implement your own index scan operator and rules that
use this operator. For example, take a look how index planning is done in
Apache Drill.

Materialised views is a hacky way to achieve this, since normally they are
used for very different purposes.

пт, 29 мая 2020 г. в 11:44, Tim Fox :

> Hi,
>
> I'm building a query engine with Calcite - really enjoying working with
> Calcite so far!
>
> When creating a plan, it seems Calcite always creates a plan where the
> sources are table scans, however in my implementation the tables can have
> indexes on them so a table scan is not always the right choice.
>
> I was wondering if there was any way of making Calcite "index aware" - e.g.
> perhaps providing hints to the table scan instance that, actually, an index
> scan or a primary key lookup should be used instead of actually scanning
> the table. E.g. On the table meta-data if we provided information about any
> indexes on the table, then Calcite could figure out what parts of the query
> to push to the table scan and which to keep in the rest of the plan.
>
> There are two specific cases I really care about:
>
> 1. Queries that contain a primary key lookup:
>
> select * from some_table where key_column=23 AND some_other_column='foo';
>
> In the above case the 'select * from some_table where key_column=23' can be
> implemented as a simple PK lookup in the source table, not requiring a
> scan, thus leaving just the filter corresponding to
> 'some_other_column='foo'' in the rest of the plan
>
> 2. Queries with expressions on a column which has a secondary index
>
> select * from users where country='UK' and some_other_column='foo';
>
> We have many users, and let's say 10% of them are from UK (still a lot). We
> have a secondary index in the country column in the source table so we can
> do an efficient index scan to retrieve the matching records.
>
> I found this document
> https://calcite.apache.org/docs/materialized_views.html which seems like
> it
> might help me in some way.
>
> The idea being if I can think of my indexes as materialized views then the
> query can be written against those materialized views as sources instead of
> the original table sources. There appears to be a rule
> 'MaterializedViewRule' that does this already (?).
>
> This seems to get me a bit further, however, for this approach to work, it
> seems I would have to create materialized views _dynamically_ during
> evaluation of the query, register them, rewrite the query, execute it, then
> deregister the materialized view.
>
> E.g. for the primary key lookup example above, for the following query:
>
> select * from some_table where key_column=23 AND some_other_column='foo';
>
> I would need to dynamically create a materialized view corresponding to:
>
> select * from some_table where key_column=23
>
> Then rewrite the query using MaterializedViewRule.
>
> In the general case, in order to figure out what materialized views I need
> to dynamically create I would need to examine the query, figure out which
> columns in expressions have indexes on them and from them work out the best
> materialized view to create based on that information. This seems non
> trivial.
>
> Does anyone have any suggestions or pointers for how to implement this kind
> of thing? I suspect I'm not the first person to have tried to do this, as
> using indexes on tables seems a pretty common thing in many systems (?)
>


Using indexes rather than table scans with Calcite

2020-05-29 Thread Tim Fox
Hi,

I'm building a query engine with Calcite - really enjoying working with
Calcite so far!

When creating a plan, it seems Calcite always creates a plan where the
sources are table scans, however in my implementation the tables can have
indexes on them so a table scan is not always the right choice.

I was wondering if there was any way of making Calcite "index aware" - e.g.
perhaps providing hints to the table scan instance that, actually, an index
scan or a primary key lookup should be used instead of actually scanning
the table. E.g. On the table meta-data if we provided information about any
indexes on the table, then Calcite could figure out what parts of the query
to push to the table scan and which to keep in the rest of the plan.

There are two specific cases I really care about:

1. Queries that contain a primary key lookup:

select * from some_table where key_column=23 AND some_other_column='foo';

In the above case the 'select * from some_table where key_column=23' can be
implemented as a simple PK lookup in the source table, not requiring a
scan, thus leaving just the filter corresponding to
'some_other_column='foo'' in the rest of the plan

2. Queries with expressions on a column which has a secondary index

select * from users where country='UK' and some_other_column='foo';

We have many users, and let's say 10% of them are from UK (still a lot). We
have a secondary index in the country column in the source table so we can
do an efficient index scan to retrieve the matching records.

I found this document
https://calcite.apache.org/docs/materialized_views.html which seems like it
might help me in some way.

The idea being if I can think of my indexes as materialized views then the
query can be written against those materialized views as sources instead of
the original table sources. There appears to be a rule
'MaterializedViewRule' that does this already (?).

This seems to get me a bit further, however, for this approach to work, it
seems I would have to create materialized views _dynamically_ during
evaluation of the query, register them, rewrite the query, execute it, then
deregister the materialized view.

E.g. for the primary key lookup example above, for the following query:

select * from some_table where key_column=23 AND some_other_column='foo';

I would need to dynamically create a materialized view corresponding to:

select * from some_table where key_column=23

Then rewrite the query using MaterializedViewRule.

In the general case, in order to figure out what materialized views I need
to dynamically create I would need to examine the query, figure out which
columns in expressions have indexes on them and from them work out the best
materialized view to create based on that information. This seems non
trivial.

Does anyone have any suggestions or pointers for how to implement this kind
of thing? I suspect I'm not the first person to have tried to do this, as
using indexes on tables seems a pretty common thing in many systems (?)