[jira] [Created] (CALCITE-4039) Missing LATERAL keyword after sql validate

2020-06-02 Thread YufeiLiu (Jira)
YufeiLiu created CALCITE-4039:
-

 Summary: Missing LATERAL keyword after sql validate
 Key: CALCITE-4039
 URL: https://issues.apache.org/jira/browse/CALCITE-4039
 Project: Calcite
  Issue Type: Bug
  Components: core
Affects Versions: 1.23.0
Reporter: YufeiLiu


I tried format to sql string after validate SqlNode, but the {{LATERAL}} 
keyword was missing in the result.

original sql:

{code:sql}
SELECT `B`, `C`, `D`, `E`
FROM `source_table`,
LATERAL TABLE(`MY_TABLE_FUNCTION`(`A`)) AS `T` (`B`, `C`, `D`, `E`)
{code}

validated sql:

{code:sql}
SELECT `B`, `C`, `D`, `E`
FROM `source_table`,
TABLE(`MY_TABLE_FUNCTION`(`A`)) AS `T` (`B`, `C`, `D`, `E`)
{code}

In {{SqlValidatorImpl}}, only return newOperand and missing the LATERAL 
operator.

{code:java}
   case LATERAL:
  return registerFrom(
  parentScope,
  usingScope,
  register,
  ((SqlCall) node).operand(0),
  enclosingNode,
  alias,
  extendList,
  forceNullable,
  true);
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Using indexes rather than table scans with Calcite

2020-06-02 Thread Haisheng Yuan
> using this approach, bitmap indexes may be represented as set operations
> (union, intersect) over idx_a and idx_b followed by joining with scan_A
> using 'rowId'.

If I didn't get it wrong, joining with scan_A means it has to read all the 
tuples from table A, this will make index meaningless, because the purpose of 
using index in that example is to avoid reading all the tuples from table A.

> and it looks like it can be implemented without special
> Join2IndexApplyRule: we'll use correlation variable from join condition
> as filter condition.

Perhaps you have different definition of Join, but the Join operator in Calcite 
doesn't have correlation variable, only Correlate operator has, which is 
Calcite's version of Apply. You still need the rule to transform Join to 
Correlate, aka, Apply. In Calcite, the rule is JoinToCorrelateRule. However I 
don't recommend using JoinToCorrelateRule, because it will transform Join to 
Correlate unconditionally, in case of multi-way joins with join reordering 
enabled, it will generate a lot of useless alternatives, unless you want to 
support multi-level correlation, like:

NestedLoopOuterJoin
-> Table Scan on A
->  HashOuterJoin
 Join Condition: B.Y = C.W
-> Table Scan on B
-> Index Scan using C_Z_IDX on C
 Index Condition: C.Z = A.X

or

NestedLoopOuterJoin
-> Table Scan on SmallTable1 A
-> NestedLoopOuterJoin
-> Table Scan on SmallTable2 B
-> Index Scan using XYIndex on LargeTable C
Index Condition: C.X = A.AID and C.Y = B.BID

Otherwise, the join to correlate rule that matches a join with right child as a 
tablescan (or its variants) makes more sense. Transforming a Join to Correlate 
is meaningful only when there is at least 1 index can be used in the right 
relation, so it would be perfect if you can check if there is available index 
in the right relation or not, if you care about the search space.

On 2020/06/02 20:30:41, Roman Kondakov  wrote: 
> Vladimir, Haisheng,
> 
> thank you for sharing your thoughts. And special thanks to Haisheng for
> awesome examples.
> 
> I found Julian's reasoning about representing bitmap indexes as joins
> very deep and interesting. As I understand this idea right, for table
> A(a,b,c) indexes over columns, say, 'a' and 'b' can be represented as
> relations sorted by 'a' and 'b':
> 
> idx_a('rowId', 'a') [collation=[a]]
> idx_b('rowId', 'b') [collation=[b]]
> 
> Table A itself may be represented as a relation
> 
> scan_A('rowId', 'a', 'b', 'c') [collation=[]]
> 
> using this approach, bitmap indexes may be represented as set operations
> (union, intersect) over idx_a and idx_b followed by joining with scan_A
> using 'rowId'.
> 
> Index-only scans (aka covered indexes), can be modeled as scans over
> idx_a or idx_b without consequent join with the main table.
> 
> @Julian, did I get your idea right?
> 
> If one try to apply this approach directly, then the search space will
> be quickly polluted by those index relations. But if the Lattice-like
> framework from the Julian's mail is implemented, then it will be a great
> feature, I guess.
> 
> @Haisheng, BTW we are working on the indexed nested loop join right now
> and it looks like it can be implemented without special
> Join2IndexApplyRule: we'll use correlation variable from join condition
> as filter condition. If there is no index there - it will be a simple
> correlated nested loop join (or Apply). If index is present, the cost
> will be lower because index lookup will speed up this join, and this
> index scan will win.
> 
> > 
> > There is another useful feature, that is IndexApply.
> > 
> > The rule of Join2IndexApply will transform a logical join to logical index 
> > apply, then transform to physical indexed nested loop join.
> > select * from foo join bar on foo.a = bar.f;
> > There is an index on foo.a, but no index on bar.f.
> 
> 
> Thank you!
> 
> 
> -- 
> Kind Regards
> Roman Kondakov
> 
> 
> On 02.06.2020 19:55, Julian Hyde wrote:
> 
> > Vladimir,
> > 
> > I feel the same way. MVs are more powerful and general, and with some 
> > effort they could be just as efficient as other approaches. 
> > 
> > One problem that needs to be solved is the “registration problem”. If you 
> > have a lot of MVs they all have to be registered in the planner’s search 
> > space, otherwise the planner will not find them.
> > 
> > I would like to see a benchmark for this. Say we have 1,000 MVs and we plan 
> > 10,000 queries that potentially use them. The goal is that the planning 
> > cost Is comparable to only having 1 MV. We could achieve the goal by 
> > re-using the MVs across planner instances (thereby amortizing the cost of 
> > creating them) and “index” the planner’s search space.
> > 
> > The Lattice structure essentially does this for families for join-aggregate 
> > MVs (summary tables). We need something similar for project-sort MVs 
> > (indexes). 
> > 

Re: [DISCUSS] Towards Calcite-Avatica 1.17.0

2020-06-02 Thread Francis Chuang

Hey everyone,

Just a reminder that I'd like to make Avatica 1.17.0 rc0 available for 
voting soon. However, I am hoping we can review or merge a few open PRs 
for this release.


If anyone has any spare cycles to review or merge any of the open PRs on 
Github, it'd be highly appreciated!


Francis

On 28/05/2020 8:41 am, Francis Chuang wrote:
Now that Calcite 1.23.0 has been released I think it's a good time to 
try to get Avatica 1.17.0 released.


There are currently 10 open PRs on Github [1].

Can community members please review and comment on them if there are 
some spare cycles available?


If we can get some of those PRs merged, it would be really great and 
they can go into the 1.17.0 release.


Francis

[1] https://github.com/apache/calcite-avatica/pulls

On 5/05/2020 9:00 am, Francis Chuang wrote:
There hasn't been any commits to the repository since I last sent the 
message. Let's focus on Calcite 1.23.0 and revisit Avatica after the 
Calcite release.


Francis

On 20/04/2020 8:32 am, Francis Chuang wrote:

Hey everyone,

I am planning to make rc0 available for voting towards the end of 
April / start of May. How does this look in terms of timing?


Francis

On 12/04/2020 2:51 am, Stamatis Zampetakis wrote:

Thanks again for taking the lead on this Francis!

Personally, I am quite busy these days but will do my best to check 
1-2 PRs.


On Fri, Apr 10, 2020 at 2:57 PM Josh Elser  wrote:


Always a good idea.

I'll add this to my list and see if I can help get any committed. It's
been a while since I've looked at the list.

On 4/8/20 7:59 PM, Francis Chuang wrote:

The last avatica release was in December last year.
  From activity on our mailing lists and the Calcite repository, 
it feels

like there's currently a lull and things aren't as active.

Would this be a good opportunity to work on Avatica and push a 
release

out?


There are currently 10 open PRs for Avatica [1] and I think if 
they are
reviewed and merged, it wold be possible to close a huge chunk of 
them.


I am happy to be RM for this release, but will need help from the
community to review and merge those PRs.

Francis

[1] https://github.com/apache/calcite-avatica/pulls






[jira] [Created] (CALCITE-4038) Refactor RexVisitor, RexBiVisitor, RelOptUtil.InputFinder

2020-06-02 Thread Julian Hyde (Jira)
Julian Hyde created CALCITE-4038:


 Summary: Refactor RexVisitor, RexBiVisitor, RelOptUtil.InputFinder
 Key: CALCITE-4038
 URL: https://issues.apache.org/jira/browse/CALCITE-4038
 Project: Calcite
  Issue Type: Bug
Reporter: Julian Hyde


Various improvements to {{RexVisitor}}, {{RexBiVisitor}}, 
{{RelOptUtil.InputFinder}}  classes.

1. Make mutable field {{RelOptUtil.InputFinder.inputBitSet}} private. The field 
still exists, public and deprecated, but it shadows a new private field. Rather 
than calling {{InputFinder.inputBitSet.build()}} you should now call 
{{InputFinder.build()}}.

2. In {{RexVisitor}} and {{RexBiVisitor}} add methods {{visitList}} and 
{{visitEach}} (the former returns a list, and the latter returns void).

3. Deprecate {{RexShuttle.apply}} in favor of {{visitList}}.

4. Add {{RexWindowBound.accept(RexBiVisitor)}}.

5. For {{RexBiVisitor}} add a method {{visitEachIndexed}}, the 
index passed as the 'payload' argument.

6. Add abstract implementations of {{RexBiVisitor}}: {{RexUnaryBiVisitor}} and 
{{RexBiVisitorImpl}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Using indexes rather than table scans with Calcite

2020-06-02 Thread Roman Kondakov
Vladimir, Haisheng,

thank you for sharing your thoughts. And special thanks to Haisheng for
awesome examples.

I found Julian's reasoning about representing bitmap indexes as joins
very deep and interesting. As I understand this idea right, for table
A(a,b,c) indexes over columns, say, 'a' and 'b' can be represented as
relations sorted by 'a' and 'b':

idx_a('rowId', 'a') [collation=[a]]
idx_b('rowId', 'b') [collation=[b]]

Table A itself may be represented as a relation

scan_A('rowId', 'a', 'b', 'c') [collation=[]]

using this approach, bitmap indexes may be represented as set operations
(union, intersect) over idx_a and idx_b followed by joining with scan_A
using 'rowId'.

Index-only scans (aka covered indexes), can be modeled as scans over
idx_a or idx_b without consequent join with the main table.

@Julian, did I get your idea right?

If one try to apply this approach directly, then the search space will
be quickly polluted by those index relations. But if the Lattice-like
framework from the Julian's mail is implemented, then it will be a great
feature, I guess.

@Haisheng, BTW we are working on the indexed nested loop join right now
and it looks like it can be implemented without special
Join2IndexApplyRule: we'll use correlation variable from join condition
as filter condition. If there is no index there - it will be a simple
correlated nested loop join (or Apply). If index is present, the cost
will be lower because index lookup will speed up this join, and this
index scan will win.

> 
> There is another useful feature, that is IndexApply.
> 
> The rule of Join2IndexApply will transform a logical join to logical index 
> apply, then transform to physical indexed nested loop join.
> select * from foo join bar on foo.a = bar.f;
> There is an index on foo.a, but no index on bar.f.


Thank you!


-- 
Kind Regards
Roman Kondakov


On 02.06.2020 19:55, Julian Hyde wrote:

> Vladimir,
> 
> I feel the same way. MVs are more powerful and general, and with some effort 
> they could be just as efficient as other approaches. 
> 
> One problem that needs to be solved is the “registration problem”. If you 
> have a lot of MVs they all have to be registered in the planner’s search 
> space, otherwise the planner will not find them.
> 
> I would like to see a benchmark for this. Say we have 1,000 MVs and we plan 
> 10,000 queries that potentially use them. The goal is that the planning cost 
> Is comparable to only having 1 MV. We could achieve the goal by re-using the 
> MVs across planner instances (thereby amortizing the cost of creating them) 
> and “index” the planner’s search space.
> 
> The Lattice structure essentially does this for families for join-aggregate 
> MVs (summary tables). We need something similar for project-sort MVs 
> (indexes). 
> 
> Julian






On 02.06.2020 10:05, Vladimir Ozerov wrote:
> Hi Roman,
> 
> To me, there is no principal difference between materialized views and
> rule-based approaches - they are both "rules". In the materialized views
> approach the rule is to create all alternative access paths
> unconditionally, this rule is always fired first during the optimization
> process. A rule-based approach gives you flexibility. You may do exactly
> the same as with materialized views - just enumerate all access paths and
> add them to search space. But at the same time, you may develop more
> complicated things, such as pruning, bitmap index joins, etc. This is not
> possible with materialized views.
> 
> In this sense, the difference in complexity between materialized views
> approach and rule-based approach with similar features (no pruning, etc) is
> close to zero - in both cases you just iterate over existing indexes and
> add them to search space. Rule-based approach becomes more complex when you
> add more features to it.
> 
> вт, 2 июн. 2020 г. в 00:37, Haisheng Yuan :
> 
>> Hi Roman,
>>
>> There are some potential advantages.
>>
>> Let's still use this as example.
>> SELECT * FROM foo WHERE a > 100 or b < 1000;
>> Both column a and b have B-Tree indexes, note that it doesn't have to be
>> bitmap index.
>>
>>  Bitmap Table Scan on foo
>>->  BitmapOr
>>  ->  Bitmap Index Scan on foo_index_a
>>Index Cond: (a > 100)
>>  ->  Bitmap Index Scan on foo_index_b
>>Index Cond: (b < 1000)
>>
>> filter a:  
>> filter b:---
>>
>> The bitmap scan just needs to read the tuples of satisfying the condition
>> once, because the BitmapOr will eliminate duplicate tuple id. Constructing
>> to UNION ALL will read the intersecting part twice, and it will increase
>> the search space a little bit even when there is no available index.
>>
>> If we change the condition from OR to AND:
>> SELECT * FROM foo WHERE a > 100 and b < 1000;
>>
>> Will MV-based approach have to choose which index to use?
>>
>> I would prefer the following plan:
>>  Bitmap Table Scan on foo
>>->  BitmapAnd
>>  ->  

Re: Using indexes rather than table scans with Calcite

2020-06-02 Thread Julian Hyde
Vladimir,

I feel the same way. MVs are more powerful and general, and with some effort 
they could be just as efficient as other approaches. 

One problem that needs to be solved is the “registration problem”. If you have 
a lot of MVs they all have to be registered in the planner’s search space, 
otherwise the planner will not find them.

I would like to see a benchmark for this. Say we have 1,000 MVs and we plan 
10,000 queries that potentially use them. The goal is that the planning cost Is 
comparable to only having 1 MV. We could achieve the goal by re-using the MVs 
across planner instances (thereby amortizing the cost of creating them) and 
“index” the planner’s search space.

The Lattice structure essentially does this for families for join-aggregate MVs 
(summary tables). We need something similar for project-sort MVs (indexes). 

Julian

> On Jun 2, 2020, at 12:05 AM, Vladimir Ozerov  wrote:
> 


Re: Regarding CALCITE-3997

2020-06-02 Thread Ruben Q L
Hello Haisheng,

thanks for bringing up this discussion. I was also part of the people who
expressed concerns about CALCITE-3997 due to a backwards compatibility
issue in a very specific scenario in a downstream project. This issue was
satisfactory handled with a temporary workaround for the RC, and now this
workaround needs to be removed for the next version (CALCITE-4032).
I agree with your analysis, and I think my specific scenario should not
prevent this new general approach to be fully implemented (including
CalcMergerRule). Thus, I am ok with CALCITE-4032, I will implement my own
rule on my side to solve my specific problem, under no circumstances this
should be the responsibility of the Calcite library.

Best regards,
Ruben




Le lun. 1 juin 2020 à 23:31, Stamatis Zampetakis  a
écrit :

> Hello,
>
> Since I was among the people who raised some concerns about this during the
> vote I would like to mention that it was mostly about the fact it came very
> late in the release cycle.
> It is better to avoid breaking changes in the RC phase since it may be
> difficult for people to react or test it.
>
> I agree with Haisheng that Calcite code should follow the best practice so
> that other projects can draw inspiration from it. Nevertheless, I would
> like to mention again that Enumerable is not a convention that is used only
> for demonstration purposes but it is also used in production systems so we
> have to be careful about backward compatibility. Still nobody complaint so
> far so it means that we are doing pretty good on this aspect :)
>
> The release brought some very cool features and I fully support the
> direction on which we are going. I don't think there is a need to open a
> new thread here since we can discuss under the respective jira(s) if
> necessary.
>
> Best,
> Stamatis
>
>
>
>
>
>
> On Wed, May 27, 2020 at 5:07 AM Haisheng Yuan  wrote:
>
> > Hi all,
> >
> > During vote of 1.23.0 rc1, there are some discussions about the solution
> > of CALCITE-3997.
> >
> > Some may think it is a matter of best practice, but in many cases it is a
> > matter of right or wrong.
> >
> > Here is the rationale.
> >
> > The original query in CALCITE-3997 is
> >
> > SELECT t1.k1, t2.k1 FROM t1, t2
> > ON t1.k1=t2.k1
> > WHERE t1.n1 + 1 = t2.n3
> >
> > After FilterIntoJoin and MergeJoinRule, planner generates an alternative:
> > MergeJoin on t1.k1=t2.k1 and t1.n1 + 1 = t2.n3
> >
> > The mergejoin's collation is [k1].
> >
> > This mergejoin matches JoinPushExpressionsRule, which pushes expression
> in
> > join condition, and generates a Project in child.
> > Project t1.k1, t2.k1
> >   -> MergeJoin on t1.k1=t2.k1 and t1.col = t2.n3
> >-> Project k1, n1+1 as col
> > -> TableScan t1
> >-> TableScan t2
> >
> > The new mergejoin now has 2 equi-conditions, but the collation is the
> same
> > with old mergejoin, still [k1], which is wrong.
> > It should be updated to [k1,col] and [k1, n3]. Hence error.
> >
> > The similar issue already happened before, see CALCITE-3353. And it is
> > very easy to trigger same error using different query even without
> > FilterIntoJoin. If we change WHERE to AND in the query,
> > SELECT t1.k1, t2.k1 FROM t1, t2
> > ON t1.k1=t2.k1
> > AND t1.n1 + 1 = t2.n3
> >
> > or with the following query with join condition const reductions on, we
> > can still see the issue.
> > SELECT t1.k1, t2.k1 FROM t1, t2
> > ON t1.k1=t2.k1
> > AND (t1.n1 + 1 = t2.n3 or t1.n1 + 1 = t2.n3)
> >
> > Can we use the similar approach that in CALCITE-3353 to fix the issue?
> > We can, but it just hide the issue. Doing that way can generate mergejoin
> > with correct traitset, but the rule is still generating an invalid plan.
> > Because the MergeJoin's new input is a LogicalProject, which is generated
> > by RelBuilder.
> >
> > In Calcite, RelNode is a very special class. Unlike operators in Cascades
> > or Volcano paper, where logical/physical operator and plan expression are
> > separated, when constructing logical/physical expression tree, each
> > operator doesn't require anything trait from its child. But RelNode in
> > Calcite not only represents operator, but also plan expression, and can
> > even represent a MEMO group with required traitset, which is RelSubset.
> > When constructing a RelNode XXX, we need to specify the input RelNodes,
> the
> > input nodes' traitsets will become node XXX's requirement on those
> inputs,
> > even we are not aware that we are requesting it, which is extremely
> > error-prone.
> >
> > In above case, the new MergeJoin's left input is a LogicalProject, which
> > means the physical MergeJoin is requesting its left input to be logical
> > operator with NONE convention, which has {inf} large cost. That means the
> > newly generated MergeJoin is never implementable.
> >
> > What if we have a physical RelBuilder to generate the Project in the
> rule?
> > The consequence is worse.
> > With logical RelBuilder, it just generates an invalid plan, we are 

Re: Using indexes rather than table scans with Calcite

2020-06-02 Thread Vladimir Ozerov
Hi Roman,

To me, there is no principal difference between materialized views and
rule-based approaches - they are both "rules". In the materialized views
approach the rule is to create all alternative access paths
unconditionally, this rule is always fired first during the optimization
process. A rule-based approach gives you flexibility. You may do exactly
the same as with materialized views - just enumerate all access paths and
add them to search space. But at the same time, you may develop more
complicated things, such as pruning, bitmap index joins, etc. This is not
possible with materialized views.

In this sense, the difference in complexity between materialized views
approach and rule-based approach with similar features (no pruning, etc) is
close to zero - in both cases you just iterate over existing indexes and
add them to search space. Rule-based approach becomes more complex when you
add more features to it.

вт, 2 июн. 2020 г. в 00:37, Haisheng Yuan :

> Hi Roman,
>
> There are some potential advantages.
>
> Let's still use this as example.
> SELECT * FROM foo WHERE a > 100 or b < 1000;
> Both column a and b have B-Tree indexes, note that it doesn't have to be
> bitmap index.
>
>  Bitmap Table Scan on foo
>->  BitmapOr
>  ->  Bitmap Index Scan on foo_index_a
>Index Cond: (a > 100)
>  ->  Bitmap Index Scan on foo_index_b
>Index Cond: (b < 1000)
>
> filter a:  
> filter b:---
>
> The bitmap scan just needs to read the tuples of satisfying the condition
> once, because the BitmapOr will eliminate duplicate tuple id. Constructing
> to UNION ALL will read the intersecting part twice, and it will increase
> the search space a little bit even when there is no available index.
>
> If we change the condition from OR to AND:
> SELECT * FROM foo WHERE a > 100 and b < 1000;
>
> Will MV-based approach have to choose which index to use?
>
> I would prefer the following plan:
>  Bitmap Table Scan on foo
>->  BitmapAnd
>  ->  Bitmap Index Scan on foo_index_a
>Index Cond: (a > 100)
>  ->  Bitmap Index Scan on foo_index_b
>Index Cond: (b < 1000)
>
> filter a:  
> filter b: ---
>
> It would be extremely helpful when the intersecting range is very small,
> because it just need to read the intersected data page only to fetch tuples.
>
> This can applies to arbitrary combinations of AND/OR conditions.
>
> This is an attractive feature for data warehousing customers.
>
> Another example:
> SELECT max(a) from foo;
>
> The query can be transformed to
> SELECT a from foo order by a desc limit 1;
>
> If there is a index on column 'a' but with ASC order.
> The collation trait request will pass to TableScan, if there is an index
> on the collation key. We just need to transform to an IndexScan with
> reverse index scan direction, which is already possible in current Calcite.
>
> Similarly, the user query may be simple but the index on a is ASC order:
> SELECT * from foo order by a desc;
>
> MV-based approach would have to register all the tablescan again with
> reversed index scan direction.
>
> There is another useful feature, that is IndexApply.
>
> The rule of Join2IndexApply will transform a logical join to logical index
> apply, then transform to physical indexed nested loop join.
> select * from foo join bar on foo.a = bar.f;
> There is an index on foo.a, but no index on bar.f.
>
> Nested Loop Inner Join
>   Join Filter: true
>   -> TableScan on bar
>   -> Index Scan using foo_index_a on foo
>Index Cond: foo.a = bar.f
>
> MV-based approach would have to sort table bar if it want to use the index.
>
> But considering Ignite is a in-memory computing platform, MV-based
> approach would be just good enough to serve its purpose.
>
> Looking at Drill's DbScanToIndexScanPrule.java, the way Apache Drill uses
> to tackle with index is not wrong, but I would not recommend to follow
> Drill's example to do it.
>
> Thanks,
> Haisheng
>
> On 2020/06/01 19:15:32, Julian Hyde  wrote:
> > I'm pleased there are a variety of approaches. People should use
> > whichever works for them and their use case.
> >
> > The so-called "rule-based" approach is definitely useful for
> > OLTP-style queries where you are accessing a few rows and you need to
> > plan quickly.
> >
> > Haisheng mentioned bitmap indexes a while back. Optimizing queries to
> > use bitmap indexes is interesting because they can be combined in more
> > ways than regular indexes. In this case, and many others, it is worth
> > thinking about the problem at a high level. For example, combining two
> > bitmap indexes can be modeled as a join, where the join keys are the
> > record ids, and the record ids are sorted within each index value.
> > Thinking at the high level can find plans that the rule-based approach
> > will never find.
> >
> > Indexes-as-MVs,