[jira] [Created] (CALCITE-4039) Missing LATERAL keyword after sql validate
YufeiLiu created CALCITE-4039: - Summary: Missing LATERAL keyword after sql validate Key: CALCITE-4039 URL: https://issues.apache.org/jira/browse/CALCITE-4039 Project: Calcite Issue Type: Bug Components: core Affects Versions: 1.23.0 Reporter: YufeiLiu I tried format to sql string after validate SqlNode, but the {{LATERAL}} keyword was missing in the result. original sql: {code:sql} SELECT `B`, `C`, `D`, `E` FROM `source_table`, LATERAL TABLE(`MY_TABLE_FUNCTION`(`A`)) AS `T` (`B`, `C`, `D`, `E`) {code} validated sql: {code:sql} SELECT `B`, `C`, `D`, `E` FROM `source_table`, TABLE(`MY_TABLE_FUNCTION`(`A`)) AS `T` (`B`, `C`, `D`, `E`) {code} In {{SqlValidatorImpl}}, only return newOperand and missing the LATERAL operator. {code:java} case LATERAL: return registerFrom( parentScope, usingScope, register, ((SqlCall) node).operand(0), enclosingNode, alias, extendList, forceNullable, true); {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: Using indexes rather than table scans with Calcite
> using this approach, bitmap indexes may be represented as set operations > (union, intersect) over idx_a and idx_b followed by joining with scan_A > using 'rowId'. If I didn't get it wrong, joining with scan_A means it has to read all the tuples from table A, this will make index meaningless, because the purpose of using index in that example is to avoid reading all the tuples from table A. > and it looks like it can be implemented without special > Join2IndexApplyRule: we'll use correlation variable from join condition > as filter condition. Perhaps you have different definition of Join, but the Join operator in Calcite doesn't have correlation variable, only Correlate operator has, which is Calcite's version of Apply. You still need the rule to transform Join to Correlate, aka, Apply. In Calcite, the rule is JoinToCorrelateRule. However I don't recommend using JoinToCorrelateRule, because it will transform Join to Correlate unconditionally, in case of multi-way joins with join reordering enabled, it will generate a lot of useless alternatives, unless you want to support multi-level correlation, like: NestedLoopOuterJoin -> Table Scan on A -> HashOuterJoin Join Condition: B.Y = C.W -> Table Scan on B -> Index Scan using C_Z_IDX on C Index Condition: C.Z = A.X or NestedLoopOuterJoin -> Table Scan on SmallTable1 A -> NestedLoopOuterJoin -> Table Scan on SmallTable2 B -> Index Scan using XYIndex on LargeTable C Index Condition: C.X = A.AID and C.Y = B.BID Otherwise, the join to correlate rule that matches a join with right child as a tablescan (or its variants) makes more sense. Transforming a Join to Correlate is meaningful only when there is at least 1 index can be used in the right relation, so it would be perfect if you can check if there is available index in the right relation or not, if you care about the search space. On 2020/06/02 20:30:41, Roman Kondakov wrote: > Vladimir, Haisheng, > > thank you for sharing your thoughts. And special thanks to Haisheng for > awesome examples. > > I found Julian's reasoning about representing bitmap indexes as joins > very deep and interesting. As I understand this idea right, for table > A(a,b,c) indexes over columns, say, 'a' and 'b' can be represented as > relations sorted by 'a' and 'b': > > idx_a('rowId', 'a') [collation=[a]] > idx_b('rowId', 'b') [collation=[b]] > > Table A itself may be represented as a relation > > scan_A('rowId', 'a', 'b', 'c') [collation=[]] > > using this approach, bitmap indexes may be represented as set operations > (union, intersect) over idx_a and idx_b followed by joining with scan_A > using 'rowId'. > > Index-only scans (aka covered indexes), can be modeled as scans over > idx_a or idx_b without consequent join with the main table. > > @Julian, did I get your idea right? > > If one try to apply this approach directly, then the search space will > be quickly polluted by those index relations. But if the Lattice-like > framework from the Julian's mail is implemented, then it will be a great > feature, I guess. > > @Haisheng, BTW we are working on the indexed nested loop join right now > and it looks like it can be implemented without special > Join2IndexApplyRule: we'll use correlation variable from join condition > as filter condition. If there is no index there - it will be a simple > correlated nested loop join (or Apply). If index is present, the cost > will be lower because index lookup will speed up this join, and this > index scan will win. > > > > > There is another useful feature, that is IndexApply. > > > > The rule of Join2IndexApply will transform a logical join to logical index > > apply, then transform to physical indexed nested loop join. > > select * from foo join bar on foo.a = bar.f; > > There is an index on foo.a, but no index on bar.f. > > > Thank you! > > > -- > Kind Regards > Roman Kondakov > > > On 02.06.2020 19:55, Julian Hyde wrote: > > > Vladimir, > > > > I feel the same way. MVs are more powerful and general, and with some > > effort they could be just as efficient as other approaches. > > > > One problem that needs to be solved is the “registration problem”. If you > > have a lot of MVs they all have to be registered in the planner’s search > > space, otherwise the planner will not find them. > > > > I would like to see a benchmark for this. Say we have 1,000 MVs and we plan > > 10,000 queries that potentially use them. The goal is that the planning > > cost Is comparable to only having 1 MV. We could achieve the goal by > > re-using the MVs across planner instances (thereby amortizing the cost of > > creating them) and “index” the planner’s search space. > > > > The Lattice structure essentially does this for families for join-aggregate > > MVs (summary tables). We need something similar for project-sort MVs > > (indexes). > >
Re: [DISCUSS] Towards Calcite-Avatica 1.17.0
Hey everyone, Just a reminder that I'd like to make Avatica 1.17.0 rc0 available for voting soon. However, I am hoping we can review or merge a few open PRs for this release. If anyone has any spare cycles to review or merge any of the open PRs on Github, it'd be highly appreciated! Francis On 28/05/2020 8:41 am, Francis Chuang wrote: Now that Calcite 1.23.0 has been released I think it's a good time to try to get Avatica 1.17.0 released. There are currently 10 open PRs on Github [1]. Can community members please review and comment on them if there are some spare cycles available? If we can get some of those PRs merged, it would be really great and they can go into the 1.17.0 release. Francis [1] https://github.com/apache/calcite-avatica/pulls On 5/05/2020 9:00 am, Francis Chuang wrote: There hasn't been any commits to the repository since I last sent the message. Let's focus on Calcite 1.23.0 and revisit Avatica after the Calcite release. Francis On 20/04/2020 8:32 am, Francis Chuang wrote: Hey everyone, I am planning to make rc0 available for voting towards the end of April / start of May. How does this look in terms of timing? Francis On 12/04/2020 2:51 am, Stamatis Zampetakis wrote: Thanks again for taking the lead on this Francis! Personally, I am quite busy these days but will do my best to check 1-2 PRs. On Fri, Apr 10, 2020 at 2:57 PM Josh Elser wrote: Always a good idea. I'll add this to my list and see if I can help get any committed. It's been a while since I've looked at the list. On 4/8/20 7:59 PM, Francis Chuang wrote: The last avatica release was in December last year. From activity on our mailing lists and the Calcite repository, it feels like there's currently a lull and things aren't as active. Would this be a good opportunity to work on Avatica and push a release out? There are currently 10 open PRs for Avatica [1] and I think if they are reviewed and merged, it wold be possible to close a huge chunk of them. I am happy to be RM for this release, but will need help from the community to review and merge those PRs. Francis [1] https://github.com/apache/calcite-avatica/pulls
[jira] [Created] (CALCITE-4038) Refactor RexVisitor, RexBiVisitor, RelOptUtil.InputFinder
Julian Hyde created CALCITE-4038: Summary: Refactor RexVisitor, RexBiVisitor, RelOptUtil.InputFinder Key: CALCITE-4038 URL: https://issues.apache.org/jira/browse/CALCITE-4038 Project: Calcite Issue Type: Bug Reporter: Julian Hyde Various improvements to {{RexVisitor}}, {{RexBiVisitor}}, {{RelOptUtil.InputFinder}} classes. 1. Make mutable field {{RelOptUtil.InputFinder.inputBitSet}} private. The field still exists, public and deprecated, but it shadows a new private field. Rather than calling {{InputFinder.inputBitSet.build()}} you should now call {{InputFinder.build()}}. 2. In {{RexVisitor}} and {{RexBiVisitor}} add methods {{visitList}} and {{visitEach}} (the former returns a list, and the latter returns void). 3. Deprecate {{RexShuttle.apply}} in favor of {{visitList}}. 4. Add {{RexWindowBound.accept(RexBiVisitor)}}. 5. For {{RexBiVisitor}} add a method {{visitEachIndexed}}, the index passed as the 'payload' argument. 6. Add abstract implementations of {{RexBiVisitor}}: {{RexUnaryBiVisitor}} and {{RexBiVisitorImpl}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: Using indexes rather than table scans with Calcite
Vladimir, Haisheng, thank you for sharing your thoughts. And special thanks to Haisheng for awesome examples. I found Julian's reasoning about representing bitmap indexes as joins very deep and interesting. As I understand this idea right, for table A(a,b,c) indexes over columns, say, 'a' and 'b' can be represented as relations sorted by 'a' and 'b': idx_a('rowId', 'a') [collation=[a]] idx_b('rowId', 'b') [collation=[b]] Table A itself may be represented as a relation scan_A('rowId', 'a', 'b', 'c') [collation=[]] using this approach, bitmap indexes may be represented as set operations (union, intersect) over idx_a and idx_b followed by joining with scan_A using 'rowId'. Index-only scans (aka covered indexes), can be modeled as scans over idx_a or idx_b without consequent join with the main table. @Julian, did I get your idea right? If one try to apply this approach directly, then the search space will be quickly polluted by those index relations. But if the Lattice-like framework from the Julian's mail is implemented, then it will be a great feature, I guess. @Haisheng, BTW we are working on the indexed nested loop join right now and it looks like it can be implemented without special Join2IndexApplyRule: we'll use correlation variable from join condition as filter condition. If there is no index there - it will be a simple correlated nested loop join (or Apply). If index is present, the cost will be lower because index lookup will speed up this join, and this index scan will win. > > There is another useful feature, that is IndexApply. > > The rule of Join2IndexApply will transform a logical join to logical index > apply, then transform to physical indexed nested loop join. > select * from foo join bar on foo.a = bar.f; > There is an index on foo.a, but no index on bar.f. Thank you! -- Kind Regards Roman Kondakov On 02.06.2020 19:55, Julian Hyde wrote: > Vladimir, > > I feel the same way. MVs are more powerful and general, and with some effort > they could be just as efficient as other approaches. > > One problem that needs to be solved is the “registration problem”. If you > have a lot of MVs they all have to be registered in the planner’s search > space, otherwise the planner will not find them. > > I would like to see a benchmark for this. Say we have 1,000 MVs and we plan > 10,000 queries that potentially use them. The goal is that the planning cost > Is comparable to only having 1 MV. We could achieve the goal by re-using the > MVs across planner instances (thereby amortizing the cost of creating them) > and “index” the planner’s search space. > > The Lattice structure essentially does this for families for join-aggregate > MVs (summary tables). We need something similar for project-sort MVs > (indexes). > > Julian On 02.06.2020 10:05, Vladimir Ozerov wrote: > Hi Roman, > > To me, there is no principal difference between materialized views and > rule-based approaches - they are both "rules". In the materialized views > approach the rule is to create all alternative access paths > unconditionally, this rule is always fired first during the optimization > process. A rule-based approach gives you flexibility. You may do exactly > the same as with materialized views - just enumerate all access paths and > add them to search space. But at the same time, you may develop more > complicated things, such as pruning, bitmap index joins, etc. This is not > possible with materialized views. > > In this sense, the difference in complexity between materialized views > approach and rule-based approach with similar features (no pruning, etc) is > close to zero - in both cases you just iterate over existing indexes and > add them to search space. Rule-based approach becomes more complex when you > add more features to it. > > вт, 2 июн. 2020 г. в 00:37, Haisheng Yuan : > >> Hi Roman, >> >> There are some potential advantages. >> >> Let's still use this as example. >> SELECT * FROM foo WHERE a > 100 or b < 1000; >> Both column a and b have B-Tree indexes, note that it doesn't have to be >> bitmap index. >> >> Bitmap Table Scan on foo >>-> BitmapOr >> -> Bitmap Index Scan on foo_index_a >>Index Cond: (a > 100) >> -> Bitmap Index Scan on foo_index_b >>Index Cond: (b < 1000) >> >> filter a: >> filter b:--- >> >> The bitmap scan just needs to read the tuples of satisfying the condition >> once, because the BitmapOr will eliminate duplicate tuple id. Constructing >> to UNION ALL will read the intersecting part twice, and it will increase >> the search space a little bit even when there is no available index. >> >> If we change the condition from OR to AND: >> SELECT * FROM foo WHERE a > 100 and b < 1000; >> >> Will MV-based approach have to choose which index to use? >> >> I would prefer the following plan: >> Bitmap Table Scan on foo >>-> BitmapAnd >> ->
Re: Using indexes rather than table scans with Calcite
Vladimir, I feel the same way. MVs are more powerful and general, and with some effort they could be just as efficient as other approaches. One problem that needs to be solved is the “registration problem”. If you have a lot of MVs they all have to be registered in the planner’s search space, otherwise the planner will not find them. I would like to see a benchmark for this. Say we have 1,000 MVs and we plan 10,000 queries that potentially use them. The goal is that the planning cost Is comparable to only having 1 MV. We could achieve the goal by re-using the MVs across planner instances (thereby amortizing the cost of creating them) and “index” the planner’s search space. The Lattice structure essentially does this for families for join-aggregate MVs (summary tables). We need something similar for project-sort MVs (indexes). Julian > On Jun 2, 2020, at 12:05 AM, Vladimir Ozerov wrote: >
Re: Regarding CALCITE-3997
Hello Haisheng, thanks for bringing up this discussion. I was also part of the people who expressed concerns about CALCITE-3997 due to a backwards compatibility issue in a very specific scenario in a downstream project. This issue was satisfactory handled with a temporary workaround for the RC, and now this workaround needs to be removed for the next version (CALCITE-4032). I agree with your analysis, and I think my specific scenario should not prevent this new general approach to be fully implemented (including CalcMergerRule). Thus, I am ok with CALCITE-4032, I will implement my own rule on my side to solve my specific problem, under no circumstances this should be the responsibility of the Calcite library. Best regards, Ruben Le lun. 1 juin 2020 à 23:31, Stamatis Zampetakis a écrit : > Hello, > > Since I was among the people who raised some concerns about this during the > vote I would like to mention that it was mostly about the fact it came very > late in the release cycle. > It is better to avoid breaking changes in the RC phase since it may be > difficult for people to react or test it. > > I agree with Haisheng that Calcite code should follow the best practice so > that other projects can draw inspiration from it. Nevertheless, I would > like to mention again that Enumerable is not a convention that is used only > for demonstration purposes but it is also used in production systems so we > have to be careful about backward compatibility. Still nobody complaint so > far so it means that we are doing pretty good on this aspect :) > > The release brought some very cool features and I fully support the > direction on which we are going. I don't think there is a need to open a > new thread here since we can discuss under the respective jira(s) if > necessary. > > Best, > Stamatis > > > > > > > On Wed, May 27, 2020 at 5:07 AM Haisheng Yuan wrote: > > > Hi all, > > > > During vote of 1.23.0 rc1, there are some discussions about the solution > > of CALCITE-3997. > > > > Some may think it is a matter of best practice, but in many cases it is a > > matter of right or wrong. > > > > Here is the rationale. > > > > The original query in CALCITE-3997 is > > > > SELECT t1.k1, t2.k1 FROM t1, t2 > > ON t1.k1=t2.k1 > > WHERE t1.n1 + 1 = t2.n3 > > > > After FilterIntoJoin and MergeJoinRule, planner generates an alternative: > > MergeJoin on t1.k1=t2.k1 and t1.n1 + 1 = t2.n3 > > > > The mergejoin's collation is [k1]. > > > > This mergejoin matches JoinPushExpressionsRule, which pushes expression > in > > join condition, and generates a Project in child. > > Project t1.k1, t2.k1 > > -> MergeJoin on t1.k1=t2.k1 and t1.col = t2.n3 > >-> Project k1, n1+1 as col > > -> TableScan t1 > >-> TableScan t2 > > > > The new mergejoin now has 2 equi-conditions, but the collation is the > same > > with old mergejoin, still [k1], which is wrong. > > It should be updated to [k1,col] and [k1, n3]. Hence error. > > > > The similar issue already happened before, see CALCITE-3353. And it is > > very easy to trigger same error using different query even without > > FilterIntoJoin. If we change WHERE to AND in the query, > > SELECT t1.k1, t2.k1 FROM t1, t2 > > ON t1.k1=t2.k1 > > AND t1.n1 + 1 = t2.n3 > > > > or with the following query with join condition const reductions on, we > > can still see the issue. > > SELECT t1.k1, t2.k1 FROM t1, t2 > > ON t1.k1=t2.k1 > > AND (t1.n1 + 1 = t2.n3 or t1.n1 + 1 = t2.n3) > > > > Can we use the similar approach that in CALCITE-3353 to fix the issue? > > We can, but it just hide the issue. Doing that way can generate mergejoin > > with correct traitset, but the rule is still generating an invalid plan. > > Because the MergeJoin's new input is a LogicalProject, which is generated > > by RelBuilder. > > > > In Calcite, RelNode is a very special class. Unlike operators in Cascades > > or Volcano paper, where logical/physical operator and plan expression are > > separated, when constructing logical/physical expression tree, each > > operator doesn't require anything trait from its child. But RelNode in > > Calcite not only represents operator, but also plan expression, and can > > even represent a MEMO group with required traitset, which is RelSubset. > > When constructing a RelNode XXX, we need to specify the input RelNodes, > the > > input nodes' traitsets will become node XXX's requirement on those > inputs, > > even we are not aware that we are requesting it, which is extremely > > error-prone. > > > > In above case, the new MergeJoin's left input is a LogicalProject, which > > means the physical MergeJoin is requesting its left input to be logical > > operator with NONE convention, which has {inf} large cost. That means the > > newly generated MergeJoin is never implementable. > > > > What if we have a physical RelBuilder to generate the Project in the > rule? > > The consequence is worse. > > With logical RelBuilder, it just generates an invalid plan, we are
Re: Using indexes rather than table scans with Calcite
Hi Roman, To me, there is no principal difference between materialized views and rule-based approaches - they are both "rules". In the materialized views approach the rule is to create all alternative access paths unconditionally, this rule is always fired first during the optimization process. A rule-based approach gives you flexibility. You may do exactly the same as with materialized views - just enumerate all access paths and add them to search space. But at the same time, you may develop more complicated things, such as pruning, bitmap index joins, etc. This is not possible with materialized views. In this sense, the difference in complexity between materialized views approach and rule-based approach with similar features (no pruning, etc) is close to zero - in both cases you just iterate over existing indexes and add them to search space. Rule-based approach becomes more complex when you add more features to it. вт, 2 июн. 2020 г. в 00:37, Haisheng Yuan : > Hi Roman, > > There are some potential advantages. > > Let's still use this as example. > SELECT * FROM foo WHERE a > 100 or b < 1000; > Both column a and b have B-Tree indexes, note that it doesn't have to be > bitmap index. > > Bitmap Table Scan on foo >-> BitmapOr > -> Bitmap Index Scan on foo_index_a >Index Cond: (a > 100) > -> Bitmap Index Scan on foo_index_b >Index Cond: (b < 1000) > > filter a: > filter b:--- > > The bitmap scan just needs to read the tuples of satisfying the condition > once, because the BitmapOr will eliminate duplicate tuple id. Constructing > to UNION ALL will read the intersecting part twice, and it will increase > the search space a little bit even when there is no available index. > > If we change the condition from OR to AND: > SELECT * FROM foo WHERE a > 100 and b < 1000; > > Will MV-based approach have to choose which index to use? > > I would prefer the following plan: > Bitmap Table Scan on foo >-> BitmapAnd > -> Bitmap Index Scan on foo_index_a >Index Cond: (a > 100) > -> Bitmap Index Scan on foo_index_b >Index Cond: (b < 1000) > > filter a: > filter b: --- > > It would be extremely helpful when the intersecting range is very small, > because it just need to read the intersected data page only to fetch tuples. > > This can applies to arbitrary combinations of AND/OR conditions. > > This is an attractive feature for data warehousing customers. > > Another example: > SELECT max(a) from foo; > > The query can be transformed to > SELECT a from foo order by a desc limit 1; > > If there is a index on column 'a' but with ASC order. > The collation trait request will pass to TableScan, if there is an index > on the collation key. We just need to transform to an IndexScan with > reverse index scan direction, which is already possible in current Calcite. > > Similarly, the user query may be simple but the index on a is ASC order: > SELECT * from foo order by a desc; > > MV-based approach would have to register all the tablescan again with > reversed index scan direction. > > There is another useful feature, that is IndexApply. > > The rule of Join2IndexApply will transform a logical join to logical index > apply, then transform to physical indexed nested loop join. > select * from foo join bar on foo.a = bar.f; > There is an index on foo.a, but no index on bar.f. > > Nested Loop Inner Join > Join Filter: true > -> TableScan on bar > -> Index Scan using foo_index_a on foo >Index Cond: foo.a = bar.f > > MV-based approach would have to sort table bar if it want to use the index. > > But considering Ignite is a in-memory computing platform, MV-based > approach would be just good enough to serve its purpose. > > Looking at Drill's DbScanToIndexScanPrule.java, the way Apache Drill uses > to tackle with index is not wrong, but I would not recommend to follow > Drill's example to do it. > > Thanks, > Haisheng > > On 2020/06/01 19:15:32, Julian Hyde wrote: > > I'm pleased there are a variety of approaches. People should use > > whichever works for them and their use case. > > > > The so-called "rule-based" approach is definitely useful for > > OLTP-style queries where you are accessing a few rows and you need to > > plan quickly. > > > > Haisheng mentioned bitmap indexes a while back. Optimizing queries to > > use bitmap indexes is interesting because they can be combined in more > > ways than regular indexes. In this case, and many others, it is worth > > thinking about the problem at a high level. For example, combining two > > bitmap indexes can be modeled as a join, where the join keys are the > > record ids, and the record ids are sorted within each index value. > > Thinking at the high level can find plans that the rule-based approach > > will never find. > > > > Indexes-as-MVs,