[ https://issues.apache.org/jira/browse/CALCITE-5213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876825#comment-17876825 ]
Ian Bertolacci edited comment on CALCITE-5213 at 8/26/24 7:05 PM: ------------------------------------------------------------------ So I was able to quickly come up with some interesting data points. First and simplest: if you apply the rule onto the project nodes having the correlated subqueries, then you do get the expected tree; So I think thats the short-term route we will be going (doing some bottom-up correlated-subquery -> correlate node -> decorrelation processing of the tree) {code} Project($f0=[$3]) └── Correlate(correlation=[$cor0], joinType=[left], requiredColumns=[{2}]) ├── TableScan(table=[T3], Schema=[ID:Dimension, T3_measureColumn:Decimal(0, 38), T3_foreignKey:Dimension]) └── Aggregate(group=[{}], agg#0=[SUM($0)]) └── Project($f0=[+($1, $3)]) └── Correlate(correlation=[$cor1], joinType=[left], requiredColumns=[{2}]) ├── Filter(condition=[=($0, $cor0.T3_foreignKey)]) │ └── TableScan(table=[T2], Schema=[ID:Dimension, T2_measureColumn:Decimal(0, 38), T2_foreignKey:Dimension]) └── Aggregate(group=[{}], agg#0=[SUM($1)]) └── Filter(condition=[=($0, $cor1.T2_foreignKey)]) └── TableScan(table=[T1], Schema=[ID:Dimension, T1_measureColumn:Decimal(0, 38), T1_foreignKey:Dimension]) {code} Second: if you set the match limit to (or really any number < 2) it treats that subquery as uncorrelated, and puts a join-true in that position. {code} Project($f0=[$3]) └── Join(condition=[true], joinType=[left], variablesSet=[[$cor1, $cor0]]) ├── TableScan(table=[T3], Schema=[ID:Dimension, T3_measureColumn:Decimal(0, 38), T3_foreignKey:Dimension]) └── Aggregate(group=[{}], agg#0=[SUM($0)]) └── Project($f0=[+($1, $SCALAR_QUERY( │ Aggregate(group=[{}], agg#0=[SUM($1)]) │ └── Filter(condition=[=($0, $cor1.T2_foreignKey)]) │ └── TableScan(table=[T1], Schema=[ID:Dimension, T1_measureColumn:Decimal(0, 38), T1_foreignKey:Dimension]) │))]) └── Filter(condition=[=($0, $cor0.T3_foreignKey)]) └── TableScan(table=[T2], Schema=[ID:Dimension, T2_measureColumn:Decimal(0, 38), T2_foreignKey:Dimension]) {code} On a subsequent application of the rule, it would match in the inner subquery (now the most reachable subquery), recognizes it as a correlated subquery, and creates the correlate node. All of this confirms to me that these rules would benefit from some bottom-up conversion process, which is rather different how the planner operates today. was (Author: ian.bertolacci): So I was able to quickly come up with some interesting data points. First and simplest: if you apply the rule onto the project nodes having the correlated subqueries, then you do get the expected tree; So I think thats the short-term route we will be going (doing some bottom-up correlated-subquery -> correlate node -> decorrelation processing of the tree) {code} {code} Project($f0=[$3]) └── Correlate(correlation=[$cor0], joinType=[left], requiredColumns=[{2}]) ├── TableScan(table=[T3], Schema=[ID:Dimension, T3_measureColumn:Decimal(0, 38), T3_foreignKey:Dimension]) └── Aggregate(group=[{}], agg#0=[SUM($0)]) └── Project($f0=[+($1, $3)]) └── Correlate(correlation=[$cor1], joinType=[left], requiredColumns=[{2}]) ├── Filter(condition=[=($0, $cor0.T3_foreignKey)]) │ └── TableScan(table=[T2], Schema=[ID:Dimension, T2_measureColumn:Decimal(0, 38), T2_foreignKey:Dimension]) └── Aggregate(group=[{}], agg#0=[SUM($1)]) └── Filter(condition=[=($0, $cor1.T2_foreignKey)]) └── TableScan(table=[T1], Schema=[ID:Dimension, T1_measureColumn:Decimal(0, 38), T1_foreignKey:Dimension]) {code} Second: if you set the match limit to (or really any number < 2) it treats that subquery as uncorrelated, and puts a join-true in that position. {code} Project($f0=[$3]) └── Join(condition=[true], joinType=[left], variablesSet=[[$cor1, $cor0]]) ├── TableScan(table=[T3], Schema=[ID:Dimension, T3_measureColumn:Decimal(0, 38), T3_foreignKey:Dimension]) └── Aggregate(group=[{}], agg#0=[SUM($0)]) └── Project($f0=[+($1, $SCALAR_QUERY( │ Aggregate(group=[{}], agg#0=[SUM($1)]) │ └── Filter(condition=[=($0, $cor1.T2_foreignKey)]) │ └── TableScan(table=[T1], Schema=[ID:Dimension, T1_measureColumn:Decimal(0, 38), T1_foreignKey:Dimension]) │))]) └── Filter(condition=[=($0, $cor0.T3_foreignKey)]) └── TableScan(table=[T2], Schema=[ID:Dimension, T2_measureColumn:Decimal(0, 38), T2_foreignKey:Dimension]) {code} On a subsequent application of the rule, it would match in the inner subquery (now the most reachable subquery), recognizes it as a correlated subquery, and creates the correlate node. All of this confirms to me that these rules would benefit from some bottom-up conversion process, which is rather different how the planner operates today. > PROJECT_TO_SUBQUERY producing Incorrect tree from nested correlated > subqueries in projections with correlations in filters. > --------------------------------------------------------------------------------------------------------------------------- > > Key: CALCITE-5213 > URL: https://issues.apache.org/jira/browse/CALCITE-5213 > Project: Calcite > Issue Type: Bug > Affects Versions: 1.30.0, 1.36.0, 1.37.0 > Reporter: Ian Bertolacci > Priority: Major > > CoreRules.PROJECT_SUB_QUERY_TO_CORRELATE produces (what I believe to be) > incorrect trees from nested correlated subqueries in projections. > I'm hoping that I'm just doing something wrong and maybe someone will point > it out. > For example: > {code:sql} > SELECT (SELECT Sum(C202 > + (SELECT Sum(C101) > FROM T1 > WHERE T1.ID = T2.C201)) > FROM T2 > WHERE T2.ID = T3.C302) > FROM T3 {code} > The initial RelNode tree produced from this SQL is: > {code} > 232:LogicalProject(EXPR$0=[ > | $SCALAR_QUERY({ > | LogicalAggregate(group=[{}], EXPR$0=[SUM($0)]) > | └──LogicalProject($f0=[+($2, > | | $SCALAR_QUERY({ > | | LogicalAggregate(group=[{}], EXPR$0=[SUM($0)]) > | | └──LogicalProject(C101=[$1]) > | | └──LogicalFilter(condition=[=($0, $cor1.C201)]) > | | └──TableScan(table=[[QUERY, T1]], fields=[[ID, C101]]) > | | }))]) > | └──LogicalFilter(condition=[=($0, $cor0.C302)]) > | └──TableScan(table=[[QUERY, T2]], fields=[[ID, C201, C202, C203, > C204]]) > | })]) > └──223:TableScan(table=[[QUERY, T3]], fields=[[ID, C301, C302]]) > {code} > This looks ok so far, but it is important to notice the lack of variableSets > in the projection nodes (which would appear in the filter nodes having > correlated subqueries in their conditions). > After applying the CoreRules.PROJECT_SUB_QUERY_TO_CORRELATE rule via a HEP > program the resulting tree is: > {code} > 270:LogicalProject(EXPR$0=[$3]) > | // Unexpected Join instead of correlate binding $cor0 > └──268:LogicalJoin(condition=[true], joinType=[left]) > ├──246:TableScan(table=[[QUERY, T3]], fields=[[ID, C301, C302]]) > └──266:LogicalAggregate(group=[{}], EXPR$0=[SUM($0)]) > └──283:LogicalProject($f0=[+($2, $5)]) > | // Correlate node correctly binding $cor1 > └──281:LogicalCorrelate(correlation=[$cor1], joinType=[left], > requiredColumns=[{1}]) > | // $cor0 is not bound by any parent correlate node > ├──262:LogicalFilter(condition=[=($0, $cor0.C302)]) > | └──247:TableScan(table=[[QUERY, T2]], fields=[[ID, C201, C202, > C203, C204]]) > └──279:LogicalAggregate(group=[{}], EXPR$0=[SUM($0)]) > └──277:LogicalProject(C101=[$1]) > | // $cor1 bound by #281 > └──275:LogicalFilter(condition=[=($0, $cor1.C201)]) > └──249:TableScan(table=[[QUERY, T1]], fields=[[ID, > C101]]) > {code} > > Essentially, instead of a correlate node binding $cor0 there is a join (#268) > and there is nothing binding $cor0. > I would have expected this: > {code} > 270:LogicalProject(EXPR$0=[$3]) > | // Correlate binding $cor0 and requiring C302 from the LHS (#246) > └──299:LogicalCorrelate(correlation=[$cor0], joinType=[left], > requiredColumns=[{2}]) > ├──246:TableScan(table=[[QUERY, T3]], fields=[[ID, C301, C302]]) > └──266:LogicalAggregate(group=[{}], EXPR$0=[SUM($0)]) > └──283:LogicalProject($f0=[+($2, $5)]) > | // Correlate node correctly binding $cor1 > └──281:LogicalCorrelate(correlation=[$cor1], joinType=[left], > requiredColumns=[{1}]) > | // $cor0 bound by #299 > ├──262:LogicalFilter(condition=[=($0, $cor0.C302)]) > | └──247:TableScan(table=[[QUERY, T2]], fields=[[ID, C201, C202, > C203, C204]]) > └──279:LogicalAggregate(group=[{}], EXPR$0=[SUM($0)]) > └──277:LogicalProject(C101=[$1]) > | // $cor1 bound by #281 > └──275:LogicalFilter(condition=[=($0, $cor1.C201)]) > └──249:TableScan(table=[[QUERY, T1]], fields=[[ID, > C101]]) > {code} > Further, when adding CoreRules.JOIN_TO_CORRELATE in an attempt to convert the > erroneous join into a correlation we get: > {code} > 322:LogicalProject(EXPR$0=[$3]) > | // Correlate incorrectly binding $cor2 (which does not appear anywhere) > | // when it should be binding $cor0, and requiring no columns from the left > side > └──324:LogicalCorrelate(correlation=[$cor2], joinType=[left], > requiredColumns=[{}]) > ├──298:TableScan(table=[[QUERY, T3]], fields=[[ID, C301, C302]]) > └──318:LogicalAggregate(group=[{}], EXPR$0=[SUM($0)]) > └──337:LogicalProject($f0=[+($2, $5)]) > | // Correlate node correctly binding $cor1 > └──335:LogicalCorrelate(correlation=[$cor1], joinType=[left], > requiredColumns=[{1}]) > | // $cor0 is not bound by any parent correlate node > ├──314:LogicalFilter(condition=[=($0, $cor0.C302)]) > | └──299:TableScan(table=[[QUERY, T2]], fields=[[ID, C201, C202, > C203, C204]]) > └──333:LogicalAggregate(group=[{}], EXPR$0=[SUM($0)]) > └──331:LogicalProject(C101=[$1]) > | // $cor1 bound by #335 > └──329:LogicalFilter(condition=[=($0, $cor1.C201)]) > └──301:TableScan(table=[[QUERY, T1]], fields=[[ID, > C101]]) > {code} > which does replace the join with a correlate, but the correlate is incorrect; > binding an undefined correlation variable and requiring no columns. > Even further, (and this might be a separate issue all together), applying > RelDecorrelator as an additional program in the sequence produces a very > incorrect tree (both with and without the JOIN_TO_CORRELATE rule): > {code} > 1041:LogicalProject(EXPR$0=[$5], ID6=[$4]) > └──1039:LogicalJoin(condition=[true], joinType=[left]) > ├──1006:TableScan(table=[[QUERY, T3]], fields=[[ID, C301, C302, C303]]) > └──1037:LogicalAggregate(group=[{0}], EXPR$0=[SUM($1)]) > └──1035:LogicalProject(ID6=[$0], $f0=[+($2, $6)]) > └──1033:LogicalJoin(condition=[=($1, $5)], joinType=[left]) > ├──1024:LogicalFilter(condition=[=($0, $0)]) > | └──1007:TableScan(table=[[QUERY, T2]], > | fields=[[ID, C201, C202, C203, C204]]) > └──1031:LogicalAggregate(group=[{0}], EXPR$0=[SUM($1)]) > └──1029:LogicalProject(ID=[$0], C101=[$1]) > └──1027:LogicalFilter(condition=[=($0, $0)]) > └──1009:TableScan(table=[[QUERY, T1]], fields=[[ID, > C101]]) > {code} > This tree now has more projection expressions than the original query, which > is fully incorrect. -- This message was sent by Atlassian Jira (v8.20.10#820010)