[ https://issues.apache.org/jira/browse/CALCITE-6504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ian Bertolacci updated CALCITE-6504: ------------------------------------ Description: JOIN_SUB_QUERY_TO_CORRELATE/Join SubQueryRemoveRule produces incorrect trees when using a correlated sub-query as part of an equi-join condition of a join. For example: {code:sql} select * from T3 join T2 on (select max(id) from T1 where T3.C301 = T1.id) = T2.C201 {code} Has the initial tree: {code:none} 20:Project(ID=[$0], C301=[$1], C302=[$2], ID0=[$3], C201=[$4], C202=[$5], C203=[$6], C204=[$7]) └── 19:Join(condition=[=(SCALAR_SUBQUERY{ │ 18:Aggregate(group=[{}], EXPR$0=[MAX($0)]) │ └── 17:Project(ID=[$0]) │ └── 16:Filter(condition=[=($cor0.C301, $0)]) │ └── 15:TableScan(table=[T1], Schema=[ID:Dimension, C101:Decimal(0)]) │}, $4)], joinType=[inner]) ├── 13:TableScan(table=[T3], Schema=[ID:Dimension, C301:Dimension, C302:Dimension]) └── 14:TableScan(table=[T2], Schema=[ID:Dimension, C201:Dimension, C202:Decimal(0), C203:Decimal(0), C204:Decimal(2)]) {code} Using only the JOIN_SUB_QUERY_TO_CORRELATE rules in a hep program in planning, this is the resulting tree: {code:java} 25:Project(ID=[$0], C301=[$1], C302=[$2], ID0=[$3], C201=[$4], C202=[$5], C203=[$6], C204=[$7]) └── 41:Project(ID=[$0], C301=[$1], C302=[$2], ID0=[$3], C201=[$4], C202=[$5], C203=[$6], C204=[$7]) └── 39:Join(condition=[=($8, $4)], joinType=[inner]) ├── 13:TableScan(table=[T3], Schema=[ID:Dimension, C301:Dimension, C302:Dimension]) └── 37:Correlate(correlation=[$cor0], joinType=[left], requiredColumns=[{1}]) ├── 14:TableScan(table=[T2], Schema=[ID:Dimension, C201:Dimension, C202:Decimal(0), C203:Decimal(0), C204:Decimal(2)]) └── 35:Aggregate(group=[{}], EXPR$0=[MAX($0)]) └── 33:Project(ID=[$0]) └── 31:Filter(condition=[=($cor0.C301, $0)]) └── 15:TableScan(table=[T1], Schema=[ID:Dimension, C101:Decimal(0)]) {code} Notice that the original correlation expression is between T1 and T3, but the rule has created a correlate between T1 and T2. I would have expected this tree: {code:none} 25:Project(ID=[$0], C301=[$1], C302=[$2], ID0=[$3], C201=[$4], C202=[$5], C203=[$6], C204=[$7]) └── 41:Project(ID=[$0], C301=[$1], C302=[$2], ID0=[$3], C201=[$4], C202=[$5], C203=[$6], C204=[$7]) └── 39:Join(condition=[=($3, $4)], joinType=[inner]) ├── 37:Correlate(correlation=[$cor0], joinType=[left], requiredColumns=[{1}]) │ ├── 13:TableScan(table=[T3], Schema=[ID:Dimension, C301:Dimension, C302:Dimension]) │ └── 35:Aggregate(group=[{}], EXPR$0=[MAX($0)]) │ └── 33:Project(ID=[$0]) │ └── 31:Filter(condition=[=($cor0.C301, $0)]) │ └── 15:TableScan(table=[T1], Schema=[ID:Dimension, C101:Decimal(0)]) └── 14:TableScan(table=[T2], Schema=[ID:Dimension, C201:Dimension, C202:Decimal(0), C203:Decimal(0), C204:Decimal(2)]) {code} Additionally if you swap the sides of the join (making {{{}`T2 join T3`{}}}), you do get the correct association, but the correlate has an invalid `requiredColumn`: {code:none} 25:Project(ID=[$0], C201=[$1], C202=[$2], C203=[$3], C204=[$4], ID0=[$5], C301=[$6], C302=[$7]) └── 41:Project(ID=[$0], C201=[$1], C202=[$2], C203=[$3], C204=[$4], ID0=[$5], C301=[$6], C302=[$7]) └── 39:Join(condition=[=($8, $1)], joinType=[inner]) ├── 13:TableScan(table=[T2], Schema=[ID:Dimension, C201:Dimension, C202:Decimal(0), C203:Decimal(0), C204:Decimal(2)]) └── 37:Correlate(correlation=[$cor0], joinType=[left], requiredColumns=[{6}]) ├── 14:TableScan(table=[T3], Schema=[ID:Dimension, C301:Dimension, C302:Dimension]) └── 35:Aggregate(group=[{}], EXPR$0=[MAX($0)]) └── 33:Project(ID=[$0]) └── 31:Filter(condition=[=($cor0.C301, $0)]) └── 15:TableScan(table=[T1], Schema=[ID:Dimension, C101:Decimal(0)]) {code} Here `requiredColumn` should be 1 (pointing to `C301`) but is actually 6, which is where `C301` would be after the join {{`T2 join T3`}} was: JOIN_SUB_QUERY_TO_CORRELATE/Join SubQueryRemoveRule produces incorrect trees when using a correlated sub-query as part of an equi-join condition of a join. For example: {code:sql} select * from T3 join T2 on (select max(id) from T1 where T3.C301 = T1.id) = T2.C201 {code} Has the initial tree: {code:none} 20:Project(ID=[$0], C301=[$1], C302=[$2], ID0=[$3], C201=[$4], C202=[$5], C203=[$6], C204=[$7]) └── 19:Join(condition=[=(SCALAR_SUBQUERY{ │ 18:Aggregate(group=[{}], EXPR$0=[MAX($0)]) │ └── 17:Project(ID=[$0]) │ └── 16:Filter(condition=[=($cor0.C301, $0)]) │ └── 15:TableScan(table=[T1], Schema=[ID:Dimension, C101:Decimal(0)]) │}, $4)], joinType=[inner]) ├── 13:TableScan(table=[T3], Schema=[ID:Dimension, C301:Dimension, C302:Dimension]) └── 14:TableScan(table=[T2], Schema=[ID:Dimension, C201:Dimension, C202:Decimal(0), C203:Decimal(0), C204:Decimal(2)]) {code} Using only the JOIN_SUB_QUERY_TO_CORRELATE rules in a hep program in planning, this is the resulting tree: {code} 25:Project(ID=[$0], C301=[$1], C302=[$2], ID0=[$3], C201=[$4], C202=[$5], C203=[$6], C204=[$7]) └── 41:Project(ID=[$0], C301=[$1], C302=[$2], ID0=[$3], C201=[$4], C202=[$5], C203=[$6], C204=[$7]) └── 39:Join(condition=[=($8, $4)], joinType=[inner]) ├── 13:TableScan(table=[T3], Schema=[ID:Dimension, C301:Dimension, C302:Dimension]) └── 37:Correlate(correlation=[$cor0], joinType=[left], requiredColumns=[{1}]) ├── 14:TableScan(table=[T2], Schema=[ID:Dimension, C201:Dimension, C202:Decimal(0), C203:Decimal(0), C204:Decimal(2)]) └── 35:Aggregate(group=[{}], EXPR$0=[MAX($0)]) └── 33:Project(ID=[$0]) └── 31:Filter(condition=[=($cor0.C301, $0)]) └── 15:TableScan(table=[T1], Schema=[ID:Dimension, C101:Decimal(0)]) {code} Notice that the original correlation expression is between T1 and T3, but the rule as created a correlate between T1 and T2. I would have expected this tree: {code:none} 25:Project(ID=[$0], C301=[$1], C302=[$2], ID0=[$3], C201=[$4], C202=[$5], C203=[$6], C204=[$7]) └── 41:Project(ID=[$0], C301=[$1], C302=[$2], ID0=[$3], C201=[$4], C202=[$5], C203=[$6], C204=[$7]) └── 39:Join(condition=[=($3, $4)], joinType=[inner]) ├── 37:Correlate(correlation=[$cor0], joinType=[left], requiredColumns=[{1}]) │ ├── 13:TableScan(table=[T3], Schema=[ID:Dimension, C301:Dimension, C302:Dimension]) │ └── 35:Aggregate(group=[{}], EXPR$0=[MAX($0)]) │ └── 33:Project(ID=[$0]) │ └── 31:Filter(condition=[=($cor0.C301, $0)]) │ └── 15:TableScan(table=[T1], Schema=[ID:Dimension, C101:Decimal(0)]) └── 14:TableScan(table=[T2], Schema=[ID:Dimension, C201:Dimension, C202:Decimal(0), C203:Decimal(0), C204:Decimal(2)]) {code} If you swap the sides of the join (making {{`T2 join T3`}}), you do get the correct association, but the correlate has an invalid `requiredColumn`: {code:none} 25:Project(ID=[$0], C201=[$1], C202=[$2], C203=[$3], C204=[$4], ID0=[$5], C301=[$6], C302=[$7]) └── 41:Project(ID=[$0], C201=[$1], C202=[$2], C203=[$3], C204=[$4], ID0=[$5], C301=[$6], C302=[$7]) └── 39:Join(condition=[=($8, $1)], joinType=[inner]) ├── 13:TableScan(table=[T2], Schema=[ID:Dimension, C201:Dimension, C202:Decimal(0), C203:Decimal(0), C204:Decimal(2)]) └── 37:Correlate(correlation=[$cor0], joinType=[left], requiredColumns=[{6}]) ├── 14:TableScan(table=[T3], Schema=[ID:Dimension, C301:Dimension, C302:Dimension]) └── 35:Aggregate(group=[{}], EXPR$0=[MAX($0)]) └── 33:Project(ID=[$0]) └── 31:Filter(condition=[=($cor0.C301, $0)]) └── 15:TableScan(table=[T1], Schema=[ID:Dimension, C101:Decimal(0)]) {code} Here `requiredColumn` should be 1 (pointing to `C301`) but is actually 6, which is where `C301` would be after the join {{`T2 join T3`}} > JOIN_SUB_QUERY_TO_CORRELATE/Join SubQueryRemoveRule produces incorrect tree > when using correlated sub-query in on clause of equi-join > ------------------------------------------------------------------------------------------------------------------------------------- > > Key: CALCITE-6504 > URL: https://issues.apache.org/jira/browse/CALCITE-6504 > Project: Calcite > Issue Type: Bug > Reporter: Ian Bertolacci > Priority: Major > > JOIN_SUB_QUERY_TO_CORRELATE/Join SubQueryRemoveRule produces incorrect trees > when using a correlated sub-query as part of an equi-join condition of a join. > For example: > {code:sql} > select * from T3 join T2 on (select max(id) from T1 where T3.C301 = T1.id) = > T2.C201 > {code} > Has the initial tree: > {code:none} > 20:Project(ID=[$0], C301=[$1], C302=[$2], ID0=[$3], C201=[$4], C202=[$5], > C203=[$6], C204=[$7]) > └── 19:Join(condition=[=(SCALAR_SUBQUERY{ > │ 18:Aggregate(group=[{}], EXPR$0=[MAX($0)]) > │ └── 17:Project(ID=[$0]) > │ └── 16:Filter(condition=[=($cor0.C301, $0)]) > │ └── 15:TableScan(table=[T1], Schema=[ID:Dimension, > C101:Decimal(0)]) > │}, $4)], joinType=[inner]) > ├── 13:TableScan(table=[T3], Schema=[ID:Dimension, C301:Dimension, > C302:Dimension]) > └── 14:TableScan(table=[T2], Schema=[ID:Dimension, C201:Dimension, > C202:Decimal(0), C203:Decimal(0), C204:Decimal(2)]) > {code} > Using only the JOIN_SUB_QUERY_TO_CORRELATE rules in a hep program in > planning, this is the resulting tree: > {code:java} > 25:Project(ID=[$0], C301=[$1], C302=[$2], ID0=[$3], C201=[$4], C202=[$5], > C203=[$6], C204=[$7]) > └── 41:Project(ID=[$0], C301=[$1], C302=[$2], ID0=[$3], C201=[$4], C202=[$5], > C203=[$6], C204=[$7]) > └── 39:Join(condition=[=($8, $4)], joinType=[inner]) > ├── 13:TableScan(table=[T3], Schema=[ID:Dimension, C301:Dimension, > C302:Dimension]) > └── 37:Correlate(correlation=[$cor0], joinType=[left], > requiredColumns=[{1}]) > ├── 14:TableScan(table=[T2], Schema=[ID:Dimension, > C201:Dimension, C202:Decimal(0), C203:Decimal(0), C204:Decimal(2)]) > └── 35:Aggregate(group=[{}], EXPR$0=[MAX($0)]) > └── 33:Project(ID=[$0]) > └── 31:Filter(condition=[=($cor0.C301, $0)]) > └── 15:TableScan(table=[T1], Schema=[ID:Dimension, > C101:Decimal(0)]) > {code} > Notice that the original correlation expression is between T1 and T3, but the > rule has created a correlate between T1 and T2. > I would have expected this tree: > {code:none} > 25:Project(ID=[$0], C301=[$1], C302=[$2], ID0=[$3], C201=[$4], C202=[$5], > C203=[$6], C204=[$7]) > └── 41:Project(ID=[$0], C301=[$1], C302=[$2], ID0=[$3], C201=[$4], C202=[$5], > C203=[$6], C204=[$7]) > └── 39:Join(condition=[=($3, $4)], joinType=[inner]) > ├── 37:Correlate(correlation=[$cor0], joinType=[left], > requiredColumns=[{1}]) > │ ├── 13:TableScan(table=[T3], Schema=[ID:Dimension, > C301:Dimension, C302:Dimension]) > │ └── 35:Aggregate(group=[{}], EXPR$0=[MAX($0)]) > │ └── 33:Project(ID=[$0]) > │ └── 31:Filter(condition=[=($cor0.C301, $0)]) > │ └── 15:TableScan(table=[T1], Schema=[ID:Dimension, > C101:Decimal(0)]) > └── 14:TableScan(table=[T2], Schema=[ID:Dimension, C201:Dimension, > C202:Decimal(0), C203:Decimal(0), C204:Decimal(2)]) > {code} > > Additionally if you swap the sides of the join (making {{{}`T2 join T3`{}}}), > you do get the correct association, but the correlate has an invalid > `requiredColumn`: > {code:none} > 25:Project(ID=[$0], C201=[$1], C202=[$2], C203=[$3], C204=[$4], ID0=[$5], > C301=[$6], C302=[$7]) > └── 41:Project(ID=[$0], C201=[$1], C202=[$2], C203=[$3], C204=[$4], ID0=[$5], > C301=[$6], C302=[$7]) > └── 39:Join(condition=[=($8, $1)], joinType=[inner]) > ├── 13:TableScan(table=[T2], Schema=[ID:Dimension, C201:Dimension, > C202:Decimal(0), C203:Decimal(0), C204:Decimal(2)]) > └── 37:Correlate(correlation=[$cor0], joinType=[left], > requiredColumns=[{6}]) > ├── 14:TableScan(table=[T3], Schema=[ID:Dimension, > C301:Dimension, C302:Dimension]) > └── 35:Aggregate(group=[{}], EXPR$0=[MAX($0)]) > └── 33:Project(ID=[$0]) > └── 31:Filter(condition=[=($cor0.C301, $0)]) > └── 15:TableScan(table=[T1], Schema=[ID:Dimension, > C101:Decimal(0)]) > {code} > Here `requiredColumn` should be 1 (pointing to `C301`) but is actually 6, > which is where `C301` would be after the join {{`T2 join T3`}} -- This message was sent by Atlassian Jira (v8.20.10#820010)