[ 
https://issues.apache.org/jira/browse/CALCITE-5213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Bertolacci updated CALCITE-5213:
------------------------------------
    Description: 
CoreRules.PROJECT_SUB_QUERY_TO_CORRELATE produces (what I believe to be) 
incorrect trees from nested correlated subqueries in projections.

I'm hoping that I'm just doing something wrong and maybe someone will point it 
out.

For example:
{code:sql}
SELECT (SELECT Sum(C202
                   + (SELECT Sum(C101)
                      FROM   T1
                      WHERE  T1.ID = T2.C201))
        FROM   T2
        WHERE  T2.ID = T3.C302)
FROM   T3  {code}

The initial RelNode tree produced from this SQL is:
{code}
232:LogicalProject(EXPR$0=[
|  $SCALAR_QUERY({
|    LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
|    └──LogicalProject($f0=[+($2, 
|       |  $SCALAR_QUERY({
|       |    LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
|       |    └──LogicalProject(C101=[$1])
|       |       └──LogicalFilter(condition=[=($0, $cor1.C201)])
|       |          └──TableScan(table=[[QUERY, T1]], fields=[[ID, C101]])
|       |  }))])
|       └──LogicalFilter(condition=[=($0, $cor0.C302)])
|          └──TableScan(table=[[QUERY, T2]], fields=[[ID, C201, C202, C203, 
C204]])
|  })])
└──223:TableScan(table=[[QUERY, T3]], fields=[[ID, C301, C302]])
{code}
This looks ok so far, but it is important to notice the lack of variableSets in 
the projection nodes (which would appear in the filter nodes having correlated 
subqueries in their conditions).

After applying the CoreRules.PROJECT_SUB_QUERY_TO_CORRELATE rule via a HEP 
program the resulting tree is:
{code}
270:LogicalProject(EXPR$0=[$3])
|   // Unexpected Join instead of correlate binding $cor0
└──268:LogicalJoin(condition=[true], joinType=[left])
   ├──246:TableScan(table=[[QUERY, T3]], fields=[[ID, C301, C302]])
   └──266:LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
      └──283:LogicalProject($f0=[+($2, $5)])
         |  // Correlate node correctly binding $cor1
         └──281:LogicalCorrelate(correlation=[$cor1], joinType=[left], 
requiredColumns=[{1}])
            |  // $cor0 is not bound by any parent correlate node
            ├──262:LogicalFilter(condition=[=($0, $cor0.C302)])
            |  └──247:TableScan(table=[[QUERY, T2]], fields=[[ID, C201, C202, 
C203, C204]])
            └──279:LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
               └──277:LogicalProject(C101=[$1])
                  |  // $cor1 bound by #281
                  └──275:LogicalFilter(condition=[=($0, $cor1.C201)])
                     └──249:TableScan(table=[[QUERY, T1]], fields=[[ID, C101]])
{code}
 
Essentially, instead of a correlate node binding $cor0 there is a join (#268) 
and there is nothing binding $cor0.

I would have expected this:
{code}
270:LogicalProject(EXPR$0=[$3])
|  // Correlate binding $cor0 and requiring C302 from the LHS (#246)
└──299:LogicalCorrelate(correlation=[$cor0], joinType=[left], 
requiredColumns=[{2}])
   ├──246:TableScan(table=[[QUERY, T3]], fields=[[ID, C301, C302]])
   └──266:LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
      └──283:LogicalProject($f0=[+($2, $5)])
         |  // Correlate node correctly binding $cor1
         └──281:LogicalCorrelate(correlation=[$cor1], joinType=[left], 
requiredColumns=[{1}])
            |  // $cor0 bound by #299
            ├──262:LogicalFilter(condition=[=($0, $cor0.C302)])
            |  └──247:TableScan(table=[[QUERY, T2]], fields=[[ID, C201, C202, 
C203, C204]])
            └──279:LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
               └──277:LogicalProject(C101=[$1])
                  |  // $cor1 bound by #281
                  └──275:LogicalFilter(condition=[=($0, $cor1.C201)])
                     └──249:TableScan(table=[[QUERY, T1]], fields=[[ID, C101]])
{code}

Further, when adding CoreRules.JOIN_TO_CORRELATE in an attempt to convert the 
erroneous join into a correlation we get:
{code}
322:LogicalProject(EXPR$0=[$3])
|  // Correlate incorrectly binding $cor2 (which does not appear anywhere)
|  // when it should be binding $cor0, and requiring no columns from the left 
side
└──324:LogicalCorrelate(correlation=[$cor2], joinType=[left], 
requiredColumns=[{}])
   ├──298:TableScan(table=[[QUERY, T3]], fields=[[ID, C301, C302]])
   └──318:LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
      └──337:LogicalProject($f0=[+($2, $5)])
         |  // Correlate node correctly binding $cor1
         └──335:LogicalCorrelate(correlation=[$cor1], joinType=[left], 
requiredColumns=[{1}])
            |  // $cor0 is not bound by any parent correlate node
            ├──314:LogicalFilter(condition=[=($0, $cor0.C302)])
            |  └──299:TableScan(table=[[QUERY, T2]], fields=[[ID, C201, C202, 
C203, C204]])
            └──333:LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
               └──331:LogicalProject(C101=[$1])
                  |  // $cor1 bound by #335
                  └──329:LogicalFilter(condition=[=($0, $cor1.C201)])
                     └──301:TableScan(table=[[QUERY, T1]], fields=[[ID, C101]])
{code}

which does replace the join with a correlate, but the correlate is incorrect; 
binding an undefined correlation variable and requiring no columns.

Even further, (and this might be a separate issue all together), applying 
RelDecorrelator as an additional program in the sequence produces a very 
incorrect tree (both with and without the JOIN_TO_CORRELATE rule):
{code}
1041:LogicalProject(EXPR$0=[$5], ID6=[$4])
└──1039:LogicalJoin(condition=[true], joinType=[left])
   ├──1006:QueryTableScan(table=[[QUERY, T3]], fields=[[ID, C301, C302, C303]])
   └──1037:LogicalAggregate(group=[{0}], EXPR$0=[SUM($1)])
      └──1035:LogicalProject(ID6=[$0], $f0=[+($2, $6)])
         └──1033:LogicalJoin(condition=[=($1, $5)], joinType=[left])
            ├──1024:LogicalFilter(condition=[=($0, $0)])
            |  └──1007:QueryTableScan(table=[[QUERY, T2]], fields=[[ID, C201, 
C202, C203, C204]])
            └──1031:LogicalAggregate(group=[{0}], EXPR$0=[SUM($1)])
               └──1029:LogicalProject(ID=[$0], C101=[$1])
                  └──1027:LogicalFilter(condition=[=($0, $0)])
                     └──1009:QueryTableScan(table=[[QUERY, T1]], fields=[[ID, 
C101]])
{code}

This tree now has more projection expressions than the original query, which is 
fully incorrect.



  was:
One of the  "standard" subquery to correlate rules 
(CoreRules.PROJECT_SUB_QUERY_TO_CORRELATE, 
CoreRules.FILTER_SUB_QUERY_TO_CORRELATE and 
CoreRules.JOIN_SUB_QUERY_TO_CORRELATE), the resulting tree is:
{code}produces (what I believe to be) incorrect trees from nested correlated 
subqueries in projections.

I'm hoping that I'm just doing something wrong and maybe someone will point it 
out.

For example:
{code:sql}
SELECT (SELECT Sum(C202
                   + (SELECT Sum(C101)
                      FROM   T1
                      WHERE  T1.ID = T2.C201))
        FROM   T2
        WHERE  T2.ID = T3.C302)
FROM   T3  {code}

The initial RelNode tree produced from this SQL is:
{code}
232:LogicalProject(EXPR$0=[
|  $SCALAR_QUERY({
|    LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
|    └──LogicalProject($f0=[+($2, 
|       |  $SCALAR_QUERY({
|       |    LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
|       |    └──LogicalProject(C101=[$1])
|       |       └──LogicalFilter(condition=[=($0, $cor1.C201)])
|       |          └──TableScan(table=[[QUERY, T1]], fields=[[ID, C101]])
|       |  }))])
|       └──LogicalFilter(condition=[=($0, $cor0.C302)])
|          └──TableScan(table=[[QUERY, T2]], fields=[[ID, C201, C202, C203, 
C204]])
|  })])
└──223:TableScan(table=[[QUERY, T3]], fields=[[ID, C301, C302]])
{code}
This looks ok so far, but it is important to notice the lack of variableSets in 
the projection nodes (which would appear in the filter nodes having correlated 
subqueries in their conditions).

After applying the CoreRules.PROJECT_SUB_QUERY_TO_CORRELATE rule via a HEP 
program the resulting tree is:
{code}
270:LogicalProject(EXPR$0=[$3])
|   // Unexpected Join instead of correlate binding $cor0
└──268:LogicalJoin(condition=[true], joinType=[left])
   ├──246:TableScan(table=[[QUERY, T3]], fields=[[ID, C301, C302]])
   └──266:LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
      └──283:LogicalProject($f0=[+($2, $5)])
         |  // Correlate node correctly binding $cor1
         └──281:LogicalCorrelate(correlation=[$cor1], joinType=[left], 
requiredColumns=[{1}])
            |  // $cor0 is not bound by any parent correlate node
            ├──262:LogicalFilter(condition=[=($0, $cor0.C302)])
            |  └──247:TableScan(table=[[QUERY, T2]], fields=[[ID, C201, C202, 
C203, C204]])
            └──279:LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
               └──277:LogicalProject(C101=[$1])
                  |  // $cor1 bound by #281
                  └──275:LogicalFilter(condition=[=($0, $cor1.C201)])
                     └──249:TableScan(table=[[QUERY, T1]], fields=[[ID, C101]])
{code}
 
Essentially, instead of a correlate node binding $cor0 there is a join (#268) 
and there is nothing binding $cor0.

I would have expected this:
{code}
270:LogicalProject(EXPR$0=[$3])
|  // Correlate binding $cor0 and requiring C302 from the LHS (#246)
└──299:LogicalCorrelate(correlation=[$cor0], joinType=[left], 
requiredColumns=[{2}])
   ├──246:TableScan(table=[[QUERY, T3]], fields=[[ID, C301, C302]])
   └──266:LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
      └──283:LogicalProject($f0=[+($2, $5)])
         |  // Correlate node correctly binding $cor1
         └──281:LogicalCorrelate(correlation=[$cor1], joinType=[left], 
requiredColumns=[{1}])
            |  // $cor0 bound by #299
            ├──262:LogicalFilter(condition=[=($0, $cor0.C302)])
            |  └──247:TableScan(table=[[QUERY, T2]], fields=[[ID, C201, C202, 
C203, C204]])
            └──279:LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
               └──277:LogicalProject(C101=[$1])
                  |  // $cor1 bound by #281
                  └──275:LogicalFilter(condition=[=($0, $cor1.C201)])
                     └──249:TableScan(table=[[QUERY, T1]], fields=[[ID, C101]])
{code}

Further, when adding CoreRules.JOIN_TO_CORRELATE in an attempt to convert the 
erroneous join into a correlation we get:
{code}
322:LogicalProject(EXPR$0=[$3])
|  // Correlate incorrectly binding $cor2 (which does not appear anywhere)
|  // when it should be binding $cor0, and requiring no columns from the left 
side
└──324:LogicalCorrelate(correlation=[$cor2], joinType=[left], 
requiredColumns=[{}])
   ├──298:TableScan(table=[[QUERY, T3]], fields=[[ID, C301, C302]])
   └──318:LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
      └──337:LogicalProject($f0=[+($2, $5)])
         |  // Correlate node correctly binding $cor1
         └──335:LogicalCorrelate(correlation=[$cor1], joinType=[left], 
requiredColumns=[{1}])
            |  // $cor0 is not bound by any parent correlate node
            ├──314:LogicalFilter(condition=[=($0, $cor0.C302)])
            |  └──299:TableScan(table=[[QUERY, T2]], fields=[[ID, C201, C202, 
C203, C204]])
            └──333:LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
               └──331:LogicalProject(C101=[$1])
                  |  // $cor1 bound by #335
                  └──329:LogicalFilter(condition=[=($0, $cor1.C201)])
                     └──301:TableScan(table=[[QUERY, T1]], fields=[[ID, C101]])
{code}

which does replace the join with a correlate, but the correlate is incorrect; 
binding an undefined correlation variable and requiring no columns.

Even further, (and this might be a separate issue all together), applying 
RelDecorrelator as an additional program in the sequence produces a very 
incorrect tree (both with and without the JOIN_TO_CORRELATE rule):
{code}
1041:LogicalProject(EXPR$0=[$5], ID6=[$4])
└──1039:LogicalJoin(condition=[true], joinType=[left])
   ├──1006:QueryTableScan(table=[[QUERY, T3]], fields=[[ID, C301, C302, C303]])
   └──1037:LogicalAggregate(group=[{0}], EXPR$0=[SUM($1)])
      └──1035:LogicalProject(ID6=[$0], $f0=[+($2, $6)])
         └──1033:LogicalJoin(condition=[=($1, $5)], joinType=[left])
            ├──1024:LogicalFilter(condition=[=($0, $0)])
            |  └──1007:QueryTableScan(table=[[QUERY, T2]], fields=[[ID, C201, 
C202, C203, C204]])
            └──1031:LogicalAggregate(group=[{0}], EXPR$0=[SUM($1)])
               └──1029:LogicalProject(ID=[$0], C101=[$1])
                  └──1027:LogicalFilter(condition=[=($0, $0)])
                     └──1009:QueryTableScan(table=[[QUERY, T1]], fields=[[ID, 
C101]])
{code}

This tree now has more projection expressions than the original query, which is 
fully incorrect.




> PROJECT_TO_SUBQUERY producing Incorrect tree from nested correlated 
> subqueries in projections with correlations in filters.
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CALCITE-5213
>                 URL: https://issues.apache.org/jira/browse/CALCITE-5213
>             Project: Calcite
>          Issue Type: Bug
>    Affects Versions: 1.30.0
>            Reporter: Ian Bertolacci
>            Priority: Major
>
> CoreRules.PROJECT_SUB_QUERY_TO_CORRELATE produces (what I believe to be) 
> incorrect trees from nested correlated subqueries in projections.
> I'm hoping that I'm just doing something wrong and maybe someone will point 
> it out.
> For example:
> {code:sql}
> SELECT (SELECT Sum(C202
>                    + (SELECT Sum(C101)
>                       FROM   T1
>                       WHERE  T1.ID = T2.C201))
>         FROM   T2
>         WHERE  T2.ID = T3.C302)
> FROM   T3  {code}
> The initial RelNode tree produced from this SQL is:
> {code}
> 232:LogicalProject(EXPR$0=[
> |  $SCALAR_QUERY({
> |    LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
> |    └──LogicalProject($f0=[+($2, 
> |       |  $SCALAR_QUERY({
> |       |    LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
> |       |    └──LogicalProject(C101=[$1])
> |       |       └──LogicalFilter(condition=[=($0, $cor1.C201)])
> |       |          └──TableScan(table=[[QUERY, T1]], fields=[[ID, C101]])
> |       |  }))])
> |       └──LogicalFilter(condition=[=($0, $cor0.C302)])
> |          └──TableScan(table=[[QUERY, T2]], fields=[[ID, C201, C202, C203, 
> C204]])
> |  })])
> └──223:TableScan(table=[[QUERY, T3]], fields=[[ID, C301, C302]])
> {code}
> This looks ok so far, but it is important to notice the lack of variableSets 
> in the projection nodes (which would appear in the filter nodes having 
> correlated subqueries in their conditions).
> After applying the CoreRules.PROJECT_SUB_QUERY_TO_CORRELATE rule via a HEP 
> program the resulting tree is:
> {code}
> 270:LogicalProject(EXPR$0=[$3])
> |   // Unexpected Join instead of correlate binding $cor0
> └──268:LogicalJoin(condition=[true], joinType=[left])
>    ├──246:TableScan(table=[[QUERY, T3]], fields=[[ID, C301, C302]])
>    └──266:LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
>       └──283:LogicalProject($f0=[+($2, $5)])
>          |  // Correlate node correctly binding $cor1
>          └──281:LogicalCorrelate(correlation=[$cor1], joinType=[left], 
> requiredColumns=[{1}])
>             |  // $cor0 is not bound by any parent correlate node
>             ├──262:LogicalFilter(condition=[=($0, $cor0.C302)])
>             |  └──247:TableScan(table=[[QUERY, T2]], fields=[[ID, C201, C202, 
> C203, C204]])
>             └──279:LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
>                └──277:LogicalProject(C101=[$1])
>                   |  // $cor1 bound by #281
>                   └──275:LogicalFilter(condition=[=($0, $cor1.C201)])
>                      └──249:TableScan(table=[[QUERY, T1]], fields=[[ID, 
> C101]])
> {code}
>  
> Essentially, instead of a correlate node binding $cor0 there is a join (#268) 
> and there is nothing binding $cor0.
> I would have expected this:
> {code}
> 270:LogicalProject(EXPR$0=[$3])
> |  // Correlate binding $cor0 and requiring C302 from the LHS (#246)
> └──299:LogicalCorrelate(correlation=[$cor0], joinType=[left], 
> requiredColumns=[{2}])
>    ├──246:TableScan(table=[[QUERY, T3]], fields=[[ID, C301, C302]])
>    └──266:LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
>       └──283:LogicalProject($f0=[+($2, $5)])
>          |  // Correlate node correctly binding $cor1
>          └──281:LogicalCorrelate(correlation=[$cor1], joinType=[left], 
> requiredColumns=[{1}])
>             |  // $cor0 bound by #299
>             ├──262:LogicalFilter(condition=[=($0, $cor0.C302)])
>             |  └──247:TableScan(table=[[QUERY, T2]], fields=[[ID, C201, C202, 
> C203, C204]])
>             └──279:LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
>                └──277:LogicalProject(C101=[$1])
>                   |  // $cor1 bound by #281
>                   └──275:LogicalFilter(condition=[=($0, $cor1.C201)])
>                      └──249:TableScan(table=[[QUERY, T1]], fields=[[ID, 
> C101]])
> {code}
> Further, when adding CoreRules.JOIN_TO_CORRELATE in an attempt to convert the 
> erroneous join into a correlation we get:
> {code}
> 322:LogicalProject(EXPR$0=[$3])
> |  // Correlate incorrectly binding $cor2 (which does not appear anywhere)
> |  // when it should be binding $cor0, and requiring no columns from the left 
> side
> └──324:LogicalCorrelate(correlation=[$cor2], joinType=[left], 
> requiredColumns=[{}])
>    ├──298:TableScan(table=[[QUERY, T3]], fields=[[ID, C301, C302]])
>    └──318:LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
>       └──337:LogicalProject($f0=[+($2, $5)])
>          |  // Correlate node correctly binding $cor1
>          └──335:LogicalCorrelate(correlation=[$cor1], joinType=[left], 
> requiredColumns=[{1}])
>             |  // $cor0 is not bound by any parent correlate node
>             ├──314:LogicalFilter(condition=[=($0, $cor0.C302)])
>             |  └──299:TableScan(table=[[QUERY, T2]], fields=[[ID, C201, C202, 
> C203, C204]])
>             └──333:LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
>                └──331:LogicalProject(C101=[$1])
>                   |  // $cor1 bound by #335
>                   └──329:LogicalFilter(condition=[=($0, $cor1.C201)])
>                      └──301:TableScan(table=[[QUERY, T1]], fields=[[ID, 
> C101]])
> {code}
> which does replace the join with a correlate, but the correlate is incorrect; 
> binding an undefined correlation variable and requiring no columns.
> Even further, (and this might be a separate issue all together), applying 
> RelDecorrelator as an additional program in the sequence produces a very 
> incorrect tree (both with and without the JOIN_TO_CORRELATE rule):
> {code}
> 1041:LogicalProject(EXPR$0=[$5], ID6=[$4])
> └──1039:LogicalJoin(condition=[true], joinType=[left])
>    ├──1006:QueryTableScan(table=[[QUERY, T3]], fields=[[ID, C301, C302, 
> C303]])
>    └──1037:LogicalAggregate(group=[{0}], EXPR$0=[SUM($1)])
>       └──1035:LogicalProject(ID6=[$0], $f0=[+($2, $6)])
>          └──1033:LogicalJoin(condition=[=($1, $5)], joinType=[left])
>             ├──1024:LogicalFilter(condition=[=($0, $0)])
>             |  └──1007:QueryTableScan(table=[[QUERY, T2]], fields=[[ID, C201, 
> C202, C203, C204]])
>             └──1031:LogicalAggregate(group=[{0}], EXPR$0=[SUM($1)])
>                └──1029:LogicalProject(ID=[$0], C101=[$1])
>                   └──1027:LogicalFilter(condition=[=($0, $0)])
>                      └──1009:QueryTableScan(table=[[QUERY, T1]], fields=[[ID, 
> C101]])
> {code}
> This tree now has more projection expressions than the original query, which 
> is fully incorrect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to