[ 
https://issues.apache.org/jira/browse/CALCITE-7622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yash updated CALCITE-7622:
--------------------------
    Description: 
JoinProjectTransposeRule must not match SEMI / ANTI join

 

The fundamental problem is that {{JoinProjectTransposeRule}} was designed 
around *regular joins* (INNER/OUTER), where the output row type is always 
{{left columns + right columns}}. SEMI and ANTI joins violate this 
assumption in a structural way.
h4. 1. The rule computes {{joinChildrenRowType}} with hardcoded {{INNER}}
{code:java}
final RelDataType joinChildrenRowType =
    SqlValidatorUtil.deriveJoinRowType(
        leftJoinChild.getRowType(),
        rightJoinChild.getRowType(),
        JoinRelType.INNER, // <-- always INNER
        ...);{code}
This is intentional for INNER/OUTER: it's building a "scratch" merged type to 
construct the RexProgram over both sides. But the assumption that {{left fields 
+ right fields = join output fields}} is {*}only true for non-SEMI/ANTI 
types{*}.
h4. 2. SEMI and ANTI drop the right side from their output row type

From {{SqlValidatorUtil.deriveJoinRowType}}:
{code:java}
case SEMI:
case ANTI:
  rightType = null; // right side is GONE from the output
  break;{code}
So when the rule later does:
{code:java}
final int nProjExprs = join.getRowType().getFieldCount();{code}
For a SEMI/ANTI join, {{join.getRowType()}} only has {*}left-side fields{*}. 
But the rule calculated {{projects}} over {{left fields + right fields}} (the 
INNER-typed scratch type). The counts don't match.
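The mismatch can be reproduced with a toy model. The sketch below is not Calcite code: {{JoinRowTypeSketch}}, {{JoinKind}} and {{deriveJoinFields}} are hypothetical names, and row types are reduced to lists of field names. It only illustrates that an INNER-typed scratch type and a SEMI output type over the same inputs have different field counts.
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of join row-type derivation (hypothetical, not Calcite's implementation). */
final class JoinRowTypeSketch {
  enum JoinKind { INNER, LEFT, RIGHT, FULL, SEMI, ANTI }

  /** Returns the output field names of a join over the two inputs. */
  static List<String> deriveJoinFields(
      List<String> left, List<String> right, JoinKind kind) {
    List<String> out = new ArrayList<>(left);
    // SEMI and ANTI only filter the left input, so right fields are dropped.
    if (kind != JoinKind.SEMI && kind != JoinKind.ANTI) {
      out.addAll(right);
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> left = Arrays.asList("a", "b", "c");
    List<String> right = Arrays.asList("x", "y");
    // Scratch type, always derived as INNER: 5 fields.
    System.out.println(deriveJoinFields(left, right, JoinKind.INNER)); // [a, b, c, x, y]
    // Actual output type of a SEMI join: 3 fields.
    System.out.println(deriveJoinFields(left, right, JoinKind.SEMI)); // [a, b, c]
  }
}{code}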
h4. 3. The projection built over {{newJoin}} does not match its row type

The rule creates:
{code:java}
final Join newJoin =
    join.copy(
        join.getTraitSet(),
        newCondition,
        leftJoinChild,
        rightJoinChild,
        join.getJoinType(), // SEMI or ANTI preserved here
        join.isSemiJoinDone());{code}
{{newJoin.getRowType()}} is re-derived via {{deriveJoinRowType}} with the 
actual SEMI/ANTI type — so it only has left-side fields. But the projection 
list {{newProjExprs}} was built assuming left + right fields exist. The loop:
 
{code:java}
for (int i = 0; i < nProjExprs; i++) {
  RexNode newExpr = mergedProgram.expandLocalRef(projList.get(i));
  ...
  newProjExprs.add(newExpr);
}{code}
...leaves {{newProjExprs}} holding {{RexInputRef}}s that point to right-side field indices that do not exist in {{newJoin}}'s row type. This can cause either:
 * an {{IndexOutOfBoundsException}} at planning time, or
 * silent plan corruption, where the wrong fields get referenced
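A minimal sketch of the dangling-reference check, assuming input references are modelled as plain integer indices into the join's output row. {{DanglingRefSketch}} and {{danglingRefs}} are hypothetical names used only for illustration and are not part of Calcite.
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy check for input refs that point past the end of a row type (hypothetical). */
final class DanglingRefSketch {
  /** Returns the ref indices that fall outside a row type with fieldCount fields. */
  static List<Integer> danglingRefs(List<Integer> refIndices, int fieldCount) {
    List<Integer> dangling = new ArrayList<>();
    for (int index : refIndices) {
      if (index < 0 || index >= fieldCount) {
        dangling.add(index);
      }
    }
    return dangling;
  }

  public static void main(String[] args) {
    // Refs built against a 5-field scratch type (3 left fields + 2 right fields).
    List<Integer> refs = Arrays.asList(0, 1, 3, 4);
    System.out.println(danglingRefs(refs, 5)); // [] : valid against the scratch type
    System.out.println(danglingRefs(refs, 3)); // [3, 4] : invalid against a SEMI output
  }
}{code}
With three left fields and two right fields, indices 3 and 4 are valid against the scratch type but have no counterpart in the SEMI join's three-field output.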

h4. 4. The {{isOuterJoin()}} adjustment path is also skipped

For OUTER joins, there's a correction step:
{code:java}
if (joinType.isOuterJoin()) {
  newExpr = newExpr.accept(new RelOptUtil.RexInputConverter(...));
}{code}
SEMI/ANTI return {{false}} for {{isOuterJoin()}}, so this adjustment never 
runs. Even if the field counts somehow survived, the {{RexInputRef}} indices 
would still be wrong relative to the new join's output.
h4. 5. The guards that exist don't help for SEMI/ANTI

The only type-based guards in {{onMatch}} are:
{code:java}
!joinType.generatesNullsOnLeft()  // generatesNullsOnLeft() is false for SEMI and ANTI, so this check passes
!joinType.generatesNullsOnRight() // generatesNullsOnRight() is false for SEMI and ANTI, so this check passes{code}
Since both methods return {{false}} for SEMI/ANTI, both negated checks pass and {*}neither project gets 
suppressed{*}. The rule merrily picks up both the left and right project 
children and proceeds into the broken code path.
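To show why the null-generation guards are not enough, here is a toy enum that mirrors the two methods named above. Everything in it is an assumption made for illustration: {{JoinGuardSketch}}, {{JoinKind}}, {{nullGuardsKeepBothProjects}} and {{mayTranspose}} are hypothetical, the enum is not Calcite's {{JoinRelType}}, and LEFT_MARK is not modelled. The actual fix may be implemented differently.
{code:java}
/** Toy model of the join-type guards (hypothetical, not Calcite's JoinRelType). */
final class JoinGuardSketch {
  enum JoinKind {
    INNER, LEFT, RIGHT, FULL, SEMI, ANTI;

    boolean generatesNullsOnLeft() {
      return this == RIGHT || this == FULL;
    }

    boolean generatesNullsOnRight() {
      return this == LEFT || this == FULL;
    }

    /** Whether the join output includes the right input's fields. */
    boolean projectsRight() {
      return this != SEMI && this != ANTI;
    }
  }

  /** The existing guards: both child projects are kept when neither side generates nulls. */
  static boolean nullGuardsKeepBothProjects(JoinKind kind) {
    return !kind.generatesNullsOnLeft() && !kind.generatesNullsOnRight();
  }

  /** One possible extra guard: only transpose when the join outputs both sides. */
  static boolean mayTranspose(JoinKind kind) {
    return kind.projectsRight();
  }

  public static void main(String[] args) {
    for (JoinKind kind : JoinKind.values()) {
      System.out.println(kind
          + " nullGuardsKeepBothProjects=" + nullGuardsKeepBothProjects(kind)
          + " mayTranspose=" + mayTranspose(kind));
    }
  }
}{code}
In this model SEMI and ANTI pass the null-generation guards exactly as INNER does, and only the extra output-shape check tells them apart.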
----
h3. Why the right-side project is especially dangerous

For a plan like:
{code:java}
Project(a, b) 
└─ Join[SEMI] 
    ├─ Project(a, b, c) ← left 
    └─ Project(x, y) ← right{code}
The right-side project is *semantically invisible* — SEMI join consumers never 
see right-side columns. But the rule pulls it up anyway, constructing a merged 
program that references right-side field indices. After the rule fires, the 
plan references columns that the SEMI join doesn't output.

  was:JoinProjectTransposeRule must not match SEMI / ANTI join


> Don't fire JoinProjectTransposeRule for ANTI/SEMI/LEFT_MARK JOIN
> ----------------------------------------------------------------
>
>                 Key: CALCITE-7622
>                 URL: https://issues.apache.org/jira/browse/CALCITE-7622
>             Project: Calcite
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.42.0
>            Reporter: Yash
>            Assignee: Yash
>            Priority: Minor
>              Labels: pull-request-available



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
