[ 
https://issues.apache.org/jira/browse/CALCITE-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Darpan Lunagariya (e6data computing) updated CALCITE-7551:
----------------------------------------------------------
    Description: 
Several optimizer and {{RelBuilder}}-level transformations inline a 
lower-projection expression into multiple positions of an upper expression 
without checking whether the inlined expression is non-deterministic. When the 
lower expression is non-deterministic in the strict sense (i.e. two calls 
within the same statement may return different values — {{RAND()}}, sequence 
{{NEXTVAL}}, UUID generators), this silently converts one evaluation into many, 
changing query semantics.

{{CURRENT_TIMESTAMP}} and the other time functions are not in scope: Calcite 
considers them deterministic per statement (see the "Note on dynamic vs 
non-deterministic" section below) and they are safe to duplicate.

h3. Minimal repro

{code:sql}
SELECT a, a + 1 AS b
FROM (SELECT rand() AS a)
{code}

h3. Current plan produced by {{SqlToRelConverter}}:

{noformat}
LogicalProject(A=[RAND()], B=[+(RAND(), 1)])
LogicalValues(tuples=[[{ 0 }]])
{noformat}

The two {{RAND()}} calls are evaluated independently, so {{B - A}} is no longer 
always {{1}}. The expected plan preserves the single evaluation by referencing 
the projected column:

{noformat}
LogicalProject(A=[$0], B=[+($0, 1)])
LogicalProject(A=[RAND()])
    LogicalValues(tuples=[[{ 0 }]])
{noformat}

h3. Affected code paths (confirmed by reproducer tests)

||Component||Mechanism||
|{{RelOptUtil.pushPastProjectUnlessBloat}}|The shuttle returned by 
{{pushShuttle(Project)}} substitutes every {{RexInputRef}} with the underlying 
projection. Only {{RexOver}} is guarded; determinism is not.|
|{{RelBuilder.project_}} (bloat merge)|Calls {{pushPastProjectUnlessBloat}} 
when merging adjacent {{Project}}s. This is why the bug surfaces in the very 
first {{RelNode}} produced by {{SqlToRelConverter}}, before any rule fires.|
|{{ProjectMergeRule}}|Calls {{pushPastProjectUnlessBloat}}. Test: 1 → 2 
{{RAND()}} calls.|
|{{FilterProjectTransposeRule}}|Calls {{pushPastProjectUnlessBloat}} on the 
filter condition, and re-emits the original projection above the new filter. 
Even with the helper guarded, the rebuilt {{Project(rand() AS a)}} on top of
{{Filter(rand() > ...)}} splits one evaluation into two. Test: 1 → 2.|
|{{JoinProjectTransposeRule}}|Builds {{RexProgram}}s for projection + join 
condition, merges them via {{RexProgramBuilder.mergePrograms}}, then calls 
{{mergedProgram.expandLocalRef(condition)}}. The expansion flattens local refs
back into a {{RexNode}} tree, inlining the underlying expression at every 
occurrence. Test: 1 → 3.|
|{{SemiJoinProjectTransposeRule}}|Same mechanism as 
{{JoinProjectTransposeRule}} — uses {{RexProgramBuilder.mergePrograms}} + 
{{expandLocalRef}} to inline projected expressions into the semi-join 
condition. Test: 1 → 2.|

h3. Confirmed unaffected (by the same tests)

- {{CalcMergeRule}} — {{RexProgram}} represents intermediate values as 
{{RexLocalRef}}s, so the merged program contains a single {{expr#N=[RAND()]}} 
referenced multiple times. Sharing is preserved.
- {{ProjectReduceExpressionsRule}} — refuses to fold non-deterministic terms.
- {{ProjectValuesMergeRule}} / {{ValuesReduceRule}} — does not fire when the 
projected expression is non-reducible.
- {{ProjectFilterTransposeRule}}, {{ProjectJoinTransposeRule}}, 
{{ProjectCorrelateTransposeRule}}, {{AggregateProjectMergeRule}} — confirmed 
safe on tested patterns.
- {{FilterSetOpTransposeRule}} — textual count of {{RAND(}} doubles when the 
filter is cloned to each {{UNION ALL}} branch, but this is NOT a semantic bug: 
any single row only flows through one branch's filter, so per-row
evaluation count is unchanged.

h3. Note on dynamic vs non-deterministic ({{CURRENT_TIMESTAMP}})

Calcite already distinguishes "non-deterministic" from "dynamic" via two 
independent operator flags, although the naming is non-obvious:

||Operator||{{isDeterministic()}}||{{isDynamicFunction()}}||Semantics||
|{{RAND()}}|false|false|Different value per call site, even within one 
statement|
|{{CURRENT_TIMESTAMP}}|true (default; not overridden)|true|Same value at every 
call site within one statement; can differ across executions|

So in Calcite's model {{CURRENT_TIMESTAMP}} is "deterministic per statement" — 
duplicating its call sites is semantically harmless because all occurrences 
resolve to the same instant at runtime. The {{isDynamicFunction()}} flag 
governs plan-cache invalidation across executions, not duplication safety 
within one execution.

The fix uses {{RexUtil.isDeterministic(...)}}, which already encodes this 
distinction correctly: {{RAND}} is blocked from duplication; 
{{CURRENT_TIMESTAMP}} (and the rest of {{SqlAbstractTimeFunction}}) is left 
alone. No refinement of Calcite's determinism notion is required for this fix, 
although the comment thread notes that the two-flag model is worth revisiting 
more broadly — that is out of scope here.

h3. Proposed fix

Four small changes, all in the spirit of the existing "skip rule if it would 
touch non-determinism" pattern already used elsewhere in Calcite 
({{ProjectReduceExpressionsRule}}, the {{containsOver}} skip in transpose 
rules, etc.):

{{RelOptUtil.pushPastProjectUnlessBloat}} — count {{RexInputRef}} uses per 
bottom slot; if any bottom expression is non-deterministic 
({{!RexUtil.isDeterministic(...)}}) and is referenced more than once, return 
{{null}}. Fixes 
{{ProjectMergeRule}}, {{FilterProjectTransposeRule}}'s helper call, and the 
{{RelBuilder}} bloat merge in one place.

{{FilterProjectTransposeRule}} — additionally skip when any projected 
expression is non-deterministic, parallel to the existing {{containsOver}} 
skip. Required because the rule re-emits the original projection above the new 
filter.

{{JoinProjectTransposeRule}} — skip non-deterministic projections, parallel to 
the existing {{containsOver}} skip on each side.

{{SemiJoinProjectTransposeRule}} — same skip as {{JoinProjectTransposeRule}} 
(uses the same {{mergePrograms}} + {{expandLocalRef}} pattern).

  was:
Several optimizer and {{RelBuilder}}-level transformations inline a 
lower-projection expression into multiple positions of an upper expression 
without checking whether the inlined expression is non-deterministic. When the 
lower
expression is non-deterministic in the strict sense (i.e. two calls within the 
same statement may return different values — {{RAND()}}, sequence {{NEXTVAL}}, 
UUID generators), this silently converts one evaluation into many,
changing query semantics.

{{CURRENT_TIMESTAMP}} and the other time functions are not in scope: Calcite 
considers them deterministic per statement (see the "Note on dynamic vs 
non-deterministic" section below) and they are safe to duplicate.

h3. Minimal repro

{code:sql}
SELECT a, a + 1 AS b
FROM (SELECT rand() AS a)
{code}

h3. Current plan produced by {{SqlToRelConverter}}:

{noformat}
LogicalProject(A=[RAND()], B=[+(RAND(), 1)])
LogicalValues(tuples=[[{ 0 }]])
{noformat}

The two {{RAND()}} calls are evaluated independently, so {{B - A}} is no longer 
always {{1}}. The expected plan preserves the single evaluation by referencing 
the projected column:

{noformat}
LogicalProject(A=[$0], B=[+($0, 1)])
LogicalProject(A=[RAND()])
    LogicalValues(tuples=[[{ 0 }]])
{noformat}

h3. Affected code paths (confirmed by reproducer tests)

||Component||Mechanism||
|{{RelOptUtil.pushPastProjectUnlessBloat}}|The shuttle returned by 
{{pushShuttle(Project)}} substitutes every {{RexInputRef}} with the underlying 
projection. Only {{RexOver}} is guarded; determinism is not.|
|{{RelBuilder.project_}} (bloat merge)|Calls {{pushPastProjectUnlessBloat}} 
when merging adjacent {{Project}}s. This is why the bug surfaces in the very 
first {{RelNode}} produced by {{SqlToRelConverter}}, before any rule fires.|
|{{ProjectMergeRule}}|Calls {{pushPastProjectUnlessBloat}}. Test: 1 → 2 
{{RAND()}} calls.|
|{{FilterProjectTransposeRule}}|Calls {{pushPastProjectUnlessBloat}} on the 
filter condition, and re-emits the original projection above the new filter. 
Even with the helper guarded, the rebuilt {{Project(rand() AS a)}} on top of
{{Filter(rand() > ...)}} splits one evaluation into two. Test: 1 → 2.|
|{{JoinProjectTransposeRule}}|Builds {{RexProgram}}s for projection + join 
condition, merges them via {{RexProgramBuilder.mergePrograms}}, then calls 
{{mergedProgram.expandLocalRef(condition)}}. The expansion flattens local refs
back into a {{RexNode}} tree, inlining the underlying expression at every 
occurrence. Test: 1 → 3.|
|{{SemiJoinProjectTransposeRule}}|Same mechanism as 
{{JoinProjectTransposeRule}} — uses {{RexProgramBuilder.mergePrograms}} + 
{{expandLocalRef}} to inline projected expressions into the semi-join 
condition. Test: 1 → 2.|

h3. Confirmed unaffected (by the same tests)

- {{CalcMergeRule}} — {{RexProgram}} represents intermediate values as 
{{RexLocalRef}}s, so the merged program contains a single {{expr#N=[RAND()]}} 
referenced multiple times. Sharing is preserved.
- {{ProjectReduceExpressionsRule}} — refuses to fold non-deterministic terms.
- {{ProjectValuesMergeRule}} / {{ValuesReduceRule}} — does not fire when the 
projected expression is non-reducible.
- {{ProjectFilterTransposeRule}}, {{ProjectJoinTransposeRule}}, 
{{ProjectCorrelateTransposeRule}}, {{AggregateProjectMergeRule}} — confirmed 
safe on tested patterns.
- {{FilterSetOpTransposeRule}} — textual count of {{RAND(}} doubles when the 
filter is cloned to each {{UNION ALL}} branch, but this is NOT a semantic bug: 
any single row only flows through one branch's filter, so per-row
evaluation count is unchanged.

h3. Note on dynamic vs non-deterministic ({{CURRENT_TIMESTAMP}})

Calcite already distinguishes "non-deterministic" from "dynamic" via two 
independent operator flags, although the naming is non-obvious:

||Operator||{{isDeterministic()}}||{{isDynamicFunction()}}||Semantics||
|{{RAND()}}|false|false|Different value per call site, even within one 
statement|
|{{CURRENT_TIMESTAMP}}|true (default; not overridden)|true|Same value at every 
call site within one statement; can differ across executions|

So in Calcite's model {{CURRENT_TIMESTAMP}} is "deterministic per statement" — 
duplicating its call sites is semantically harmless because all occurrences 
resolve to the same instant at runtime. The {{isDynamicFunction()}} flag
governs plan-cache invalidation across executions, not duplication safety 
within one execution.

The fix uses {{RexUtil.isDeterministic(...)}}, which already encodes this 
distinction correctly: {{RAND}} is blocked from duplication; 
{{CURRENT_TIMESTAMP}} (and the rest of {{SqlAbstractTimeFunction}}) is left 
alone. No refinement
of Calcite's determinism notion is required for this fix, although the comment 
thread notes that the two-flag model is worth revisiting more broadly — that is 
out of scope here.

h3. Proposed fix

Four small changes, all in the spirit of the existing "skip rule if it would 
touch non-determinism" pattern already used elsewhere in Calcite 
({{ProjectReduceExpressionsRule}}, the {{containsOver}} skip in transpose 
rules, etc.):

{{RelOptUtil.pushPastProjectUnlessBloat}} — count {{RexInputRef}} uses per 
bottom slot; if any bottom expression is non-deterministic 
({{!RexUtil.isDeterministic(...)}}) and is referenced more than once, return 
{{null}}. Fixes 
{{ProjectMergeRule}}, {{FilterProjectTransposeRule}}'s helper call, and the 
{{RelBuilder}} bloat merge in one place.

{{FilterProjectTransposeRule}} — additionally skip when any projected 
expression is non-deterministic, parallel to the existing {{containsOver}} 
skip. Required because the rule re-emits the original projection above the new 
filter.

{{JoinProjectTransposeRule}} — skip non-deterministic projections, parallel to 
the existing {{containsOver}} skip on each side.

{{SemiJoinProjectTransposeRule}} — same skip as {{JoinProjectTransposeRule}} 
(uses the same {{mergePrograms}} + {{expandLocalRef}} pattern).


> Project/Filter/Join transpose and merge rules can duplicate non-deterministic 
> expressions (e.g. RAND())
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CALCITE-7551
>                 URL: https://issues.apache.org/jira/browse/CALCITE-7551
>             Project: Calcite
>          Issue Type: Bug
>    Affects Versions: 1.41.0
>            Reporter: Darpan Lunagariya (e6data computing)
>            Assignee: Darpan Lunagariya (e6data computing)
>            Priority: Major
>
> Several optimizer and {{RelBuilder}}-level transformations inline a 
> lower-projection expression into multiple positions of an upper expression 
> without checking whether the inlined expression is non-deterministic. When 
> the lower expression is non-deterministic in the strict sense (i.e. two calls 
> within the same statement may return different values — {{RAND()}}, sequence 
> {{NEXTVAL}}, UUID generators), this silently converts one evaluation into 
> many, changing query semantics.
> {{CURRENT_TIMESTAMP}} and the other time functions are not in scope: Calcite 
> considers them deterministic per statement (see the "Note on dynamic vs 
> non-deterministic" section below) and they are safe to duplicate.
> h3. Minimal repro
> {code:sql}
> SELECT a, a + 1 AS b
> FROM (SELECT rand() AS a)
> {code}
> h3. Current plan produced by {{SqlToRelConverter}}:
> {noformat}
> LogicalProject(A=[RAND()], B=[+(RAND(), 1)])
> LogicalValues(tuples=[[{ 0 }]])
> {noformat}
> The two {{RAND()}} calls are evaluated independently, so {{B - A}} is no 
> longer always {{1}}. The expected plan preserves the single evaluation by 
> referencing the projected column:
> {noformat}
> LogicalProject(A=[$0], B=[+($0, 1)])
> LogicalProject(A=[RAND()])
>     LogicalValues(tuples=[[{ 0 }]])
> {noformat}
> h3. Affected code paths (confirmed by reproducer tests)
> ||Component||Mechanism||
> |{{RelOptUtil.pushPastProjectUnlessBloat}}|The shuttle returned by 
> {{pushShuttle(Project)}} substitutes every {{RexInputRef}} with the 
> underlying projection. Only {{RexOver}} is guarded; determinism is not.|
> |{{RelBuilder.project_}} (bloat merge)|Calls {{pushPastProjectUnlessBloat}} 
> when merging adjacent {{Project}}s. This is why the bug surfaces in the very 
> first {{RelNode}} produced by {{SqlToRelConverter}}, before any rule fires.|
> |{{ProjectMergeRule}}|Calls {{pushPastProjectUnlessBloat}}. Test: 1 → 2 
> {{RAND()}} calls.|
> |{{FilterProjectTransposeRule}}|Calls {{pushPastProjectUnlessBloat}} on the 
> filter condition, and re-emits the original projection above the new filter. 
> Even with the helper guarded, the rebuilt {{Project(rand() AS a)}} on top of
> {{Filter(rand() > ...)}} splits one evaluation into two. Test: 1 → 2.|
> |{{JoinProjectTransposeRule}}|Builds {{RexProgram}}s for projection + join 
> condition, merges them via {{RexProgramBuilder.mergePrograms}}, then calls 
> {{mergedProgram.expandLocalRef(condition)}}. The expansion flattens local refs
> back into a {{RexNode}} tree, inlining the underlying expression at every 
> occurrence. Test: 1 → 3.|
> |{{SemiJoinProjectTransposeRule}}|Same mechanism as 
> {{JoinProjectTransposeRule}} — uses {{RexProgramBuilder.mergePrograms}} + 
> {{expandLocalRef}} to inline projected expressions into the semi-join 
> condition. Test: 1 → 2.|
> h3. Confirmed unaffected (by the same tests)
> - {{CalcMergeRule}} — {{RexProgram}} represents intermediate values as 
> {{RexLocalRef}}s, so the merged program contains a single {{expr#N=[RAND()]}} 
> referenced multiple times. Sharing is preserved.
> - {{ProjectReduceExpressionsRule}} — refuses to fold non-deterministic terms.
> - {{ProjectValuesMergeRule}} / {{ValuesReduceRule}} — does not fire when the 
> projected expression is non-reducible.
> - {{ProjectFilterTransposeRule}}, {{ProjectJoinTransposeRule}}, 
> {{ProjectCorrelateTransposeRule}}, {{AggregateProjectMergeRule}} — confirmed 
> safe on tested patterns.
> - {{FilterSetOpTransposeRule}} — textual count of {{RAND(}} doubles when the 
> filter is cloned to each {{UNION ALL}} branch, but this is NOT a semantic 
> bug: any single row only flows through one branch's filter, so per-row
> evaluation count is unchanged.
> h3. Note on dynamic vs non-deterministic ({{CURRENT_TIMESTAMP}})
> Calcite already distinguishes "non-deterministic" from "dynamic" via two 
> independent operator flags, although the naming is non-obvious:
> ||Operator||{{isDeterministic()}}||{{isDynamicFunction()}}||Semantics||
> |{{RAND()}}|false|false|Different value per call site, even within one 
> statement|
> |{{CURRENT_TIMESTAMP}}|true (default; not overridden)|true|Same value at 
> every call site within one statement; can differ across executions|
> So in Calcite's model {{CURRENT_TIMESTAMP}} is "deterministic per statement" 
> — duplicating its call sites is semantically harmless because all occurrences 
> resolve to the same instant at runtime. The {{isDynamicFunction()}} flag 
> governs plan-cache invalidation across executions, not duplication safety 
> within one execution.
> The fix uses {{RexUtil.isDeterministic(...)}}, which already encodes this 
> distinction correctly: {{RAND}} is blocked from duplication; 
> {{CURRENT_TIMESTAMP}} (and the rest of {{SqlAbstractTimeFunction}}) is left 
> alone. No refinement of Calcite's determinism notion is required for this 
> fix, although the comment thread notes that the two-flag model is worth 
> revisiting more broadly — that is out of scope here.
> h3. Proposed fix
> Four small changes, all in the spirit of the existing "skip rule if it would 
> touch non-determinism" pattern already used elsewhere in Calcite 
> ({{ProjectReduceExpressionsRule}}, the {{containsOver}} skip in transpose 
> rules, etc.):
> {{RelOptUtil.pushPastProjectUnlessBloat}} — count {{RexInputRef}} uses per 
> bottom slot; if any bottom expression is non-deterministic 
> ({{!RexUtil.isDeterministic(...)}}) and is referenced more than once, return 
> {{null}}. Fixes 
> {{ProjectMergeRule}}, {{FilterProjectTransposeRule}}'s helper call, and the 
> {{RelBuilder}} bloat merge in one place.
> {{FilterProjectTransposeRule}} — additionally skip when any projected 
> expression is non-deterministic, parallel to the existing {{containsOver}} 
> skip. Required because the rule re-emits the original projection above the 
> new filter.
> {{JoinProjectTransposeRule}} — skip non-deterministic projections, parallel 
> to the existing {{containsOver}} skip on each side.
> {{SemiJoinProjectTransposeRule}} — same skip as {{JoinProjectTransposeRule}} 
> (uses the same {{mergePrograms}} + {{expandLocalRef}} pattern).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to