[
https://issues.apache.org/jira/browse/CALCITE-7463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18073297#comment-18073297
]
Alessandro Solimando edited comment on CALCITE-7463 at 4/13/26 7:40 PM:
------------------------------------------------------------------------
Summarizing my thoughts after the discussion here and reviewing the PR.
I agree that LIMIT without ORDER BY is nondeterministic. However, collapsing
two independently-evaluated LIMIT 2 branches into a single one strictly narrows
the set of possible outcomes (at most 2 rows instead of up to 4 before dedup).
Being conservative as [~jensen] suggests seems safer to me.
If we agree on this, I suggest we introduce a new metadata property (e.g.,
{_}isRowLimited{_}), capturing whether a subtree depends on row-limiting
operations, and use that to detect and bail out on the pattern.
Re. what Julian says around CTEs: this seems strongly related to [~zabetak]'s
work on {_}RelCommonExpressionSuggester{_}, I wonder if he has given the topic
a thought. IIRC _RelCommonExpressionBasicSuggester_ considers _deepEquals_
subtrees as identical and replaceable with a single CTE. Would it lead to the
same situation we have now?
Since Julian cites BigQuery as an example where this rewrite would be legal,
but there are potentially others where it is not, maybe we should have a
"method" in an interface, that downstream systems can override to decide, given
two (identical) sub-expressions, if they can be considered to return the same
identical result set?
At that point the rule would use that deciding point and bail out for cases
when the collapsing wouldn't be safe.
WDYT?
was (Author: asolimando):
Summarizing my thoughts after the discussion here and reviewing the PR.
I agree that LIMIT without ORDER BY is nondeterministic. However, collapsing
two independently-evaluated LIMIT 2 branches into a single one strictly narrows
the set of possible outcomes (at most 2 rows instead of up to 4 before dedup).
Being conservative as [~jensen] suggests seems safer to me.
If we agree on this, I suggest we introduce a new metadata property (e.g.,
{_}isRowLimited{_}), capturing whether a subtree depends on row-limiting
operations, and use that to detect and bail out on the pattern.
Re. what Julian says around CTEs: this seems strongly related to [~zabetak]'s
work on {_}RelCommonExpressionSuggester{_}, I wonder if he has given the topic
a thought. IIRC _RelCommonExpressionBasicSuggester_ considers _deepEquals_
subtrees as identical and replaceable with a single CTE. Would it lead to the
same situation we have now?
Since Julian cites BigQuery as an example where this rewrite would be legal,
but there are potentially others where it is not, maybe we should have a
"method" in an interface, that downstream systems can override to decide, given
two (identical) expressions, if they are the same or not?
At that point the rule would use that deciding point and bail out for cases
when the collapsing wouldn't be safe.
WDYT?
> UnionToFilterRule incorrectly rewrites UNION with LIMIT
> -------------------------------------------------------
>
> Key: CALCITE-7463
> URL: https://issues.apache.org/jira/browse/CALCITE-7463
> Project: Calcite
> Issue Type: Bug
> Components: core
> Affects Versions: 1.41.0
> Reporter: Zhen Chen
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.42.0
>
>
> The {{UnionToFilterRule}} produces incorrect results when applied to inputs
> that contain {{{}LIMIT{}}}.
> Specifically, the rule incorrectly collapses:
> {code:java}
> (SELECT mgr, comm FROM emp LIMIT 2)
> UNION
> (SELECT mgr, comm FROM emp LIMIT 2) {code}
> into:
> {code:java}
> SELECT DISTINCT mgr, comm FROM emp LIMIT 2 {code}
> This transformation is {*}not semantically equivalent{*}.
> *Reproduction*
> SQL
> {code:java}
> (SELECT mgr, comm FROM emp LIMIT 2)
> UNION
> (SELECT mgr, comm FROM emp LIMIT 2) {code}
> h4. Plan Before
> {code:java}
> LogicalUnion(all=[false])
> LogicalSort(fetch=[2])
> LogicalProject(MGR=[$3], COMM=[$6])
> LogicalTableScan(table=[[CATALOG, SALES, EMP]])
> LogicalSort(fetch=[2])
> LogicalProject(MGR=[$3], COMM=[$6])
> LogicalTableScan(table=[[CATALOG, SALES, EMP]]) {code}
> *Plan After (Incorrect)*
> {code:java}
> LogicalAggregate(group=[{0, 1}])
> LogicalSort(fetch=[2])
> LogicalProject(MGR=[$3], COMM=[$6])
> LogicalTableScan(table=[[CATALOG, SALES, EMP]]) {code}
> *Expected Behavior*
> The transformation should NOT be applied when any input of UNION contains
> LogicalSort(That contains ORDER BY, LIMIT, OFFSET).
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)