[ 
https://issues.apache.org/jira/browse/CALCITE-7463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18074378#comment-18074378
 ] 

Steven Phillips commented on CALCITE-7463:
------------------------------------------

The point of optimizing queries is for them to perform faster with less 
resources. It's the goal of a query optimizer. And the unoptimized query in 
this case could potentially be very expensive, e.g. if the limit is much 
higher, and there are more than just 2 instances. And I'm generally not a fan 
of removing optimizations that give correct results.

My point is, in this particular case, it is entirely, 100% up to the optimizer 
to decide what to return.  When I request 2 arbitrary rows from a relation, 
without specifying an order by, it is valid for the optimizer to choose any 
plan whatsoever that satisfies that request. If it decides that a plan that 
returns the exact same result set every time is the best plan, that is 
perfectly fine. If the planner decides some other plan, which potentially 
returns different results each time is best, that's also valid. The planner can 
also choose a plan that returns the same result every time in some cases, and a 
plan that returns different results every time in other cases, and that's still 
fine. The only requirement is that planner gives a plan that returns results 
that are consistent with what the query requests.

So I disagree with [~julianhyde] slightly, in that, I think this optimization 
is valid if we don't consider CTEs/subqueries to be repeatable, which is 
different from how we should handle the RANDOM() function, as RANDOM() comes 
with an expectation of randomness, whereas LIMIT means give me any result that 
matches, I don't care how.

In other words, I think


{code:sql}
select random()
union
select random()
{code}

should not be reduced, but if we treat it as a REPEATABLE CTE like Julian is 
suggesting, then it could be reduced, since that's what repeatable cte means.

but 

{code:sql}
(SELECT mgr, comm FROM emp LIMIT 2)
UNION
(SELECT mgr, comm FROM emp LIMIT 2)
{code}

has no expectation of randomness, so reducing it is valid either way.

Now, if we decide that it makes things simpler to handle by treating the 
"arbitrary" queries like LIMIT the same as "explcit random" queries, that could 
be a valid decision we could make. 

And I still am confused why anyone would want to make sure it doesn't reduce 
this case. In many cases, this query would return the same result anyway, e.g. 
when the table is small, it's likely many engines would start reading from the 
beginning of the underlying file. If the goal is to force some 
randomness/arbitrariness into the execution engine, this isn't the way to do it.

> UnionToFilterRule incorrectly rewrites UNION with LIMIT
> -------------------------------------------------------
>
>                 Key: CALCITE-7463
>                 URL: https://issues.apache.org/jira/browse/CALCITE-7463
>             Project: Calcite
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.41.0
>            Reporter: Zhen Chen
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.42.0
>
>
> The {{UnionToFilterRule}} produces incorrect results when applied to inputs 
> that contain {{{}LIMIT{}}}.
> Specifically, the rule incorrectly collapses:
> {code:java}
> (SELECT mgr, comm FROM emp LIMIT 2)
> UNION
> (SELECT mgr, comm FROM emp LIMIT 2) {code}
> into:
> {code:java}
> SELECT DISTINCT mgr, comm FROM emp LIMIT 2 {code}
> This transformation is {*}not semantically equivalent{*}.
> *Reproduction*
> SQL
> {code:java}
> (SELECT mgr, comm FROM emp LIMIT 2)
> UNION
> (SELECT mgr, comm FROM emp LIMIT 2) {code}
> h4. Plan Before
> {code:java}
> LogicalUnion(all=[false])
>   LogicalSort(fetch=[2])
>     LogicalProject(MGR=[$3], COMM=[$6])
>       LogicalTableScan(table=[[CATALOG, SALES, EMP]])
>   LogicalSort(fetch=[2])
>     LogicalProject(MGR=[$3], COMM=[$6])
>       LogicalTableScan(table=[[CATALOG, SALES, EMP]]) {code}
> *Plan After (Incorrect)*
> {code:java}
> LogicalAggregate(group=[{0, 1}])
>   LogicalSort(fetch=[2])
>     LogicalProject(MGR=[$3], COMM=[$6])
>       LogicalTableScan(table=[[CATALOG, SALES, EMP]]) {code}
> *Expected Behavior*
> The transformation should NOT be applied when any input of UNION contains 
> LogicalSort(That contains ORDER BY, LIMIT, OFFSET). 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to