[ 
https://issues.apache.org/jira/browse/SPARK-36113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824521#comment-17824521
 ] 

Jack Chen edited comment on SPARK-36113 at 3/7/24 8:47 PM:
-----------------------------------------------------------

The tricky aspect of this is that if we do the count bug handling in 
DecorrelateInnerQuery, we end up adding an extra outer join for scalar 
subqueries with count. This is because scalar subqueries always require a left 
join anyway (even without count) because if there's no data coming out of the 
subquery we still need to keep the row - that's added in constructLeftJoins. 
And if we handle count bug in DecorrelateInnerQuery using the existing logic, 
it would create another left join, which may be in the middle of the subquery 
plan.

I think to solve this we want to check whether this is a count agg at the top 
of the scalar subquery, and specially handle that case to generate a single 
outer join. Perhaps constructLeftJoins can check if there's a left DomainJoin 
at the top of the subquery plan and flip it to inner. This way all the logic 
for determining whether there's a count bug happens earlier in 
DecorrelateInnerQuery, which unifies the logic and allows us to avoid a whole 
class of bugs with the doing the scalar subquery count bug handling later - see 
e.g. [https://github.com/apache/spark/pull/40811] and 
[https://github.com/apache/spark/pull/45125]. And see 
[https://docs.google.com/document/d/1YCce0DtJ6NkMP1QM1nnfYvkiH1CkHoGyMjKQ8MX5bUM/edit]
 for more on why this decorrelation happens in two steps.


was (Author: JIRAUSER299035):
The tricky aspect of this is that if we do the count bug handling in 
DecorrelateInnerQuery, we end up adding an extra outer join for scalar 
subqueries with count. This is because scalar subqueries always require a left 
join anyway (even without count) because if there's no data coming out of the 
subquery we still need to keep the row - that's added in constructLeftJoins. 
And if we handle count bug in DecorrelateInnerQuery using the existing logic, 
it would create another left join, which may be in the middle of the subquery 
plan.

I think to solve this we want to check whether this is a count agg at the top 
of the scalar subquery, and specially handle that case to generate a single 
outer join. Adding that outer join still needs to happen in the same place 
(rewriteDomainJoins), but perhaps we can move the logic of checking whether the 
subquery is affected by count bug all to DecorrelateInnerQuery, and that just 
leaves a flag for rewriteDomainJoins/constructLeftJoins later on.

> Unify the logic to handle COUNT bug for scalar and lateral subqueries
> ---------------------------------------------------------------------
>
>                 Key: SPARK-36113
>                 URL: https://issues.apache.org/jira/browse/SPARK-36113
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Allison Wang
>            Priority: Major
>
> Currently, for scalar subqueries, the logic to handle the count bug is in 
> `RewriteCorrelatedScalarSubquery` (constructLeftJoins) while lateral 
> subqueries rely on `DecorrelateInnerQuery` to handle the COUNT bug. 
> We should unify them without introducing additional left outer joins for 
> scalar subqueries.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to