[ 
https://issues.apache.org/jira/browse/SPARK-36113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824521#comment-17824521
 ] 

Jack Chen edited comment on SPARK-36113 at 3/7/24 8:52 PM:
-----------------------------------------------------------

The tricky aspect of this is that if we do the count bug handling in 
DecorrelateInnerQuery, we end up adding an extra outer join for scalar 
subqueries with count. This is because scalar subqueries always require a left 
join anyway (even without count) because if there's no data coming out of the 
subquery we still need to keep the row - that's added in constructLeftJoins. 
And if we handle the count bug in DecorrelateInnerQuery using the existing 
logic, it would create another left join, which may be in the middle of the 
subquery plan.
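
As a rough illustration of why the left join is needed and where the count bug 
shows up, here is a minimal sketch using Python's built-in sqlite3 module 
rather than Spark. The tables t1 and t2 and their contents are made up for the 
example, and the SQL is only an analogy for the decorrelated plan, not what 
Spark actually generates.

```python
import sqlite3

# Hypothetical tables: t1 is the outer relation, t2 is the subquery relation.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (a INTEGER);
    CREATE TABLE t2 (a INTEGER);
    INSERT INTO t1 VALUES (1), (2);
    INSERT INTO t2 VALUES (1), (1);
""")

# Reference semantics: the correlated scalar subquery returns 0 for a = 2.
correlated = conn.execute("""
    SELECT t1.a, (SELECT count(*) FROM t2 WHERE t2.a = t1.a)
    FROM t1 ORDER BY t1.a
""").fetchall()

# Decorrelated with an inner join: the outer row a = 2 is lost entirely,
# which is why a left join is required even without count.
inner_join = conn.execute("""
    SELECT t1.a, s.cnt
    FROM t1 JOIN (SELECT a, count(*) AS cnt FROM t2 GROUP BY a) s
      ON t1.a = s.a
    ORDER BY t1.a
""").fetchall()

# Decorrelated with a left join: the row is kept, but count comes back as
# NULL instead of 0. This is the count bug.
left_join = conn.execute("""
    SELECT t1.a, s.cnt
    FROM t1 LEFT JOIN (SELECT a, count(*) AS cnt FROM t2 GROUP BY a) s
      ON t1.a = s.a
    ORDER BY t1.a
""").fetchall()

print(correlated)  # [(1, 2), (2, 0)]
print(inner_join)  # [(1, 2)]
print(left_join)   # [(1, 2), (2, None)]
```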

I think to solve this we want to check whether this is a count agg at the top 
of the scalar subquery, and specially handle that case to generate a single 
outer join. Perhaps constructLeftJoins can check if there's a left DomainJoin 
at the top of the subquery plan and flip it to inner - but then we also need to 
make sure the expression for count bug handling {{if(isnull(...), 0, cnt)}} is 
in the right place, in the newly built outer join.
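
To make the "one outer join vs. two" point concrete, here is a loose SQL 
analogy run through Python's sqlite3 module. It is only a sketch: the tables 
t1 and t2, the derived table d (standing in for the domain of outer values), 
and the CASE expression (standing in for the if(isnull ...) handling) are all 
assumptions for illustration and do not reproduce Spark's actual plans.

```python
import sqlite3

# Hypothetical tables: t1 is the outer relation, t2 is the subquery relation.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (a INTEGER);
    CREATE TABLE t2 (a INTEGER);
    INSERT INTO t1 VALUES (1), (2);
    INSERT INTO t2 VALUES (1), (1);
""")

# Shape to avoid: the count bug is handled inside the subquery with a left
# join against the domain of outer values, and then the scalar subquery
# rewrite adds a second left join on top.
two_joins = conn.execute("""
    SELECT t1.a, s.cnt
    FROM t1 LEFT JOIN (
        SELECT d.a AS a,
               CASE WHEN g.a IS NULL THEN 0 ELSE g.cnt END AS cnt
        FROM (SELECT DISTINCT a FROM t1) d
        LEFT JOIN (SELECT a, count(*) AS cnt FROM t2 GROUP BY a) g
          ON d.a = g.a
    ) s ON t1.a = s.a
    ORDER BY t1.a
""").fetchall()

# Desired shape: a single left join, with the count bug expression placed
# on top of that join.
single_join = conn.execute("""
    SELECT t1.a,
           CASE WHEN s.a IS NULL THEN 0 ELSE s.cnt END AS cnt
    FROM t1 LEFT JOIN (SELECT a, count(*) AS cnt FROM t2 GROUP BY a) s
      ON t1.a = s.a
    ORDER BY t1.a
""").fetchall()

print(two_joins)    # [(1, 2), (2, 0)]
print(single_join)  # [(1, 2), (2, 0)]
```

Both queries return the same rows; the difference is only in how many outer 
joins the plan carries and where the null-to-zero expression sits.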

In general, I think we want to move the logic for determining whether there's a 
count bug earlier, into DecorrelateInnerQuery, which unifies the logic between 
scalar and non-scalar subqueries and allows us to avoid a whole class of bugs 
that come from doing the scalar subquery count bug handling later - see e.g. 
[https://github.com/apache/spark/pull/40811] and 
[https://github.com/apache/spark/pull/45125]. And see 
[https://docs.google.com/document/d/1YCce0DtJ6NkMP1QM1nnfYvkiH1CkHoGyMjKQ8MX5bUM/edit]
 for more on why this decorrelation happens in two steps.


was (Author: JIRAUSER299035):
The tricky aspect of this is that if we do the count bug handling in 
DecorrelateInnerQuery, we end up adding an extra outer join for scalar 
subqueries with count. This is because scalar subqueries always require a left 
join anyway (even without count) because if there's no data coming out of the 
subquery we still need to keep the row - that's added in constructLeftJoins. 
And if we handle count bug in DecorrelateInnerQuery using the existing logic, 
it would create another left join, which may be in the middle of the subquery 
plan.

I think to solve this we want to check whether this is a count agg at the top 
of the scalar subquery, and specially handle that case to generate a single 
outer join. Perhaps constructLeftJoins can check if there's a left DomainJoin 
at the top of the subquery plan and flip it to inner. This way all the logic 
for determining whether there's a count bug happens earlier in 
DecorrelateInnerQuery, which unifies the logic and allows us to avoid a whole 
class of bugs with doing the scalar subquery count bug handling later - see 
e.g. [https://github.com/apache/spark/pull/40811] and 
[https://github.com/apache/spark/pull/45125]. And see 
[https://docs.google.com/document/d/1YCce0DtJ6NkMP1QM1nnfYvkiH1CkHoGyMjKQ8MX5bUM/edit]
 for more on why this decorrelation happens in two steps.

> Unify the logic to handle COUNT bug for scalar and lateral subqueries
> ---------------------------------------------------------------------
>
>                 Key: SPARK-36113
>                 URL: https://issues.apache.org/jira/browse/SPARK-36113
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Allison Wang
>            Priority: Major
>
> Currently, for scalar subqueries, the logic to handle the count bug is in 
> `RewriteCorrelatedScalarSubquery` (constructLeftJoins) while lateral 
> subqueries rely on `DecorrelateInnerQuery` to handle the COUNT bug. 
> We should unify them without introducing additional left outer joins for 
> scalar subqueries.
>  



