[ https://issues.apache.org/jira/browse/SPARK-36113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824521#comment-17824521 ]
Jack Chen edited comment on SPARK-36113 at 3/7/24 8:52 PM: ----------------------------------------------------------- The tricky aspect of this is that if we do the count bug handling in DecorrelateInnerQuery, we end up adding an extra outer join for scalar subqueries with count. This is because scalar subqueries always require a left join anyway (even without count) because if there's no data coming out of the subquery we still need to keep the row - that's added in constructLeftJoins. And if we handle count bug in DecorrelateInnerQuery using the existing logic, it would create another left join, which may be in the middle of the subquery plan. I think to solve this we want to check whether this is a count agg at the top of the scalar subquery, and specially handle that case to generate a single outer join. Perhaps constructLeftJoins can check if there's a left DomainJoin at the top of the subquery plan and flip it to inner - but then we also need to make sure the expression for count bug handling {{if(isnull...), 0, cnt)}} is in the right place, in the newly built outer join. In general, I think we want to move the logic for determining whether there's a count bug earlier to DecorrelateInnerQuery, which unifies the logic between scalar and non-scalar subqueries, and allows us to avoid a whole class of bugs with the doing the scalar subquery count bug handling later - see e.g. [https://github.com/apache/spark/pull/40811] and [https://github.com/apache/spark/pull/45125]. And see [https://docs.google.com/document/d/1YCce0DtJ6NkMP1QM1nnfYvkiH1CkHoGyMjKQ8MX5bUM/edit] for more on why this decorrelation happens in two steps. was (Author: JIRAUSER299035): The tricky aspect of this is that if we do the count bug handling in DecorrelateInnerQuery, we end up adding an extra outer join for scalar subqueries with count. This is because scalar subqueries always require a left join anyway (even without count) because if there's no data coming out of the subquery we still need to keep the row - that's added in constructLeftJoins. And if we handle count bug in DecorrelateInnerQuery using the existing logic, it would create another left join, which may be in the middle of the subquery plan. I think to solve this we want to check whether this is a count agg at the top of the scalar subquery, and specially handle that case to generate a single outer join. Perhaps constructLeftJoins can check if there's a left DomainJoin at the top of the subquery plan and flip it to inner. This way all the logic for determining whether there's a count bug happens earlier in DecorrelateInnerQuery, which unifies the logic and allows us to avoid a whole class of bugs with the doing the scalar subquery count bug handling later - see e.g. [https://github.com/apache/spark/pull/40811] and [https://github.com/apache/spark/pull/45125]. And see [https://docs.google.com/document/d/1YCce0DtJ6NkMP1QM1nnfYvkiH1CkHoGyMjKQ8MX5bUM/edit] for more on why this decorrelation happens in two steps. > Unify the logic to handle COUNT bug for scalar and lateral subqueries > --------------------------------------------------------------------- > > Key: SPARK-36113 > URL: https://issues.apache.org/jira/browse/SPARK-36113 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 3.2.0 > Reporter: Allison Wang > Priority: Major > > Currently, for scalar subqueries, the logic to handle the count bug is in > `RewriteCorrelatedScalarSubquery` (constructLeftJoins) while lateral > subqueries rely on `DecorrelateInnerQuery` to handle the COUNT bug. > We should unify them without introducing additional left outer joins for > scalar subqueries. > -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org