[ https://issues.apache.org/jira/browse/SPARK-36113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824521#comment-17824521 ]
Jack Chen edited comment on SPARK-36113 at 3/7/24 8:47 PM: ----------------------------------------------------------- The tricky aspect of this is that if we do the count bug handling in DecorrelateInnerQuery, we end up adding an extra outer join for scalar subqueries with count. This is because scalar subqueries always require a left join anyway (even without count) because if there's no data coming out of the subquery we still need to keep the row - that's added in constructLeftJoins. And if we handle count bug in DecorrelateInnerQuery using the existing logic, it would create another left join, which may be in the middle of the subquery plan. I think to solve this we want to check whether this is a count agg at the top of the scalar subquery, and specially handle that case to generate a single outer join. Perhaps constructLeftJoins can check if there's a left DomainJoin at the top of the subquery plan and flip it to inner. This way all the logic for determining whether there's a count bug happens earlier in DecorrelateInnerQuery, which unifies the logic and allows us to avoid a whole class of bugs with the doing the scalar subquery count bug handling later - see e.g. [https://github.com/apache/spark/pull/40811] and [https://github.com/apache/spark/pull/45125]. And see [https://docs.google.com/document/d/1YCce0DtJ6NkMP1QM1nnfYvkiH1CkHoGyMjKQ8MX5bUM/edit] for more on why this decorrelation happens in two steps. was (Author: JIRAUSER299035): The tricky aspect of this is that if we do the count bug handling in DecorrelateInnerQuery, we end up adding an extra outer join for scalar subqueries with count. This is because scalar subqueries always require a left join anyway (even without count) because if there's no data coming out of the subquery we still need to keep the row - that's added in constructLeftJoins. And if we handle count bug in DecorrelateInnerQuery using the existing logic, it would create another left join, which may be in the middle of the subquery plan. I think to solve this we want to check whether this is a count agg at the top of the scalar subquery, and specially handle that case to generate a single outer join. Adding that outer join still needs to happen in the same place (rewriteDomainJoins), but perhaps we can move the logic of checking whether the subquery is affected by count bug all to DecorrelateInnerQuery, and that just leaves a flag for rewriteDomainJoins/constructLeftJoins later on. > Unify the logic to handle COUNT bug for scalar and lateral subqueries > --------------------------------------------------------------------- > > Key: SPARK-36113 > URL: https://issues.apache.org/jira/browse/SPARK-36113 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 3.2.0 > Reporter: Allison Wang > Priority: Major > > Currently, for scalar subqueries, the logic to handle the count bug is in > `RewriteCorrelatedScalarSubquery` (constructLeftJoins) while lateral > subqueries rely on `DecorrelateInnerQuery` to handle the COUNT bug. > We should unify them without introducing additional left outer joins for > scalar subqueries. > -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org