allisonwang-db opened a new pull request #32687:
URL: https://github.com/apache/spark/pull/32687


   ### What changes were proposed in this pull request?
   This PR refactors the `SubqueryExpression` class. It removes the `children` 
field from the `SubqueryExpression` constructor and adds `outerAttrs` and 
`joinCond` in its place. 
   
   ### Why are the changes needed?
   Currently, the children field of a subquery expression is used to store both 
collected outer references in the subquery plan and join conditions after 
correlated predicates are pulled up.
   
   For example:
   `SELECT (SELECT max(c1) FROM t1 WHERE t1.c1 = t2.c1) FROM t2`
   
   During the analysis phase, outer references in the subquery are stored in 
the `children` field: `scalar-subquery [t2.c1]`. After the optimizer rule 
`PullupCorrelatedPredicates` runs, the same field is used to store the join 
conditions, which contain both inner and outer references: 
`scalar-subquery [t1.c1 = t2.c1]`. This is why the `references` of 
`SubqueryExpression` exclude the inner plan's output:
   
https://github.com/apache/spark/blob/29ed1a2de42e7a663f764192fce157a9f23029b3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala#L68-L69
   
   This can be confusing and error-prone. The references for a subquery 
expression should always be defined as outer attribute references.
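
A minimal sketch of the difference, using simplified stand-in classes rather than Catalyst's real ones. Everything here (`Attr`, `Ref`, `EqualTo`, `OldSubquery`, `NewSubquery`, `planOutput`) is hypothetical and exists only for illustration; only the field names `children`, `outerAttrs`, and `joinCond` come from this PR. It assumes the inner plan exposes `t1.c1` after the predicate is pulled up.

```scala
// Toy model only: these are NOT Spark's classes or signatures.
object SubqueryRefsSketch {
  // An attribute identified by its qualified name, e.g. "t2.c1".
  final case class Attr(name: String)

  // A tiny expression tree: attribute references and equality predicates.
  sealed trait Expr { def references: Set[Attr] }
  final case class Ref(attr: Attr) extends Expr {
    def references: Set[Attr] = Set(attr)
  }
  final case class EqualTo(left: Expr, right: Expr) extends Expr {
    def references: Set[Attr] = left.references ++ right.references
  }

  // Old shape: `children` holds outer references after analysis and join
  // conditions after pull-up, so references must subtract the plan output.
  final case class OldSubquery(planOutput: Set[Attr], children: Seq[Expr]) {
    def references: Set[Attr] =
      children.flatMap(_.references).toSet -- planOutput
  }

  // New shape: outer references and join conditions live in separate fields,
  // so references are just the outer attribute references.
  final case class NewSubquery(
      planOutput: Set[Attr],
      outerAttrs: Seq[Expr],
      joinCond: Seq[Expr] = Seq.empty) {
    def references: Set[Attr] = outerAttrs.flatMap(_.references).toSet
  }

  val t1c1: Attr = Attr("t1.c1") // inner attribute
  val t2c1: Attr = Attr("t2.c1") // outer attribute

  // SELECT (SELECT max(c1) FROM t1 WHERE t1.c1 = t2.c1) FROM t2
  val oldAfterAnalysis: OldSubquery = OldSubquery(Set(t1c1), Seq(Ref(t2c1)))
  val oldAfterPullup: OldSubquery =
    OldSubquery(Set(t1c1), Seq(EqualTo(Ref(t1c1), Ref(t2c1))))

  val newAfterPullup: NewSubquery = NewSubquery(
    planOutput = Set(t1c1),
    outerAttrs = Seq(Ref(t2c1)),
    joinCond = Seq(EqualTo(Ref(t1c1), Ref(t2c1))))

  def main(args: Array[String]): Unit = {
    println(oldAfterAnalysis.references)
    println(oldAfterPullup.references)
    println(newAfterPullup.references)
  }
}
```

In this sketch both shapes report only `t2.c1` as a reference, but the old one gets there by subtracting the inner plan's output from a field whose meaning changes between phases, while the new one reads it directly from `outerAttrs`.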
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Existing tests.
   

