ahshahid opened a new pull request, #38714:
URL: https://github.com/apache/spark/pull/38714

   ### What changes were proposed in this pull request?
   This PR is an improvement.
   When a subquery references the outer query's aggregate functions, the analyzer 
in some cases introduces extra aggregate functions that are not needed. 
Although the optimizer eventually eliminates them, in the analyzer phase they 
still add overhead such as an extra Project node.
   The change is in the code that identifies the OuterReference in 
subquery.scala.
   Currently, whenever an aggregate expression is found, it is assumed to be the 
OuterReference.
   With this change, the code also checks whether the parent Expression can 
potentially be part of the OuterReference.
   Consider the query
   select cos(sum(a)), b from t1 group by b having exists (select 1 from t2 where x = cos(sum(a)))
   
   With this change, the OuterReference detected is cos(sum(a)) instead of just sum(a).
   As a result, no extra aggregate is added.
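
The difference between the two detection strategies can be sketched on a toy expression tree. This is only an illustration under stated assumptions: `Expr`, `Attr`, `Agg`, `Func`, `aggregateOnly` and `widest` are hypothetical names, not Spark's Catalyst classes or the code in subquery.scala, and the widening rule used here (take the largest subtree that contains an aggregate and references only outer attributes) is an assumed reading of "the parent Expression can also be part of the OuterReference".

```scala
// Toy expression tree. These are NOT Spark's Catalyst classes.
sealed trait Expr
case class Attr(name: String, fromOuter: Boolean) extends Expr
case class Agg(fn: String, child: Expr) extends Expr
case class Func(name: String, args: List[Expr]) extends Expr

object OuterRefSketch {
  def children(e: Expr): List[Expr] = e match {
    case Attr(_, _)    => Nil
    case Agg(_, child) => List(child)
    case Func(_, args) => args
  }

  def containsAgg(e: Expr): Boolean = e match {
    case Agg(_, _) => true
    case other     => children(other).exists(containsAgg)
  }

  // True when every attribute under `e` belongs to the outer query.
  def onlyOuterAttrs(e: Expr): Boolean = e match {
    case Attr(_, fromOuter) => fromOuter
    case other              => children(other).forall(onlyOuterAttrs)
  }

  // Behaviour described as current: each aggregate is itself the outer reference.
  def aggregateOnly(e: Expr): List[Expr] = e match {
    case a @ Agg(_, _) => List(a)
    case other         => children(other).flatMap(aggregateOnly)
  }

  // Assumed widening rule: pick the largest subtree that contains an
  // aggregate and touches no attribute of the subquery itself.
  def widest(e: Expr): List[Expr] =
    if (containsAgg(e) && onlyOuterAttrs(e)) List(e)
    else children(e).flatMap(widest)

  def show(e: Expr): String = e match {
    case Attr(n, _)    => n
    case Agg(fn, c)    => s"$fn(${show(c)})"
    case Func(n, args) => args.map(show).mkString(s"$n(", ", ", ")")
  }

  def main(args: Array[String]): Unit = {
    // Subquery predicate: x = cos(sum(a)), where x comes from t2 and a from t1.
    val predicate = Func("=", List(
      Attr("x", fromOuter = false),
      Func("cos", List(Agg("sum", Attr("a", fromOuter = true))))))

    println(aggregateOnly(predicate).map(show)) // List(sum(a))
    println(widest(predicate).map(show))        // List(cos(sum(a)))
  }
}
```

In this toy model the wider reference matches an expression the outer query already computes, which is the intuition for why no additional aggregate has to be introduced.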
   
   
   ### Why are the changes needed?
   To avoid adding an unnecessary aggregate to the outer query, which reduces the 
number of expressions to analyze and clone and avoids adding an extra Project 
node.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Ran the pre-checkin tests and added new tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

