[GitHub] [spark] cloud-fan commented on pull request #42223: [SPARK-44571][SQL] Eliminate the Join by combine multiple Aggregates

via GitHub Fri, 04 Aug 2023 02:14:48 -0700


cloud-fan commented on PR #42223:
URL: https://github.com/apache/spark/pull/42223#issuecomment-1665295635


   @peter-toth I agree that the extra project can help if we decide to merge. 
However, the plan pattern becomes more complicated. Without the extra project, 
the merged aggregate is still `Aggregate -> Filter -> Scan`, so we can define a 
single rule for merging two aggregates and apply it incrementally to merge all 
joined aggregates. With the extra project, we need to define how to merge 
`Aggregate -> Filter -> Project -> Scan` + `Aggregate -> Filter -> Project -> Scan`, 
or `Aggregate -> Filter -> Project -> Scan` + `Aggregate -> Filter -> Scan`.
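
   To illustrate why keeping the `Aggregate -> Filter -> Scan` shape allows 
incremental merging, here is a minimal toy model in Python. It is not Catalyst 
code: `FilteredAgg`, `leaf`, `merge` and `to_sql` are hypothetical names, and it 
assumes global (ungrouped) aggregates over the same scan. The point is only 
that the merge result has the same shape as its inputs, so one two-way rule can 
be folded over any number of joined aggregates.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class FilteredAgg:
    """Toy stand-in for `Aggregate -> Filter -> Scan` (not a Catalyst class)."""
    table: str         # the Scan
    predicates: tuple  # Filter conditions, OR-ed together
    aggs: tuple        # (function, column, per-aggregate filter) triples


def leaf(table, predicate, func, column):
    # An unmerged aggregate: its single function is guarded by its own filter.
    return FilteredAgg(table, (predicate,), ((func, column, predicate),))


def merge(left, right):
    # Two-way merge rule. The result is again a FilteredAgg, so it can be
    # merged with a third aggregate, a fourth, and so on.
    if left.table != right.table:
        raise ValueError("only aggregates over the same scan can be merged")
    # Deduplicate predicates while preserving their order.
    predicates = tuple(dict.fromkeys(left.predicates + right.predicates))
    return FilteredAgg(left.table, predicates, left.aggs + right.aggs)


def to_sql(plan):
    # Render the toy plan as SQL using the aggregate FILTER clause.
    select = ", ".join(f"{f}({c}) FILTER (WHERE {p})" for f, c, p in plan.aggs)
    where = " OR ".join(f"({p})" for p in plan.predicates)
    return f"SELECT {select} FROM {plan.table} WHERE {where}"


plans = [
    leaf("t", "a > 1", "sum", "x"),
    leaf("t", "b < 5", "avg", "y"),
    leaf("t", "a > 1", "max", "z"),
]
merged = reduce(merge, plans)  # incremental, pairwise merging
print(to_sql(merged))
```

   With an extra `Project` in some inputs, `merge` would need separate cases for 
each combination of shapes, which is the complication described above.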
   
   I think a few extra boolean columns won't increase the shuffle size too 
much. When we have a partial aggregate logical plan, we can strip the added 
boolean columns right after the partial aggregate but before the final 
aggregate, because the aggregate function filter is only evaluated in the 
partial aggregate.
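
   As a rough sketch of that last point, the toy Python below models a two-phase 
aggregate by hand. It is not Spark code: the column names `f1` and `f2` (the 
added boolean columns) and both function names are assumptions made for 
illustration. The filter flags are read only while building the partial 
buffers, so the data handed to the final phase does not need to carry them.

```python
def partial_aggregate(rows):
    # Partial phase: the only place where the boolean filter columns are read.
    buf = {"sum_x": 0, "cnt_y": 0}
    for r in rows:
        if r["f1"]:           # FILTER for sum(x)
            buf["sum_x"] += r["x"]
        if r["f2"]:           # FILTER for count(*)
            buf["cnt_y"] += 1
    return buf                # the buffer carries no f1/f2 columns


def final_aggregate(buffers):
    # Final phase: merges buffers only and never looks at the filter flags.
    return {
        "sum_x": sum(b["sum_x"] for b in buffers),
        "cnt_y": sum(b["cnt_y"] for b in buffers),
    }


partitions = [
    [{"x": 1, "f1": True, "f2": False}, {"x": 2, "f1": False, "f2": True}],
    [{"x": 3, "f1": True, "f2": True}],
]
shuffled = [partial_aggregate(p) for p in partitions]  # what crosses the shuffle
result = final_aggregate(shuffled)
print(result)
```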


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

