[ https://issues.apache.org/jira/browse/SPARK-27229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Song Jun updated SPARK-27229:
-----------------------------
    Priority: Minor  (was: Major)

> GroupBy Placement in Intersect Distinct
> ---------------------------------------
>
>                 Key: SPARK-27229
>                 URL: https://issues.apache.org/jira/browse/SPARK-27229
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Song Jun
>            Priority: Minor
>
> The Intersect operator is replaced by a Left Semi Join in the Optimizer.
> For example:
> SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
> ==> SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
> If Tab1 and Tab2 are large, the join will be very slow. We can reduce the
> data before the join by placing a GROUP BY operator under each side of the
> join, that is:
> ==>
> SELECT a1, a2 FROM
>   (SELECT a1,a2 FROM Tab1 GROUP BY a1,a2) X
>   LEFT SEMI JOIN
>   (SELECT b1,b2 FROM Tab2 GROUP BY b1,b2) Y
> ON X.a1<=>Y.b1 AND X.a2<=>Y.b2
> The join then processes fewer rows, because each GROUP BY has already
> removed the duplicate rows from its input.
>
> A PR will be submitted soon.
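
To illustrate why the proposed rewrite should return the same rows as the current one, here is a minimal sketch in plain Python. It is not Spark code and does not reflect how the Optimizer rule is implemented. The function names and the sample rows are hypothetical, rows are modelled as tuples, and None stands in for SQL NULL (Python treats None == None as true, which matches the null-safe <=> comparison in the join condition).

```python
# Plain-Python model of the two rewrites; names and data are illustrative only.

def intersect_distinct(left, right):
    """Reference semantics of INTERSECT: distinct rows present in both inputs."""
    return set(left) & set(right)

def left_semi_join(left, right):
    """Keep each left row that has at least one null-safe-equal match on the right."""
    right_rows = set(right)
    return [row for row in left if row in right_rows]

def rewrite_current(left, right):
    """Models: SELECT DISTINCT ... FROM left LEFT SEMI JOIN right."""
    return set(left_semi_join(left, right))

def rewrite_proposed(left, right):
    """Models: deduplicate both sides first (the GROUP BY), then semi join."""
    x = list(dict.fromkeys(left))   # GROUP BY a1, a2
    y = list(dict.fromkeys(right))  # GROUP BY b1, b2
    return left_semi_join(x, y)

# Hypothetical sample data with duplicates and a NULL column value.
tab1 = [(1, "a"), (1, "a"), (2, None), (2, None), (3, "c")]
tab2 = [(1, "a"), (2, None), (2, None), (4, "d")]

expected = intersect_distinct(tab1, tab2)
current = rewrite_current(tab1, tab2)
proposed = rewrite_proposed(tab1, tab2)
```

In this model the proposed form joins 3 rows against 3 rows instead of 5 against 4, and its output contains no duplicates, so no DISTINCT is needed above the join. Whether this is faster in Spark depends on how many duplicates the inputs contain, since the two GROUP BY operators have a cost of their own.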