[jira] [Commented] (SPARK-27229) GroupBy Placement in Intersect Distinct
[ https://issues.apache.org/jira/browse/SPARK-27229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006748#comment-17006748 ]

Takeshi Yamamuro commented on SPARK-27229:
------------------------------------------

I'll close this for now because the corresponding PR is stale. Please reopen it if necessary.

> GroupBy Placement in Intersect Distinct
> ---------------------------------------
>
>                 Key: SPARK-27229
>                 URL: https://issues.apache.org/jira/browse/SPARK-27229
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Song Jun
>            Priority: Major
>
> The Intersect operator is replaced by a Left Semi Join in the optimizer.
> For example:
> SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
> ==> SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
> If Tab1 and Tab2 are very large, the join will be very slow. We can reduce
> the data on both sides before the join by placing a GROUP BY operator under
> the join, that is:
> ==>
> SELECT a1, a2 FROM
>    (SELECT a1,a2 FROM Tab1 GROUP BY a1,a2) X
>    LEFT SEMI JOIN
>    (SELECT b1,b2 FROM Tab2 GROUP BY b1,b2) Y
> ON X.a1<=>Y.b1 AND X.a2<=>Y.b2
> The join then runs on smaller inputs, because the GROUP BY has already cut
> much of the data (the duplicate rows).
>
> A PR will be submitted soon.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
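The rewrite described in the issue can be checked on a toy dataset. The following is only a sketch under stated assumptions, not Spark's optimizer rule: it uses Python's built-in sqlite3 module instead of Spark, writes the LEFT SEMI JOIN as a correlated EXISTS (SQLite has no semi-join syntax), and uses SQLite's IS operator as a stand-in for Spark's null-safe <=>. The sample rows and the helper name run are made up for illustration.

```python
import sqlite3


def run(conn, sql):
    """Run a query and return its rows in a stable order (NULL-safe sort)."""
    return sorted(conn.execute(sql).fetchall(), key=repr)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Tab1 (a1, a2);
    CREATE TABLE Tab2 (b1, b2);
    -- Illustrative rows, including duplicates and NULLs.
    INSERT INTO Tab1 VALUES (1, 'x'), (1, 'x'), (2, 'y'), (NULL, 'z'), (3, 'w');
    INSERT INTO Tab2 VALUES (1, 'x'), (NULL, 'z'), (NULL, 'z'), (4, 'v');
""")

# Original query from the issue.
intersect = "SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2"

# Current rewrite: DISTINCT over a semi join (EXISTS stands in for
# LEFT SEMI JOIN, IS stands in for <=>).
semi_join = """
    SELECT DISTINCT a1, a2 FROM Tab1
    WHERE EXISTS (SELECT 1 FROM Tab2 WHERE a1 IS b1 AND a2 IS b2)
"""

# Proposed rewrite: GROUP BY placed under both sides of the semi join.
grouped = """
    SELECT a1, a2 FROM (SELECT a1, a2 FROM Tab1 GROUP BY a1, a2) X
    WHERE EXISTS (
        SELECT 1 FROM (SELECT b1, b2 FROM Tab2 GROUP BY b1, b2) Y
        WHERE X.a1 IS Y.b1 AND X.a2 IS Y.b2)
"""

results = [run(conn, q) for q in (intersect, semi_join, grouped)]
print(results[0] == results[1] == results[2])
```

In this sketch the three forms return the same rows; whether the grouped form is actually faster in Spark depends on how many duplicates each side holds, which this example does not measure.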
[jira] [Commented] (SPARK-27229) GroupBy Placement in Intersect Distinct
[ https://issues.apache.org/jira/browse/SPARK-27229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16798748#comment-16798748 ]

Song Jun commented on SPARK-27229:
----------------------------------

Thanks
[jira] [Commented] (SPARK-27229) GroupBy Placement in Intersect Distinct
[ https://issues.apache.org/jira/browse/SPARK-27229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16798681#comment-16798681 ]

Hyukjin Kwon commented on SPARK-27229:
--------------------------------------

Please avoid setting the priority to Critical or above, which is usually reserved for committers.