[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465000#comment-16465000 ]
Liang-Chi Hsieh commented on SPARK-21274: ----------------------------------------- I read the design doc. It looks correct to me. I found a rewrite rule in Presto for INTERSECT that seems more simple to me at [https://github.com/prestodb/presto/issues/4918#issuecomment-207106688.] That rule can be used to do INTERSECT ALL and EXCEPT ALL, if I don't miss anything. For example, to do INTERSECT ALL like: {{SELECT a FROM foo INTERSECT ALL SELECT x FROM bar}} {{We can rewrite it as:}} {code:java} SELECT a FROM ( SELECT replicate_row(min_count, a) AS (min_count, a) FROM ( SELECT a, COUNT(foo_marker) AS foo_cnt, COUNT(bar_marker) AS bar_cnt, IF(COUNT(foo_marker) > COUNT(bar_marker), COUNT(bar_marker), COUNT(foo_marker)) AS min_count FROM ( SELECT a, true as foo_marker, null as bar_marker FROM foo UNION ALL SELECT x, null as foo_marker, true as bar_marker FROM bar ) T1 GROUP BY a) T2 WHERE foo_cnt >= 1 AND bar_cnt >= 1 ) {code} One advantage of that rewrite rule is the rules of INTERSECT ALL and EXCEPT ALL are more similar to each other. Another one is for INTERSECT ALL, it only needs one GROUP BY instead of three GROUP BY in current design. WDYT? > Implement EXCEPT ALL and INTERSECT ALL > -------------------------------------- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.0.0, 2.1.0, 2.2.0 > Reporter: Ruslan Dautkhanov > Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROM tab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROM tab1 t1 > WHERE > NOT EXISTS > ( SELECT 1 > FROM t1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) > ) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org