[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16107749#comment-16107749 ]
Ruslan Dautkhanov commented on SPARK-21274: ------------------------------------------- [~viirya], you're right. I've checked now on both PostgreSQL and Oracle. Sorry for the confusion. So the more generic rule is, [from Oracle documentation|https://docs.oracle.com/database/121/SQLRF/operators006.htm#sthref913]: {quote} For example, if a particular value occurs *m* times in nested_table1 and *n* times in nested_table2, then the result would contain the element *min(m,n)* times. ALL is the default. {quote} It's interesting that Oracle doesn't support "intersect all" on simple table sets, but only on nested table sets (through "multisets"): {code} CREATE TYPE sets_test_typ AS object ( num number ); CREATE TYPE sets_test_tab_typ AS TABLE OF sets_test_typ; with tab1 as ( select 1 as z from dual union all select 2 from dual union all select 2 from dual union all select 2 from dual union all select 2 from dual ) , tab2 as ( select 1 as z from dual union all select 2 from dual union all select 2 from dual union all select 2 from dual ) SELECT * FROM table( cast(multiset(select z from tab1) as sets_test_tab_typ) multiset intersect ALL cast(multiset(select z from tab2) as sets_test_tab_typ) ) ; {code} So Oracle has returned "2" three times = min(3,4). Same test case in PostgreSQL: {code} scm=> with tab1 as ( select 1 as z scm(> union all select 2 scm(> union all select 2 scm(> union all select 2 scm(> union all select 2 scm(> ) scm-> , tab2 as ( scm(> select 1 as z scm(> union all select 2 scm(> union all select 2 scm(> union all select 2 scm(> ) scm-> SELECT z FROM tab1 scm-> INTERSECT all scm-> SELECT z FROM tab2 scm-> ; z --- 1 2 2 2 (4 rows) {code} The bottom line is that you're right, the above approach wouldn't work as you noticed. I still believe though it might be easier to implement except all / intersect all through query rewrite. For example, run group by on both sets, on target list of columns, do full outer join, find min between aggregate counts, and inject rows to final result set according that that min(n,m) count. > Implement EXCEPT ALL and INTERSECT ALL > -------------------------------------- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: Optimizer, SQL > Affects Versions: 2.0.0, 2.1.0, 2.2.0 > Reporter: Ruslan Dautkhanov > Labels: set, sql > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROM tab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROM tab1 t1 > WHERE > NOT EXISTS > ( SELECT 1 > FROM t1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) > ) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org