[ 
https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16107749#comment-16107749
 ] 

Ruslan Dautkhanov commented on SPARK-21274:
-------------------------------------------

[~viirya], you're right. I've checked now on both PostgreSQL and Oracle. Sorry 
for the confusion.

So the more generic rule is, [from Oracle 
documentation|https://docs.oracle.com/database/121/SQLRF/operators006.htm#sthref913]:
 
{quote}
For example, if a particular value occurs *m* times in nested_table1 and *n* 
times in nested_table2, then the result would contain the element *min(m,n)* 
times. ALL is the default.
{quote}

It's interesting that Oracle doesn't support "intersect all" on simple table 
sets, but only on nested table sets (through "multisets"):

{code}
CREATE TYPE sets_test_typ AS object ( 
    num number );

CREATE TYPE sets_test_tab_typ AS
  TABLE OF sets_test_typ;

with tab1 as (    select 1 as z from dual 
        union all select 2 from dual
        union all select 2 from dual
        union all select 2 from dual
        union all select 2 from dual
        )
   , tab2 as (
                  select 1 as z from dual 
        union all select 2 from dual
        union all select 2 from dual
        union all select 2 from dual
    )
SELECT * 
FROM table(
            cast(multiset(select z from tab1) as sets_test_tab_typ)
            multiset intersect ALL
            cast(multiset(select z from tab2) as sets_test_tab_typ)
          )
;
{code}
So Oracle has returned "2" three times = min(3,4).

Same test case in PostgreSQL:
{code}
scm=> with tab1 as (    select 1 as z
scm(>         union all select 2
scm(>         union all select 2
scm(>         union all select 2
scm(>         union all select 2
scm(>         )
scm->    , tab2 as (
scm(>                   select 1 as z
scm(>         union all select 2
scm(>         union all select 2
scm(>         union all select 2
scm(>     )
scm-> SELECT z FROM tab1
scm->  INTERSECT all
scm-> SELECT z FROM tab2
scm-> ;
 z
---
 1
 2
 2
 2
(4 rows)

{code}

The bottom line is that you're right, the above approach wouldn't work as you 
noticed.

I still believe though it might be easier to implement except all / intersect 
all through query rewrite.
For example, run group by on both sets, on target list of columns, do full 
outer join, find min between aggregate counts, and inject rows to final result 
set according that that min(n,m) count.


> Implement EXCEPT ALL and INTERSECT ALL
> --------------------------------------
>
>                 Key: SPARK-21274
>                 URL: https://issues.apache.org/jira/browse/SPARK-21274
>             Project: Spark
>          Issue Type: New Feature
>          Components: Optimizer, SQL
>    Affects Versions: 2.0.0, 2.1.0, 2.2.0
>            Reporter: Ruslan Dautkhanov
>              Labels: set, sql
>
> 1) *EXCEPT ALL* / MINUS ALL :
> {code}
> SELECT a,b,c FROM tab1
>  EXCEPT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as following outer join:
> {code}
> SELECT a,b,c
> FROM    tab1 t1
>      LEFT OUTER JOIN 
>         tab2 t2
>      ON (
>         (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
>      )
> WHERE
>     COALESCE(t2.a, t2.b, t2.c) IS NULL
> {code}
> (register as a temp.view this second query under "*t1_except_t2_df*" name 
> that can be also used to find INTERSECT ALL below):
> 2) *INTERSECT ALL*:
> {code}
> SELECT a,b,c FROM tab1
>  INTERSECT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as following anti-join using t1_except_t2_df we defined 
> above:
> {code}
> SELECT a,b,c
> FROM    tab1 t1
> WHERE 
>    NOT EXISTS
>    (    SELECT 1
>         FROM    t1_except_t2_df e
>         WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
>    )
> {code}
> So the suggestion is just to use above query rewrites to implement both 
> EXCEPT ALL and INTERSECT ALL sql set operations.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to