N Campbell created SPARK-20545:
----------------------------------

             Summary: union set operator should default to DISTINCT and not ALL 
semantics
                 Key: SPARK-20545
                 URL: https://issues.apache.org/jira/browse/SPARK-20545
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.1.0
            Reporter: N Campbell


A set operation (e.g. UNION) over two queries that produce identical row values 
should return the distinct set of rows, not all rows.

ISO SQL set operation semantics default to DISTINCT.
The Spark implementation defaults to ALL.
While Spark accepts the DISTINCT keyword, and some might assume ALL is faster, 
the result set produced is semantically wrong per the standard (and differs 
from commercial SQL systems including Oracle, DB2, Teradata, SQL Server, etc.).

select tsint.csint from cert.tsint 
union 
select tint.cint from cert.tint 

csint
<null>
-1
0
1
10
<null>
-1
0
1
10


vs

select tsint.csint from cert.tsint union distinct select tint.cint from 
cert.tint 

csint
-1
<null>
1
10
0
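
For comparison, the expected semantics can be checked against another SQL 
engine. The following is a minimal sketch, assuming Python's bundled sqlite3 
module as a stand-in engine (it is not Spark), and assuming each table holds 
the five values shown in the output above. The unqualified table names and the 
column types are illustrative; the real cert.tsint and cert.tint definitions 
are not given in this report.

```python
import sqlite3

# In-memory SQLite database used only to illustrate set-operator semantics.
conn = sqlite3.connect(":memory:")

# Assumed schemas: the report does not show the real table definitions.
conn.execute("CREATE TABLE tsint (csint SMALLINT)")
conn.execute("CREATE TABLE tint (cint INTEGER)")

# Assumed contents: the five values visible in the reported output.
rows = [(None,), (-1,), (0,), (1,), (10,)]
conn.executemany("INSERT INTO tsint VALUES (?)", rows)
conn.executemany("INSERT INTO tint VALUES (?)", rows)

# Plain UNION: duplicates are removed (NULLs compare as equal here).
union_rows = conn.execute(
    "SELECT csint FROM tsint UNION SELECT cint FROM tint"
).fetchall()

# UNION ALL: every row from both inputs is kept.
union_all_rows = conn.execute(
    "SELECT csint FROM tsint UNION ALL SELECT cint FROM tint"
).fetchall()

print(len(union_rows), len(union_all_rows))
```

In this stand-in engine, plain UNION yields 5 rows and UNION ALL yields 10, 
which matches the DISTINCT-by-default behaviour described above.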




