[ https://issues.apache.org/jira/browse/SPARK-20545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15997845#comment-15997845 ]
Xiao Li commented on SPARK-20545: --------------------------------- Please reopen it if you still hit this issue. Thanks! > union set operator should default to DISTINCT and not ALL semantics > ------------------------------------------------------------------- > > Key: SPARK-20545 > URL: https://issues.apache.org/jira/browse/SPARK-20545 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.1.0 > Reporter: N Campbell > > A set operation (i.e union) over two queries that produce identical row > values should return the distinct set of rows and not all rows. > ISO-SQL set operation semantics default to DISTINCT > SPARK implementation is defaulting to ALL > While SPARK allows DISTINCT keyword and some might assume ALL is faster, the > wrong result set semantically is produced per standard (and commercial SQL > systems including: ORACLE, DB2, Teradata, SQL Server etc.) > select tsint.csint from cert.tsint > union > select tint.cint from cert.tint > csint > <null> > -1 > 0 > 1 > 10 > <null> > -1 > 0 > 1 > 10 > vs > select tsint.csint from cert.tsint union distinct select tint.cint from > cert.tint > csint > -1 > <null> > 1 > 10 > 0 -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org