[jira] [Commented] (SPARK-20545) union set operator should default to DISTINCT and not ALL semantics
[ https://issues.apache.org/jira/browse/SPARK-20545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997845#comment-15997845 ] Xiao Li commented on SPARK-20545: - Please reopen it if you still hit this issue. Thanks! > union set operator should default to DISTINCT and not ALL semantics > --- > > Key: SPARK-20545 > URL: https://issues.apache.org/jira/browse/SPARK-20545 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: N Campbell > > A set operation (i.e union) over two queries that produce identical row > values should return the distinct set of rows and not all rows. > ISO-SQL set operation semantics default to DISTINCT > SPARK implementation is defaulting to ALL > While SPARK allows DISTINCT keyword and some might assume ALL is faster, the > wrong result set semantically is produced per standard (and commercial SQL > systems including: ORACLE, DB2, Teradata, SQL Server etc.) > select tsint.csint from cert.tsint > union > select tint.cint from cert.tint > csint > > -1 > 0 > 1 > 10 > > -1 > 0 > 1 > 10 > vs > select tsint.csint from cert.tsint union distinct select tint.cint from > cert.tint > csint > -1 > > 1 > 10 > 0 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20545) union set operator should default to DISTINCT and not ALL semantics
[ https://issues.apache.org/jira/browse/SPARK-20545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997843#comment-15997843 ] Xiao Li commented on SPARK-20545: - You can try {noformat} select 3 as `col` union select 3 as `col` {noformat} It outputs 3. In Spark SQL, if neither ALL nor DISTINCT is used, DISTINCT behavior is the default. > union set operator should default to DISTINCT and not ALL semantics > --- > > Key: SPARK-20545 > URL: https://issues.apache.org/jira/browse/SPARK-20545 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: N Campbell > > A set operation (i.e union) over two queries that produce identical row > values should return the distinct set of rows and not all rows. > ISO-SQL set operation semantics default to DISTINCT > SPARK implementation is defaulting to ALL > While SPARK allows DISTINCT keyword and some might assume ALL is faster, the > wrong result set semantically is produced per standard (and commercial SQL > systems including: ORACLE, DB2, Teradata, SQL Server etc.) > select tsint.csint from cert.tsint > union > select tint.cint from cert.tint > csint > > -1 > 0 > 1 > 10 > > -1 > 0 > 1 > 10 > vs > select tsint.csint from cert.tsint union distinct select tint.cint from > cert.tint > csint > -1 > > 1 > 10 > 0 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org