[ https://issues.apache.org/jira/browse/SPARK-27282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039769#comment-17039769 ]
Hyukjin Kwon commented on SPARK-27282:
--------------------------------------

Spark 2.3.x is EOL. Let's reopen when we can confirm this issue persists in Spark 2.4.x+.

> Spark incorrect results when using UNION with GROUP BY clause
> -------------------------------------------------------------
>
>                 Key: SPARK-27282
>                 URL: https://issues.apache.org/jira/browse/SPARK-27282
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Shell, Spark Submit, SQL
>    Affects Versions: 2.3.2
>         Environment: I'm using:
> IntelliJ IDEA ==> 2018.1.4
> spark-sql and spark-core ==> 2.3.2.3.1.0.0-78 (for HDP 3.1)
> scala ==> 2.11.8
>            Reporter: Sofia
>            Priority: Major
>
> When using a UNION after a GROUP BY clause in Spark, the results obtained are wrong.
> The following example illustrates the issue:
> {code:java}
> CREATE TABLE test_un (
> col1 varchar(255),
> col2 varchar(255),
> col3 varchar(255),
> col4 varchar(255)
> );
> INSERT INTO test_un (col1, col2, col3, col4)
> VALUES (1,1,2,4),
> (1,1,2,4),
> (1,1,3,5),
> (2,2,2,null);
> {code}
> I used the following code:
> {code:java}
> import org.apache.spark.sql.functions.{col, count, lit}
>
> // Load the Hive table test.test_un (reporter's own helper)
> val x = Toolkit.HiveToolkit.getDataFromHive("test","test_un")
> // Count occurrences of each col3 value per (col1, col2)
> val y = x
> .filter(col("col4").isNotNull)
> .groupBy("col1", "col2", "col3")
> .agg(count(col("col3")).alias("cnt"))
> .withColumn("col_name", lit("col3"))
> .select(col("col1"), col("col2"),
> col("col_name"), col("col3").alias("col_value"), col("cnt"))
> // Count occurrences of each col4 value per (col1, col2)
> val z = x
> .filter(col("col4").isNotNull)
> .groupBy("col1", "col2", "col4")
> .agg(count(col("col4")).alias("cnt"))
> .withColumn("col_name", lit("col4"))
> .select(col("col1"), col("col2"),
> col("col_name"), col("col4").alias("col_value"), col("cnt"))
> y.union(z).show()
> {code}
> And I obtained the following results:
> ||col1||col2||col_name||col_value||cnt||
> |1|1|col3|5|1|
> |1|1|col3|4|2|
> |1|1|col4|5|1|
> |1|1|col4|4|2|
> Expected results:
> ||col1||col2||col_name||col_value||cnt||
> |1|1|col3|3|1|
> |1|1|col3|2|2|
> |1|1|col4|4|2|
> |1|1|col4|5|1|
> But when I remove the last row of the table, I obtain the correct results.
> {code:java}
> (2,2,2,null){code}
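
The reproduction quoted above depends on a project-specific helper (Toolkit.HiveToolkit.getDataFromHive) and a Hive table, which makes it hard to rerun on Spark 2.4.x+. The following is a minimal sketch of a self-contained variant, assuming an in-memory DataFrame is an acceptable stand-in for the Hive table. The object name UnionGroupByRepro and the helpers sample, countsFor and unioned are hypothetical names introduced here for illustration; they are not part of the original report. Because the input is not read from Hive, this sketch may not exercise the same code path as the original, so a correct result here would not by itself prove the issue is fixed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, lit}

// Hypothetical stand-alone reproduction for SPARK-27282 (names are illustrative).
object UnionGroupByRepro {

  // Same four rows as the INSERT in the report, as strings (varchar columns);
  // None stands in for the SQL null in col4 of the last row.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val rows: Seq[(String, String, String, Option[String])] = Seq(
      ("1", "1", "2", Some("4")),
      ("1", "1", "2", Some("4")),
      ("1", "1", "3", Some("5")),
      ("2", "2", "2", None)
    )
    rows.toDF("col1", "col2", "col3", "col4")
  }

  // Per (col1, col2), count the occurrences of each value of column `name`,
  // keeping only rows where col4 is not null, as in the report.
  def countsFor(x: DataFrame, name: String): DataFrame =
    x.filter(col("col4").isNotNull)
      .groupBy("col1", "col2", name)
      .agg(count(col(name)).alias("cnt"))
      .withColumn("col_name", lit(name))
      .select(col("col1"), col("col2"), col("col_name"),
        col(name).alias("col_value"), col("cnt"))

  // y.union(z) from the report: col3 counts followed by col4 counts.
  def unioned(x: DataFrame): DataFrame =
    countsFor(x, "col3").union(countsFor(x, "col4"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SPARK-27282-repro")
      .getOrCreate()
    try {
      unioned(sample(spark)).show()
    } finally {
      spark.stop()
    }
  }
}
```

If the col3 rows in the output show col_value 2 and 3 (rather than 4 and 5), the result matches the expected table in the report for this input.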