[ 
https://issues.apache.org/jira/browse/SPARK-27282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039769#comment-17039769
 ] 

Hyukjin Kwon commented on SPARK-27282:
--------------------------------------

Spark 2.3.x is EOL. Let's reopen when we can confirm this issue persists in 
Spark 2.4.x+.

> Spark incorrect results when using UNION with GROUP BY clause
> -------------------------------------------------------------
>
>                 Key: SPARK-27282
>                 URL: https://issues.apache.org/jira/browse/SPARK-27282
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Shell, Spark Submit, SQL
>    Affects Versions: 2.3.2
>         Environment: I'm using:
> IntelliJ  IDEA ==> 2018.1.4
> spark-sql and spark-core ==> 2.3.2.3.1.0.0-78 (for HDP 3.1)
> scala ==> 2.11.8
>            Reporter: Sofia
>            Priority: Major
>
> When using a UNION clause after a GROUP BY clause in Spark, the results 
> obtained are wrong.
> The following example illustrates this issue:
> {code:java}
> CREATE TABLE test_un (
> col1 varchar(255),
> col2 varchar(255),
> col3 varchar(255),
> col4 varchar(255)
> );
> INSERT INTO test_un (col1, col2, col3, col4)
> VALUES (1,1,2,4),
> (1,1,2,4),
> (1,1,3,5),
> (2,2,2,null);
> {code}
> I used the following code:
> {code:java}
> import org.apache.spark.sql.functions.{col, count, lit}
> // Toolkit.HiveToolkit.getDataFromHive appears to be the reporter's own helper (not a Spark API).
> val x = Toolkit.HiveToolkit.getDataFromHive("test", "test_un")
> // Count each col3 value among the rows whose col4 is not null.
> val y = x
>   .filter(col("col4").isNotNull)
>   .groupBy("col1", "col2", "col3")
>   .agg(count(col("col3")).alias("cnt"))
>   .withColumn("col_name", lit("col3"))
>   .select(col("col1"), col("col2"), col("col_name"), col("col3").alias("col_value"), col("cnt"))
> // Same aggregation, keyed on col4 instead of col3.
> val z = x
>   .filter(col("col4").isNotNull)
>   .groupBy("col1", "col2", "col4")
>   .agg(count(col("col4")).alias("cnt"))
>   .withColumn("col_name", lit("col4"))
>   .select(col("col1"), col("col2"), col("col_name"), col("col4").alias("col_value"), col("cnt"))
> y.union(z).show()
> {code}
> And I obtained the following results:
> ||col1||col2||col_name||col_value||cnt||
> |1|1|col3|5|1|
> |1|1|col3|4|2|
> |1|1|col4|5|1|
> |1|1|col4|4|2|
> Expected results:
> ||col1||col2||col_name||col_value||cnt||
> |1|1|col3|3|1|
> |1|1|col3|2|2|
> |1|1|col4|4|2|
> |1|1|col4|5|1|
> But when I remove the last row of the table, I obtain the correct results.
> {code:java}
> (2,2,2,null){code}
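
For anyone re-testing this on a newer release, the intended semantics of the reported query can be checked without Spark. The following is only a sketch in plain Scala collections, assuming the four sample rows from the report; `Row`, `ExpectedUnion` and `aggregate` are hypothetical names introduced for illustration and are not part of Spark or of the reporter's code.

```scala
// Plain-Scala model of the reported query; no Spark involved.
// Row, ExpectedUnion and aggregate are hypothetical names used only in this sketch.
final case class Row(col1: String, col2: String, col3: String, col4: Option[String])

object ExpectedUnion {
  // Sample rows from the INSERT statement in the report (null modelled as None).
  val rows: Seq[Row] = Seq(
    Row("1", "1", "2", Some("4")),
    Row("1", "1", "2", Some("4")),
    Row("1", "1", "3", Some("5")),
    Row("2", "2", "2", None)
  )

  // Mirrors: filter(col4 is not null) -> groupBy(col1, col2, <column>) -> count,
  // emitting (col1, col2, col_name, col_value, cnt), sorted by col_value.
  def aggregate(name: String, value: Row => String): Seq[(String, String, String, String, Int)] =
    rows
      .filter(_.col4.isDefined)
      .groupBy(r => (r.col1, r.col2, value(r)))
      .map { case ((c1, c2, v), group) => (c1, c2, name, v, group.size) }
      .toSeq
      .sortBy(t => (t._3, t._4))

  // Equivalent of y.union(z): the col3 branch followed by the col4 branch.
  val result: Seq[(String, String, String, String, Int)] =
    aggregate("col3", _.col3) ++ aggregate("col4", _.col4.getOrElse(""))

  def main(args: Array[String]): Unit =
    result.foreach(println)
}
```

Up to row order, `ExpectedUnion.result` holds the same four rows as the "Expected results" table in the report, so it can serve as a reference when comparing against the output of `y.union(z)` on Spark 2.4.x or later.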



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
