[ https://issues.apache.org/jira/browse/SPARK-23614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16480438#comment-16480438 ]
Yu-Jhe Li commented on SPARK-23614:
-----------------------------------

Does this bug occur only when (1) the DataFrame is cached and (2) an aggregation is involved?

> Union produces incorrect results when caching is used
> -----------------------------------------------------
>
>                 Key: SPARK-23614
>                 URL: https://issues.apache.org/jira/browse/SPARK-23614
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Morten Hornbech
>            Assignee: Liang-Chi Hsieh
>            Priority: Major
>              Labels: correctness
>             Fix For: 2.3.1, 2.4.0
>
>
> We just upgraded from 2.2 to 2.3 and our test suite caught this error:
> {code:scala}
> // `session` is an existing SparkSession
> import org.apache.spark.sql.functions.{col, min}
> import session.implicits._
>
> case class TestData(x: Int, y: Int, z: Int)
> val frame = session.createDataset(Seq(TestData(1, 2, 3), TestData(4, 5, 6))).cache()
> val group1 = frame.groupBy("x").agg(min(col("y")) as "value")
> val group2 = frame.groupBy("x").agg(min(col("z")) as "value")
> group1.union(group2).show()
> // +---+-----+
> // |  x|value|
> // +---+-----+
> // |  1|    2|
> // |  4|    5|
> // |  1|    2|
> // |  4|    5|
> // +---+-----+
> group2.union(group1).show()
> // +---+-----+
> // |  x|value|
> // +---+-----+
> // |  1|    3|
> // |  4|    6|
> // |  1|    3|
> // |  4|    6|
> // +---+-----+
> {code}
> The error disappears if the first data frame is not cached or if the two
> group-bys use separate copies. I'm not sure exactly what happens inside
> Spark, but errors that produce incorrect results rather than exceptions
> always concern me.
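
To help narrow down the question above, here is a minimal sketch of the two variants that the description says avoid the problem: no cache at all, and a separate copy for each group-by. Only the TestData case class and the groupBy/agg/union calls come from the report. The object name UnionCacheCheck, the local master setting, and the app name are hypothetical choices for illustration, and this has not been run against 2.3.0.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, min}

case class TestData(x: Int, y: Int, z: Int)

// Hypothetical driver object; the name and the local master are assumptions.
object UnionCacheCheck {
  def main(args: Array[String]): Unit = {
    val session = SparkSession.builder()
      .master("local[*]")
      .appName("SPARK-23614-check")
      .getOrCreate()
    import session.implicits._

    val rows = Seq(TestData(1, 2, 3), TestData(4, 5, 6))

    // Variant 1: same dataset for both group-bys, but without cache().
    val plain = session.createDataset(rows)
    val plainMinY = plain.groupBy("x").agg(min(col("y")) as "value")
    val plainMinZ = plain.groupBy("x").agg(min(col("z")) as "value")
    plainMinY.union(plainMinZ).show()

    // Variant 2: each group-by reads from its own cached copy.
    val copy1 = session.createDataset(rows).cache()
    val copy2 = session.createDataset(rows).cache()
    val copyMinY = copy1.groupBy("x").agg(min(col("y")) as "value")
    val copyMinZ = copy2.groupBy("x").agg(min(col("z")) as "value")
    copyMinY.union(copyMinZ).show()

    session.stop()
  }
}
{code}

If both variants show min(y) in one half of the union and min(z) in the other, while the original snippet does not, that would suggest the problem needs a shared cached dataset feeding both aggregations.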