[ https://issues.apache.org/jira/browse/SPARK-19954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925310#comment-15925310 ]
Hyukjin Kwon commented on SPARK-19954: -------------------------------------- Is this a blocker BTW? {quote} pointless to release without this change as the release would be unusable to a large minority of users {quote} > Joining to a unioned DataFrame does not produce expected result. > ---------------------------------------------------------------- > > Key: SPARK-19954 > URL: https://issues.apache.org/jira/browse/SPARK-19954 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.1.0 > Reporter: Arun Allamsetty > Priority: Blocker > > I found this bug while trying to update from Spark 1.6.1 to 2.1.0. The bug is > that when we try to join two DataFrames, one of which is a result of a union > operation, the result of the join results in data as if the table was joined > only to the first table in the union. This issue is not present in Spark > 2.0.0 or 2.0.1 or 2.0.2, only in 2.1.0. Here's how to reproduce it. > {noformat} > import spark.implicits._ > import org.apache.spark.sql.functions.lit > case class A(id: Long, colA: Boolean) > case class B(id: Long, colB: Int) > case class C(id: Long, colC: Double) > case class X(id: Long, name: String) > val aData = A(1, true) :: Nil > val bData = B(2, 10) :: Nil > val cData = C(3, 9.73D) :: Nil > val xData = X(1, "a") :: X(2, "b") :: X(3, "c") :: Nil > val aDf = spark.createDataset(aData).toDF > val bDf = spark.createDataset(bData).toDF > val cDf = spark.createDataset(cData).toDF > val xDf = spark.createDataset(xData).toDF > val unionDf = > aDf.select($"id", lit("a").as("name"), $"colA", lit(null).as("colB"), > lit(null).as("colC")).union( > bDf.select($"id", lit("b").as("name"), lit(null).as("colA"), $"colB", > lit(null).as("colC"))).union( > cDf.select($"id", lit("c").as("name"), lit(null).as("colA"), > lit(null).as("colB"), $"colC")) > val result = xDf.join(unionDf, unionDf("name") === xDf("name") && > unionDf("id") === xDf("id")) > result.show > {noformat} > The result being > {noformat} > +---+----+---+----+----+----+----+ > | id|name| id|name|colA|colB|colC| > +---+----+---+----+----+----+----+ > | 1| a| 1| a|true|null|null| > +---+----+---+----+----+----+----+ > {noformat} > Force computing {{unionDf}} using {{count}} does not help change the result > of the join. However, writing the data to disk and reading it back does give > the correct result. But it is definitely not ideal. Interestingly caching the > {{unionDf}} also gives the correct result. > {noformat} > +---+----+---+----+----+----+----+ > | id|name| id|name|colA|colB|colC| > +---+----+---+----+----+----+----+ > | 1| a| 1| a|true|null|null| > | 2| b| 2| b|null| 10|null| > | 3| c| 3| c|null|null|9.73| > +---+----+---+----+----+----+----+ > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org