[ 
https://issues.apache.org/jira/browse/SPARK-19954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925310#comment-15925310
 ] 

Hyukjin Kwon commented on SPARK-19954:
--------------------------------------

Is this a blocker BTW?

{quote}
pointless to release without this change as the release would be unusable to a 
large minority of users
{quote}

> Joining to a unioned DataFrame does not produce expected result.
> ----------------------------------------------------------------
>
>                 Key: SPARK-19954
>                 URL: https://issues.apache.org/jira/browse/SPARK-19954
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Arun Allamsetty
>            Priority: Blocker
>
> I found this bug while trying to update from Spark 1.6.1 to 2.1.0. The bug is 
> that when we try to join two DataFrames, one of which is a result of a union 
> operation, the result of the join results in data as if the table was joined 
> only to the first table in the union. This issue is not present in Spark 
> 2.0.0 or 2.0.1 or 2.0.2, only in 2.1.0. Here's how to reproduce it.
> {noformat}
> import spark.implicits._
> import org.apache.spark.sql.functions.lit
> case class A(id: Long, colA: Boolean)
> case class B(id: Long, colB: Int)
> case class C(id: Long, colC: Double)
> case class X(id: Long, name: String)
> val aData = A(1, true) :: Nil
> val bData = B(2, 10) :: Nil
> val cData = C(3, 9.73D) :: Nil
> val xData = X(1, "a") :: X(2, "b") :: X(3, "c") :: Nil
> val aDf = spark.createDataset(aData).toDF
> val bDf = spark.createDataset(bData).toDF
> val cDf = spark.createDataset(cData).toDF
> val xDf = spark.createDataset(xData).toDF
> val unionDf =
>   aDf.select($"id", lit("a").as("name"), $"colA", lit(null).as("colB"), 
> lit(null).as("colC")).union(
>   bDf.select($"id", lit("b").as("name"), lit(null).as("colA"), $"colB", 
> lit(null).as("colC"))).union(
>   cDf.select($"id", lit("c").as("name"), lit(null).as("colA"), 
> lit(null).as("colB"), $"colC"))
> val result = xDf.join(unionDf, unionDf("name") === xDf("name") && 
> unionDf("id") === xDf("id"))
> result.show
> {noformat}
> The result being
> {noformat}
> +---+----+---+----+----+----+----+
> | id|name| id|name|colA|colB|colC|
> +---+----+---+----+----+----+----+
> |  1|   a|  1|   a|true|null|null|
> +---+----+---+----+----+----+----+
> {noformat}
> Force computing {{unionDf}} using {{count}} does not help change the result 
> of the join. However, writing the data to disk and reading it back does give 
> the correct result. But it is definitely not ideal. Interestingly caching the 
> {{unionDf}} also gives the correct result.
> {noformat}
> +---+----+---+----+----+----+----+
> | id|name| id|name|colA|colB|colC|
> +---+----+---+----+----+----+----+
> |  1|   a|  1|   a|true|null|null|
> |  2|   b|  2|   b|null|  10|null|
> |  3|   c|  3|   c|null|null|9.73|
> +---+----+---+----+----+----+----+
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to