[ 
https://issues.apache.org/jira/browse/SPARK-27855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849016#comment-16849016
 ] 

Liang-Chi Hsieh commented on SPARK-27855:
-----------------------------------------

Note that the printed schemas of the two Datasets differ: the columns are in 
a different order. Dataset.union resolves columns by position, not by name. 
This is documented in the API doc.

If you want to resolve columns by name, please use the Dataset.unionByName API.
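
A minimal sketch of that suggestion, reusing the Entity case class and the two 
inputs from the report quoted below. It is written for spark-shell or a Scala 
script; the local SparkSession settings and the name `merged` are assumptions 
made for illustration and are not part of the original report.

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("union-by-name-sketch")
  .getOrCreate()
import spark.implicits._

case class Entity(key: Int, a: Int, b: String)

// Same inputs as in the report: df2 lists its columns in a different order.
val df1 = Seq((2, 2, "2")).toDF("key", "a", "b").as[Entity]
val df2 = Seq((1, "1", 1)).toDF("key", "b", "a").as[Entity]

// unionByName matches columns by name, so the differing order is harmless.
val merged = df1.unionByName(df2)
merged.printSchema()
merged.show()
```

Reordering the columns of one side before the conversion, for example with 
df2.select("key", "a", "b").as[Entity], should also let a plain union work, 
since both sides then agree by position.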

> Union failed between 2 datasets of the same type converted from different 
> dataframes
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-27855
>                 URL: https://issues.apache.org/jira/browse/SPARK-27855
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.3
>            Reporter: Hao Ren
>            Priority: Major
>
> Two Datasets of the same type, converted from different DataFrames, cannot 
> be unioned. Here is the code to reproduce the problem. It seems `union` only 
> checks the schema of the original DataFrame, even though both Datasets have 
> already been converted to the same Dataset type.
> {code:java}
> case class Entity(key: Int, a: Int, b: String)
> val df1 = Seq((2,2,"2")).toDF("key", "a", "b").as[Entity]
> val df2 = Seq((1,"1",1)).toDF("key", "b", "a").as[Entity]
> df1.printSchema
> df2.printSchema
> df1 union df2
> {code}
> Result
> {code:java}
> defined class Entity
> df1: org.apache.spark.sql.Dataset[Entity] = [key: int, a: int ... 1 more 
> field]
> df2: org.apache.spark.sql.Dataset[Entity] = [key: int, b: string ... 1 more 
> field]
> converted
> root
> |-- key: integer (nullable = false)
> |-- a: integer (nullable = false)
> |-- b: string (nullable = true)
> root
> |-- key: integer (nullable = false)
> |-- b: string (nullable = true)
> |-- a: integer (nullable = false)
> org.apache.spark.sql.AnalysisException: Cannot up cast `a` from string to int 
> as it may truncate
> The type path of the target object is:
> - field (class: "scala.Int", name: "a")
> - root class: "Entity"{code}
> The problem is that two Datasets of the same type can have different schemas.
> The schema of a Dataset does not preserve the order of the fields in the 
> case class definition; it keeps the column order of the original DataFrame.


