[ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948800#comment-16948800
 ] 

Mukul Murthy commented on SPARK-29358:
--------------------------------------

That would be a start toward not having to do #1, but #2 is still annoying. 
Serializing and deserializing the data just to merge schemas is clunky, and 
transforming each DataFrame's schema is annoying enough even when you don't 
have StructTypes and nested columns. When you add those into the picture, it 
gets even messier.
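
To illustrate the schema-transforming step for the simple case, here is a 
minimal sketch that handles top-level columns only. The names 
withMissingColumns and unionByNameWithNulls are hypothetical helpers made up 
for this example, not Spark APIs, and the sketch assumes case-sensitive column 
names and a local SparkSession:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

object UnionFillNulls {
  // Hypothetical helper (not a Spark API): add every top-level column that
  // `other` has and `df` lacks, as a null cast to that column's type.
  // Assumption: names are compared case-sensitively.
  def withMissingColumns(df: DataFrame, other: DataFrame): DataFrame = {
    val present = df.columns.toSet
    other.schema.fields
      .filterNot(f => present.contains(f.name))
      .foldLeft(df)((acc, f) => acc.withColumn(f.name, lit(null).cast(f.dataType)))
  }

  // Pad both sides so they share one set of top-level columns, then union by name.
  def unionByNameWithNulls(left: DataFrame, right: DataFrame): DataFrame =
    withMissingColumns(left, right).unionByName(withMissingColumns(right, left))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("union-fill-nulls")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq(1, 2, 3).toDF("x")
    val df2 = Seq("a", "b", "c").toDF("y")
    unionByNameWithNulls(df1, df2).show()

    spark.stop()
  }
}
```

This does nothing for fields that are missing inside a StructType: two struct 
columns with the same name but different nested fields would still have to be 
rebuilt field by field, which is where it gets messy.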

 

 

> Make unionByName optionally fill missing columns with nulls
> -----------------------------------------------------------
>
>                 Key: SPARK-29358
>                 URL: https://issues.apache.org/jira/browse/SPARK-29358
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Mukul Murthy
>            Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +----+----+
> |   x|   y|
> +----+----+
> |   1|null|
> |   2|null|
> |   3|null|
> |null|   a|
> |null|   b|
> |null|   c|
> +----+----+
> {code}
> Currently the workaround is to add each missing column as a null literal 
> with withColumn before calling unionByName, but this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}



