[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

Michael Armbrust (Jira) Tue, 31 Mar 2020 09:52:42 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17071968#comment-17071968
 ]


Michael Armbrust commented on SPARK-29358:
------------------------------------------

I think we should reconsider closing this as won't fix:
 - I think the semantics of this operation make sense. We already get this 
behavior by writing the data out as JSON or Parquet and reading it back. That 
is just a really inefficient way to accomplish the end goal.
 - I don't think it is a problem to move "away from SQL union". This is a 
clearly named, different operation. IMO this one makes *more* sense than SQL 
union. It is much more likely that columns with the same name are semantically 
equivalent than columns at the same ordinal with different names.
 - We are not breaking the behavior of unionByName. Currently it throws an 
exception in these cases. We are making more data transformations possible, but 
anything that was working before will continue to work. You could add a boolean 
flag if you were really concerned, but I think I would skip that.

> Make unionByName optionally fill missing columns with nulls
> -----------------------------------------------------------
>
>                 Key: SPARK-29358
>                 URL: https://issues.apache.org/jira/browse/SPARK-29358
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Mukul Murthy
>            Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
> (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +----+----+
> |   x|   y|
> +----+----+
> |   1|null|
> |   2|null|
> |   3|null|
> |null|   a|
> |null|   b|
> |null|   c|
> +----+----+
> {code}
> Currently, the workaround is to add each missing column as a null column on 
> both sides before calling unionByName, but this is clunky:
> {code:java}
> import org.apache.spark.sql.functions.lit
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}
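
The manual workaround in the issue description can be generalized so that the missing columns do not have to be listed by hand. The following is only a minimal sketch under stated assumptions, not the implementation proposed in this ticket: `UnionFillNulls`, `withMissingColumns` and `unionByNameFillingNulls` are hypothetical names, column names are compared case-sensitively (so it assumes no two columns differ only by case), and each added null column is cast to the type that column has in the other DataFrame.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

// Hypothetical helper object; none of these names are part of the Spark API.
object UnionFillNulls {

  // Add every column that `other` has and `df` lacks, as a typed null column.
  // Assumption: names are compared case-sensitively.
  def withMissingColumns(df: DataFrame, other: DataFrame): DataFrame = {
    val present = df.columns.toSet
    other.schema.fields
      .filterNot(f => present.contains(f.name))
      .foldLeft(df)((acc, f) => acc.withColumn(f.name, lit(null).cast(f.dataType)))
  }

  // Pad both sides so they share the same set of column names, then let the
  // existing unionByName resolve the columns by name.
  def unionByNameFillingNulls(left: DataFrame, right: DataFrame): DataFrame =
    withMissingColumns(left, right).unionByName(withMissingColumns(right, left))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("union-fill-nulls-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same inputs as in the issue description.
    val df1 = Seq(1, 2, 3).toDF("x")
    val df2 = Seq("a", "b", "c").toDF("y")

    unionByNameFillingNulls(df1, df2).show()
    spark.stop()
  }
}
```

Because both inputs are padded before the call, DataFrames that already have the same set of columns pass through unchanged, which matches the point made in the comment that anything working today would keep working.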



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
