[jira] [Updated] (SPARK-38193) [Spark Core] [Feature] change of unionByName parameter

2022-02-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-38193:
-
Component/s: SQL
 (was: Spark Core)

> [Spark Core] [Feature] change of unionByName parameter
> --
>
> Key: SPARK-38193
> URL: https://issues.apache.org/jira/browse/SPARK-38193
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Daniel Davies
>Priority: Minor
>
> Hello,
> I had a quick question about the unionByName function. It currently accepts
> a boolean parameter, "allowMissingColumns", that gives some tolerance when
> merging datasets with different schemas
> [here|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2170].
> The implementation is a bit restrictive, however: with the second parameter
> being a boolean, the only relaxed behavior available at the moment is to
> make unionByName add all columns from both dataframes. We have other use
> cases in our workflows, for example taking only the column names that are in
> both dataframes (and I assume other users will have different merge
> strategies in mind as well). Does it seem reasonable to extend the parameter
> from "allowMissingColumns" to a string-typed "mode" parameter natively in
> Spark? If so, I'm happy to make a PR to achieve this. The change would
> involve amending the
> [ResolveUnion.scala|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUnion.scala]
> utility to make it more flexible in merging columns; to a user it would look
> a lot more like the 'join' operator, where a join strategy is selected.
> I've also posted this question on the dev mailing list; happy to continue
> the conversation there if that is preferable.
>  
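
To make the proposal concrete, here is a minimal pure-Python sketch of how a
"mode" parameter could decide the output columns of a by-name union. It does
not use Spark and is not the ResolveUnion implementation. The mode names
("strict", "outer", "inner") and both function names are hypothetical and
exist nowhere in Spark's API. "strict" and "outer" are meant to mirror
allowMissingColumns=false and allowMissingColumns=true respectively; "inner"
is the "only columns in both dataframes" case from the description.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical mode names for illustration only; Spark defines none of these.
MODES = ("strict", "outer", "inner")


def resolve_union_columns(left: List[str], right: List[str],
                          mode: str = "strict") -> List[str]:
    """Return the output column names of a by-name union under `mode`."""
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}, expected one of {MODES}")
    left_set, right_set = set(left), set(right)
    if mode == "strict":
        # Like allowMissingColumns=false: both sides must share the same names.
        if left_set != right_set:
            raise ValueError("column names differ between the two sides")
        return list(left)
    if mode == "outer":
        # Like allowMissingColumns=true: keep the left order, then append
        # the columns that only the right side has.
        return list(left) + [c for c in right if c not in left_set]
    # "inner": keep only the columns present on both sides, in left order.
    return [c for c in left if c in right_set]


def union_by_name(left_rows: List[Dict[str, Any]],
                  right_rows: List[Dict[str, Any]],
                  left_cols: List[str], right_cols: List[str],
                  mode: str = "strict") -> Tuple[List[str], List[tuple]]:
    """Union two lists of dict rows by column name; missing values become None."""
    cols = resolve_union_columns(left_cols, right_cols, mode)
    rows = [tuple(row.get(c) for c in cols) for row in left_rows + right_rows]
    return cols, rows


left = [{"id": 1, "name": "a"}]
right = [{"id": 2, "age": 30}]

print(union_by_name(left, right, ["id", "name"], ["id", "age"], "outer"))
# (['id', 'name', 'age'], [(1, 'a', None), (2, None, 30)])
print(union_by_name(left, right, ["id", "name"], ["id", "age"], "inner"))
# (['id'], [(1,), (2,)])
```

A string mode keeps the call site readable as strategies are added, much as
the 'join' operator takes a join type, whereas a boolean can only ever
distinguish two behaviors.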



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38193) [Spark Core] [Feature] change of unionByName parameter

2022-02-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-38193:
-
Affects Version/s: 3.3.0
   (was: 3.2.1)

> [Spark Core] [Feature] change of unionByName parameter
> --
>
> Key: SPARK-38193
> URL: https://issues.apache.org/jira/browse/SPARK-38193
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Daniel Davies
>Priority: Minor


