[jira] [Updated] (SPARK-38193) [Spark Core] [Feature] change of unionByName parameter
     [ https://issues.apache.org/jira/browse/SPARK-38193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-38193:
---------------------------------
    Component/s: SQL
                     (was: Spark Core)

> [Spark Core] [Feature] change of unionByName parameter
> ------------------------------------------------------
>
>                 Key: SPARK-38193
>                 URL: https://issues.apache.org/jira/browse/SPARK-38193
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.2.1
>            Reporter: Daniel Davies
>            Priority: Minor
>
> Hello,
> I had a quick question about the unionByName function. This function
> currently accepts a parameter, "allowMissingColumns", that allows some
> tolerance when merging datasets with different schemas
> [here|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2170];
> but the implementation is currently a bit restrictive: because the second
> parameter is a boolean, the only alternative to the default behaviour is to
> make unionByName add all columns from both dataframes. We have other use
> cases in our workflows, for example, taking only the column names that are in
> both dataframes (and I'm assuming that other users will have different merge
> strategies in mind also). Does it seem reasonable to extend the parameter from
> "allowMissingColumns" to a "mode" string-type parameter natively in Spark? If
> so, I'm happy to make a PR to achieve this (the change would involve amending
> the
> [ResolveUnion.scala|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUnion.scala]
> utility to make it more flexible in merging columns; to a user it would look
> a lot more like the 'join' operator, where a join strategy is selected).
> I've posted this question on the dev mailing list also; happy to continue the
> conversation there if that is preferable.
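
To make the proposal concrete, here is a minimal pure-Python sketch of how a string-valued "mode" could select the output columns of a by-name union. It does not use Spark and is not the ResolveUnion implementation. The function names and the mode names "strict", "union" and "intersect" are hypothetical, chosen only for illustration; "strict" and "union" are meant to correspond to allowMissingColumns=false and allowMissingColumns=true, and "intersect" to the "only columns in both dataframes" case described in the issue. Column matching here is case-sensitive, which is an assumption of the sketch.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def resolve_union_columns(left: List[str], right: List[str], mode: str = "strict") -> List[str]:
    """Pick the output columns of a by-name union under a hypothetical `mode`."""
    left_set, right_set = set(left), set(right)
    if mode == "strict":
        # Like allowMissingColumns=false: both sides must have the same column names.
        if left_set != right_set:
            raise ValueError(f"column sets differ: {sorted(left_set ^ right_set)}")
        return list(left)
    if mode == "union":
        # Like allowMissingColumns=true: keep the left columns in order,
        # then append the columns that only the right side has.
        return list(left) + [c for c in right if c not in left_set]
    if mode == "intersect":
        # The use case from the issue: keep only columns present on both sides.
        return [c for c in left if c in right_set]
    raise ValueError(f"unknown mode: {mode!r}")


def union_by_name(left_rows: List[Row], right_rows: List[Row],
                  left_cols: List[str], right_cols: List[str],
                  mode: str = "strict") -> List[Row]:
    """Union two lists of dict rows by column name; missing values become None."""
    cols = resolve_union_columns(left_cols, right_cols, mode)
    return [{c: row.get(c) for c in cols} for row in left_rows + right_rows]


if __name__ == "__main__":
    a = [{"id": 1, "name": "x"}]
    b = [{"id": 2, "score": 0.5}]
    print(union_by_name(a, b, ["id", "name"], ["id", "score"], mode="union"))
    print(union_by_name(a, b, ["id", "name"], ["id", "score"], mode="intersect"))
```

A string parameter keeps the call site close to how 'join' selects a strategy, and leaves room for further modes without adding more boolean flags.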
> -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38193) [Spark Core] [Feature] change of unionByName parameter
     [ https://issues.apache.org/jira/browse/SPARK-38193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-38193:
---------------------------------
    Affects Version/s: 3.3.0
                           (was: 3.2.1)

> [Spark Core] [Feature] change of unionByName parameter
> ------------------------------------------------------
>
>                 Key: SPARK-38193
>                 URL: https://issues.apache.org/jira/browse/SPARK-38193
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Daniel Davies
>            Priority: Minor