[jira] [Commented] (SPARK-19615) Provide Dataset union convenience for divergent schema
    [ https://issues.apache.org/jira/browse/SPARK-19615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15875137#comment-15875137 ]

Nick Dimiduk commented on SPARK-19615:
--------------------------------------

Thanks for taking a look, [~hyukjin.kwon]. These three bugs are indeed issues; in all cases, it seems Spark was not being careful to map column names to the appropriate column from each side of the union. My experience with unions in 1.6.3 and 2.1.0 has been much better. Actually, I still see echoes of SPARK-9874 / SPARK-9813 when I extend one side or the other with null columns. I can file that as a separate issue if that's of interest to you.

As for what an RDBMS may or may not do, I'm not very aware or concerned. I'm thinking more about ease of use for a user. This is why I suggest perhaps a different union method that would encapsulate this behavior. Parsed Spark SQL can exhibit whatever semantics the community deems appropriate, while still giving users of the API access to this convenient functionality.

I've implemented this logic in my application and it's quite complex. It would be very good for Spark to provide this for its users.

> Provide Dataset union convenience for divergent schema
> ------------------------------------------------------
>
>                 Key: SPARK-19615
>                 URL: https://issues.apache.org/jira/browse/SPARK-19615
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Nick Dimiduk
>
> Creating a union DataFrame over two sources that have different schema
> definitions is surprisingly complex. Provide a version of the union method
> that will infer a target schema as the result of merging the
> sources. Automatically extend either side with {{null}} columns for any
> missing columns that are nullable.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19615) Provide Dataset union convenience for divergent schema
    [ https://issues.apache.org/jira/browse/SPARK-19615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873189#comment-15873189 ]

Hyukjin Kwon commented on SPARK-19615:
--------------------------------------

Let me leave some loosely related JIRAs: SPARK-9813, SPARK-9874 and SPARK-15918.
[jira] [Commented] (SPARK-19615) Provide Dataset union convenience for divergent schema
    [ https://issues.apache.org/jira/browse/SPARK-19615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873188#comment-15873188 ]

Hyukjin Kwon commented on SPARK-19615:
--------------------------------------

I remember checking the UNION operation in other DBMSes, and the current behaviour is correct and compliant. Could you check and leave references to other DBMSes that behave differently, please?
[jira] [Commented] (SPARK-19615) Provide Dataset union convenience for divergent schema
    [ https://issues.apache.org/jira/browse/SPARK-19615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870513#comment-15870513 ]

Nick Dimiduk commented on SPARK-19615:
--------------------------------------

IMHO, a union operation should be as generous as possible. This facilitates common ETL and data cleansing operations where the sources are sparse-schema structures (JSON, HBase, Elastic Search, ...). A couple of examples of what I mean.

Given dataframes of type

{noformat}
root
 |-- a: string (nullable = false)
 |-- b: string (nullable = true)
{noformat}

and

{noformat}
root
 |-- a: string (nullable = false)
 |-- c: string (nullable = true)
{noformat}

I would expect the union operation to infer the nullable columns from both sides to produce a dataframe of type

{noformat}
root
 |-- a: string (nullable = false)
 |-- b: string (nullable = true)
 |-- c: string (nullable = true)
{noformat}

This should work on an arbitrarily deep nesting of structs, so

{noformat}
root
 |-- a: string (nullable = false)
 |-- b: struct (nullable = false)
 |    |-- b1: string (nullable = true)
 |    |-- b2: string (nullable = true)
{noformat}

unioned with

{noformat}
root
 |-- a: string (nullable = false)
 |-- b: struct (nullable = false)
 |    |-- b3: string (nullable = true)
 |    |-- b4: string (nullable = true)
{noformat}

would result in

{noformat}
root
 |-- a: string (nullable = false)
 |-- b: struct (nullable = false)
 |    |-- b1: string (nullable = true)
 |    |-- b2: string (nullable = true)
 |    |-- b3: string (nullable = true)
 |    |-- b4: string (nullable = true)
{noformat}

> Provide Dataset union convenience for divergent schema
> ------------------------------------------------------
>
>                 Key: SPARK-19615
>                 URL: https://issues.apache.org/jira/browse/SPARK-19615
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Nick Dimiduk
>            Priority: Minor
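
For readers following the thread, the schema merge requested in the comment above can be modelled in a few lines of plain Python. This is only a sketch under stated assumptions, not Spark code and not the reporter's implementation: a schema is modelled as an ordered dict mapping a column name to a (type, nullable) pair, a struct type is modelled as a nested dict of the same shape, and the name merge_schemas is hypothetical. It also assumes that a column missing from one side may only be filled in when it is nullable, as the issue description states, and that conflicting types are an error.

```python
def merge_schemas(left, right):
    """Merge two schemas modelled as {name: (type, nullable)} dicts.

    `type` is either a type name such as "string" or a nested dict
    describing a struct. This is a hypothetical model, not a Spark API.
    """
    merged = {}
    # Keep the left side's column order, then append right-only columns.
    names = list(left) + [name for name in right if name not in left]
    for name in names:
        if name in left and name in right:
            (ltype, lnull), (rtype, rnull) = left[name], right[name]
            if isinstance(ltype, dict) and isinstance(rtype, dict):
                # Both sides are structs: merge their fields recursively.
                merged_type = merge_schemas(ltype, rtype)
            elif ltype == rtype:
                merged_type = ltype
            else:
                raise ValueError(
                    f"conflicting types for column {name!r}: {ltype} vs {rtype}"
                )
            # The merged column is nullable if either side is nullable.
            merged[name] = (merged_type, lnull or rnull)
        else:
            ftype, nullable = left[name] if name in left else right[name]
            if not nullable:
                # Assumption: only nullable columns can be filled with nulls.
                raise ValueError(
                    f"column {name!r} is missing on one side and not nullable"
                )
            merged[name] = (ftype, nullable)
    return merged


# The nested-struct example from the comment above.
left = {
    "a": ("string", False),
    "b": ({"b1": ("string", True), "b2": ("string", True)}, False),
}
right = {
    "a": ("string", False),
    "b": ({"b3": ("string", True), "b4": ("string", True)}, False),
}
merged = merge_schemas(left, right)
```

In a real application, each side would then still have to be projected onto the merged schema, adding typed null columns for the missing fields, before calling union. That projection step is not shown here.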