[jira] [Commented] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal
[ https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826188#comment-15826188 ]

Arjun commented on SPARK-13337:
-------------------------------

I ran into the same issue for outer join using Spark 2.0. Is this issue also there in other join types?

> DataFrame join-on-columns function should support null-safe equal
> -----------------------------------------------------------------
>
>                 Key: SPARK-13337
>                 URL: https://issues.apache.org/jira/browse/SPARK-13337
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Zhong Wang
>            Priority: Minor
>
> Currently, the join-on-columns function:
> {code}
> def join(right: DataFrame, usingColumns: Seq[String], joinType: String): DataFrame
> {code}
> performs a null-insafe join. It would be great if there is an option for
> null-safe join.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal
[ https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182962#comment-15182962 ]

Takeshi Yamamuro commented on SPARK-13337:
------------------------------------------

Oh, I get your point ;) However, it seems that all the other DataFrame joins preserve the key columns of both input tables. I'm not sure it is okay to drop one side's key columns from the output schema. If it is easy to fix, how about opening a PR and discussing it on GitHub?
[jira] [Commented] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal
[ https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179339#comment-15179339 ]

Zhong Wang commented on SPARK-13337:
------------------------------------

The current join method with the usingColumns argument generates a result like TableC. Its limitation is that it doesn't support null-safe joins.
[jira] [Commented] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal
[ https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175548#comment-15175548 ]

Takeshi Yamamuro commented on SPARK-13337:
------------------------------------------

ISTM an interface that returns TableC directly would be confusing for other users. Are there real, common use cases that would need this interface frequently?
[jira] [Commented] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal
[ https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175204#comment-15175204 ]

Xiao Li commented on SPARK-13337:
---------------------------------

To get your results, try using left outer join + right outer join + union distinct. : )
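The workaround suggested in this comment could look roughly like the sketch below. It is only an illustration, not code from the thread: it assumes Spark 2.x or later (for `SparkSession` and `Dataset.union`), rebuilds the TableA and TableB data from the reporter's example elsewhere in this thread, and uses the made-up names `UnionDistinctSketch`, `tables`, and `unionDistinctJoin`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object UnionDistinctSketch {
  // Sample data from the reporter's example; Option.empty becomes SQL NULL.
  def tables(spark: SparkSession): (DataFrame, DataFrame) = {
    import spark.implicits._
    val tableA = Seq((Option.empty[String], "k1", "v1"), (Option("k2"), "k3", "v2"))
      .toDF("key1", "key2", "value1")
    val tableB = Seq((Option.empty[String], "k1", "v3"), (Option("k4"), "k5", "v4"))
      .toDF("key1", "key2", "value2")
    (tableA, tableB)
  }

  // Left outer join + right outer join + union distinct, with null-safe key equality.
  def unionDistinctJoin(a: DataFrame, b: DataFrame): DataFrame = {
    val cond = a("key1") <=> b("key1") && a("key2") <=> b("key2")
    // Keep the keys of the preserved side in each half, so they are never the null-padded side.
    val leftHalf = a.join(b, cond, "left_outer")
      .select(a("key1"), a("key2"), a("value1"), b("value2"))
    val rightHalf = a.join(b, cond, "right_outer")
      .select(b("key1"), b("key2"), a("value1"), b("value2"))
    // Matched rows appear in both halves; distinct() removes the duplicates.
    leftHalf.union(rightHalf).distinct()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("union-distinct-sketch").getOrCreate()
    val (a, b) = tables(spark)
    unionDistinctJoin(a, b).show()
    spark.stop()
  }
}
```

One caveat with this approach: `distinct()` also collapses rows that were genuine duplicates in the inputs, and the join runs twice, so it may not suit every workload.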
[jira] [Commented] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal
[ https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175154#comment-15175154 ]

Zhong Wang commented on SPARK-13337:
------------------------------------

Suppose we have two tables:

TableA
||key1||key2||value1||
|null|k1|v1|
|k2|k3|v2|

TableB
||key1||key2||value2||
|null|k1|v3|
|k4|k5|v4|

The result table I want is:

TableC
||key1||key2||value1||value2||
|null|k1|v1|v3|
|k2|k3|v2|null|
|k4|k5|null|v4|

We cannot use the current join-using-columns interface, because it doesn't support null-safe joins and we have null values in the first row.

We cannot use join-select with an explicit "<=>" either, because the output table will look like:

||df1.key1||df1.key2||df2.key1||df2.key2||value1||value2||
|null|k1|null|k1|v1|v3|
|k2|k3|null|null|v2|null|
|null|null|k4|k5|null|v4|

It is difficult to get a result like TableC using a select clause, because the null values from the outer join (rows 2 & 3) can be in both the df1.* columns and the df2.* columns.

Hope this makes sense to you. I'd like to submit a PR if this is a real use case.
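The two behaviours described in this comment can be reproduced with a short sketch like the one below. It assumes Spark 2.x or later (for `SparkSession`); the object and method names (`JoinNullBehaviour`, `tables`, `usingColumnsJoin`, `nullSafeJoin`) are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JoinNullBehaviour {
  // TableA and TableB from the comment above; Option.empty becomes SQL NULL.
  def tables(spark: SparkSession): (DataFrame, DataFrame) = {
    import spark.implicits._
    val tableA = Seq((Option.empty[String], "k1", "v1"), (Option("k2"), "k3", "v2"))
      .toDF("key1", "key2", "value1")
    val tableB = Seq((Option.empty[String], "k1", "v3"), (Option("k4"), "k5", "v4"))
      .toDF("key1", "key2", "value2")
    (tableA, tableB)
  }

  // Join-using-columns: a single copy of each key column, but keys are compared
  // with plain equality, so the two (null, k1) rows do not match each other.
  def usingColumnsJoin(a: DataFrame, b: DataFrame): DataFrame =
    a.join(b, Seq("key1", "key2"), "outer")

  // Explicit null-safe equality: the (null, k1) rows match, but the result
  // carries the key columns of both inputs.
  def nullSafeJoin(a: DataFrame, b: DataFrame): DataFrame =
    a.join(b, a("key1") <=> b("key1") && a("key2") <=> b("key2"), "outer")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("join-null-behaviour").getOrCreate()
    val (a, b) = tables(spark)
    usingColumnsJoin(a, b).show() // four rows, four columns
    nullSafeJoin(a, b).show()     // three rows, six columns
    spark.stop()
  }
}
```

With this data, the join-using-columns variant should return four rows instead of the three in TableC, while the `<=>` variant returns three rows but six columns.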
[jira] [Commented] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal
[ https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175097#comment-15175097 ]

Xiao Li commented on SPARK-13337:
---------------------------------

What are the null columns? If you are using full outer joins, all the columns in the join results could be null columns.
[jira] [Commented] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal
[ https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172881#comment-15172881 ]

Zhong Wang commented on SPARK-13337:
------------------------------------

It doesn't help in my case, because it doesn't support null-safe joins. It would be great if there were an interface like:
{code}
def join(right: DataFrame, usingColumns: Seq[String], joinType: String, nullSafe: Boolean): DataFrame
{code}

The current join-using-columns method works great if the joined tables don't contain null values: it automatically eliminates the null columns generated by outer joins. The general join methods in your example support null-safe joins perfectly, but they cannot automatically eliminate the null columns generated by outer joins.

Sorry that it is a little bit complicated here. Please let me know if you need a concrete example.
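Until such an option exists, the proposed behaviour can be approximated in user code. The sketch below is a hypothetical helper, not part of Spark and not the implementation discussed in this issue: `NullSafeJoinSketch.joinUsing` is an invented name that mirrors the proposed signature, and it combines the existing `<=>` operator with `coalesce` to merge each pair of key columns. It assumes Spark 2.x or later for the `SparkSession` in `main`.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.coalesce

object NullSafeJoinSketch {
  // Hypothetical helper mirroring the proposed signature; not a Spark API.
  def joinUsing(left: DataFrame, right: DataFrame, usingColumns: Seq[String],
                joinType: String, nullSafe: Boolean): DataFrame = {
    if (!nullSafe) {
      // Existing behaviour: plain equality on the key columns.
      left.join(right, usingColumns, joinType)
    } else {
      // Null-safe equality on every key column.
      val condition: Column = usingColumns.map(c => left(c) <=> right(c)).reduce(_ && _)
      // One output column per key: take whichever side is not null.
      val keys: Seq[Column] = usingColumns.map(c => coalesce(left(c), right(c)).as(c))
      val leftRest: Seq[Column] =
        left.columns.toSeq.filterNot(c => usingColumns.contains(c)).map(c => left(c))
      val rightRest: Seq[Column] =
        right.columns.toSeq.filterNot(c => usingColumns.contains(c)).map(c => right(c))
      left.join(right, condition, joinType).select(keys ++ leftRest ++ rightRest: _*)
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("null-safe-join-sketch").getOrCreate()
    import spark.implicits._
    val df1 = Seq((Option.empty[String], "k1", "v1"), (Option("k2"), "k3", "v2"))
      .toDF("key1", "key2", "value1")
    val df2 = Seq((Option.empty[String], "k1", "v3"), (Option("k4"), "k5", "v4"))
      .toDF("key1", "key2", "value2")
    joinUsing(df1, df2, Seq("key1", "key2"), "outer", nullSafe = true).show()
    spark.stop()
  }
}
```

The helper only makes sense for inner and outer join types, where columns from both sides are available after the join; semi and anti joins would need separate handling. It also assumes `usingColumns` is non-empty and that the non-key column names of the two inputs do not clash.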
[jira] [Commented] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal
[ https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172796#comment-15172796 ]

Xiao Li commented on SPARK-13337:
---------------------------------

Sorry, I do not get your point. Join-using-columns does not help in your case, right? It just removes the overlapping columns, but it does not filter the values in the results.
[jira] [Commented] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal
[ https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172709#comment-15172709 ]

Zhong Wang commented on SPARK-13337:
------------------------------------

For an outer join, it is difficult to eliminate the null columns from the result. The `join-using-column` interface can automatically eliminate those columns, which is very convenient. Sorry that I missed this point in my last reply.
[jira] [Commented] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal
[ https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171297#comment-15171297 ]

Xiao Li commented on SPARK-13337:
---------------------------------

df1.join(df2, $"df1Key" <=> $"df2Key", "outer").select(xyz)

You can select a set of columns to eliminate the redundant columns.
[jira] [Commented] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal
[ https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15156579#comment-15156579 ]

Zhong Wang commented on SPARK-13337:
------------------------------------

Unfortunately no... I use the join-on-columns function to perform a natural join. It can eliminate the redundant columns in the resulting table, which is required by our use case.
[jira] [Commented] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal
[ https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15152225#comment-15152225 ]

Takeshi Yamamuro commented on SPARK-13337:
------------------------------------------

Is it not enough to use `df1.join(df2, $"df1Key" <=> $"df2Key", "outer")` for your case?