[jira] [Comment Edited] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal
    [ https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172709#comment-15172709 ]

Zhong Wang edited comment on SPARK-13337 at 2/29/16 10:05 PM:
--------------------------------------------------------------

For an outer join, it is difficult to eliminate the null columns from the result, because the null columns can come from both tables. The `join-using-column` interface eliminates those columns automatically, which is very convenient. Sorry that I missed this point in my last reply.

was (Author: zwang):
For an outer join, it is difficult to eliminate the null columns from the result. The `join-using-column` interface eliminates those columns automatically, which is very convenient. Sorry that I missed this point in my last reply.

> DataFrame join-on-columns function should support null-safe equal
> -----------------------------------------------------------------
>
>                 Key: SPARK-13337
>                 URL: https://issues.apache.org/jira/browse/SPARK-13337
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Zhong Wang
>            Priority: Minor
>
> Currently, the join-on-columns function:
> {code}
> def join(right: DataFrame, usingColumns: Seq[String], joinType: String): DataFrame
> {code}
> performs a null-unsafe join. It would be great to have an option for a
> null-safe join.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
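
The issue description quoted above says the current join-on-columns function is not null-safe. As a rough illustration of what that distinction means, here is a minimal sketch in plain Scala (no Spark dependency). It models SQL NULL as `None`; the object and function names (`NullEquality`, `sqlEquals`, `nullSafeEquals`, `matches`) are made up for this sketch and are not part of Spark.

```scala
object NullEquality {
  // None stands in for SQL NULL.

  // Ordinary SQL equality: the result is unknown (None) if either side is NULL.
  def sqlEquals(a: Option[String], b: Option[String]): Option[Boolean] =
    for (x <- a; y <- b) yield x == y

  // Null-safe equality (written <=> in Spark SQL): NULL equals NULL,
  // and NULL never equals a non-NULL value.
  def nullSafeEquals(a: Option[String], b: Option[String]): Boolean =
    a == b

  // A join keeps a pair of rows only when the condition is definitely true,
  // so an unknown result behaves like "no match".
  def matches(condition: Option[Boolean]): Boolean =
    condition.contains(true)
}
```

Under this model, two rows whose key is NULL on both sides are paired by the null-safe comparison but not by the ordinary one, which is the behavior the issue asks to make available as an option.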
[jira] [Comment Edited] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal
    [ https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172881#comment-15172881 ]

Zhong Wang edited comment on SPARK-13337 at 2/29/16 11:40 PM:
--------------------------------------------------------------

It doesn't help in my case, because it doesn't support null-safe joins. It would be great if there were an interface like:
{code}
def join(right: DataFrame, usingColumns: Seq[String], joinType: String, nullSafe: Boolean): DataFrame
{code}

The current join-using-column interface works great if the joining tables don't contain null values: it automatically eliminates the null columns generated by outer joins. The general joining methods in your example support null-safe joins perfectly, but they cannot automatically eliminate the null columns generated by outer joins.

Sorry that it is a little bit complicated here. Please let me know if you need a concrete example.

was (Author: zwang):
It doesn't help in my case, because it doesn't support null-safe joins. It would be great if there were an interface like:
{code}
def join(right: DataFrame, usingColumns: Seq[String], joinType: String, nullSafe: Boolean): DataFrame
{code}

It works great if the joining tables don't contain null values: it automatically eliminates the null columns generated by outer joins. The general joining methods in your example support null-safe joins perfectly, but they cannot automatically eliminate the null columns generated by outer joins.

Sorry that it is a little bit complicated here. Please let me know if you need a concrete example.
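
To make the interface proposed in the comment above more concrete, here is a minimal sketch of how similar behavior might be approximated in user code. It is not Spark's implementation and not an existing Spark API: `NullSafeJoin` and `nullSafeJoin` are hypothetical names. The sketch assumes Spark 2.x or later on the classpath, that `left` and `right` are distinct DataFrames (not a self-join), and that the join type keeps columns from both sides (so not `leftsemi`). It joins with `<=>` on each key column and then merges each pair of key columns with `coalesce`.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.coalesce

object NullSafeJoin {
  // Hypothetical helper (an assumption for illustration, not part of Spark).
  def nullSafeJoin(left: DataFrame,
                   right: DataFrame,
                   usingColumns: Seq[String],
                   joinType: String): DataFrame = {
    require(usingColumns.nonEmpty, "at least one join column is required")

    // Null-safe equality on every key column, combined with AND.
    val condition: Column =
      usingColumns.map(c => left(c) <=> right(c)).reduce(_ && _)

    // For each key, take the left value if present, otherwise the right value.
    // On matched rows both sides agree; on unmatched rows one side is all null.
    val keys: Seq[Column] =
      usingColumns.map(c => coalesce(left(c), right(c)).as(c))

    // Non-key columns are passed through unchanged.
    val leftRest = left.columns.filterNot(usingColumns.contains).map(c => left(c))
    val rightRest = right.columns.filterNot(usingColumns.contains).map(c => right(c))

    left.join(right, condition, joinType)
      .select(keys ++ leftRest ++ rightRest: _*)
  }
}
```

A call would look like `NullSafeJoin.nullSafeJoin(df1, df2, Seq("key1", "key2"), "outer")`. A built-in `nullSafe` flag, as requested in the issue, would avoid having to spell this out by hand.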
[jira] [Comment Edited] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal
    [ https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175154#comment-15175154 ]

Zhong Wang edited comment on SPARK-13337 at 3/2/16 6:50 AM:
------------------------------------------------------------

Suppose we have two tables:

TableA
||key1||key2||value1||
|null|k1|v1|
|k2|k3|v2|

TableB
||key1||key2||value2||
|null|k1|v3|
|k4|k5|v4|

The result table I want is:

TableC
||key1||key2||value1||value2||
|null|k1|v1|v3|
|k2|k3|v2|null|
|k4|k5|null|v4|

We cannot use the current join-using-columns interface, because it doesn't support null-safe joins, and we have null values in the first row.

We cannot use join-select with an explicit "<=>" either, because the output table will look like:

||df1.key1||df1.key2||df2.key1||df2.key2||value1||value2||
|null|k1|null|k1|v1|v3|
|k2|k3|null|null|v2|null|
|null|null|k4|k5|null|v4|

It is difficult to get a result like TableC using a select clause, because the null values from the outer join (rows 2 and 3) can be in both the df1.* columns and the df2.* columns.

Hope this makes sense to you. I'd like to submit a PR if this is a real use case.
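
The two attempts described in the comment above can be reproduced with a short program. This is a sketch under assumptions: it uses `SparkSession` (Spark 2.x or later) in local mode, whereas the issue was filed against 1.6.0, where `SQLContext` would be used instead. The object name `Spark13337Example` and the variable names are made up for illustration. Output is printed with `show()` rather than asserted, and row order is not guaranteed.

```scala
import org.apache.spark.sql.SparkSession

object Spark13337Example {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SPARK-13337 example")
      .getOrCreate()
    import spark.implicits._

    // TableA and TableB from the comment; null marks the missing key1 value.
    val tableA = Seq[(String, String, String)](
      (null, "k1", "v1"),
      ("k2", "k3", "v2")
    ).toDF("key1", "key2", "value1")

    val tableB = Seq[(String, String, String)](
      (null, "k1", "v3"),
      ("k4", "k5", "v4")
    ).toDF("key1", "key2", "value2")

    // Attempt 1: join-using-columns. Each key column appears once in the
    // output, but the comparison is not null-safe, so the (null, k1) rows
    // are not expected to be paired with each other.
    tableA.join(tableB, Seq("key1", "key2"), "outer").show()

    // Attempt 2: explicit null-safe condition. The (null, k1) rows match,
    // but the output carries separate key columns from each side, as in the
    // six-column table shown in the comment.
    val condition =
      (tableA("key1") <=> tableB("key1")) && (tableA("key2") <=> tableB("key2"))
    tableA.join(tableB, condition, "outer").show()

    spark.stop()
  }
}
```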