[jira] [Comment Edited] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal

2016-02-29 Thread Zhong Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172709#comment-15172709
 ] 

Zhong Wang edited comment on SPARK-13337 at 2/29/16 10:05 PM:
--

For an outer join, it is difficult to eliminate the null columns from the 
result, because the null columns can come from both tables. The 
`join-using-column` interface can automatically eliminate those columns, which 
is very convenient. Sorry that I missed this point in my last reply.


was (Author: zwang):
For an outer join, it is difficult to eliminate the null columns from the 
result. The `join-using-column` interface can automatically eliminate those 
columns, which is very convenient. Sorry that I missed this point in my last 
reply.

> DataFrame join-on-columns function should support null-safe equal
> -
>
> Key: SPARK-13337
> URL: https://issues.apache.org/jira/browse/SPARK-13337
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Zhong Wang
>Priority: Minor
>
> Currently, the join-on-columns function:
> {code}
> def join(right: DataFrame, usingColumns: Seq[String], joinType: String): 
> DataFrame
> {code}
> performs a join that is not null-safe. It would be great to have an option 
> for a null-safe join.
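
To make the distinction concrete, here is a small plain-Scala sketch (no Spark required) of the two equality semantics. Modelling a nullable value as `Option[String]`, and the names `NullSafeEquality`, `sqlEq` and `nullSafeEq`, are assumptions made purely for illustration; they are not Spark APIs.

```scala
// None stands for SQL NULL in this illustrative model.
object NullSafeEquality {
  // Regular SQL equality: the result is NULL (None) as soon as either side is NULL,
  // so a join condition built from it can never match two NULL keys.
  def sqlEq(a: Option[String], b: Option[String]): Option[Boolean] =
    for (x <- a; y <- b) yield x == y

  // Null-safe equality (written <=> in Spark SQL): always true or false,
  // and NULL <=> NULL is true.
  def nullSafeEq(a: Option[String], b: Option[String]): Boolean = a == b
}
```

With this model, `sqlEq(None, None)` is `None` (the pair is dropped by a join), while `nullSafeEq(None, None)` is `true` (the pair is kept).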



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal

2016-02-29 Thread Zhong Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172881#comment-15172881
 ] 

Zhong Wang edited comment on SPARK-13337 at 2/29/16 11:40 PM:
--

It doesn't help in my case, because it doesn't support null-safe joins. It 
would be great if there is an interface like:

{code}
def join(right: DataFrame, usingColumns: Seq[String], joinType: String, 
nullSafe: Boolean): DataFrame
{code}

The current join-using-column interface works great if the joining tables 
don't contain null values: it automatically eliminates the null columns 
generated by outer joins. The general joining methods in your example support 
null-safe joins perfectly, but they cannot automatically eliminate the null 
columns that outer joins generate.

Sorry that it is a little bit complicated here. Please let me know if you need 
a concrete example.
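
As a rough illustration of what such an interface could do, here is a minimal sketch built only on the existing DataFrame API. The object `NullSafeJoin` and the method `joinNullSafe` are hypothetical names, not part of Spark; the sketch assumes the two inputs are distinct DataFrames (not a self-join), that they share no column names other than the join keys, and that `joinType` is an inner or outer join type.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.coalesce

object NullSafeJoin {
  // Hypothetical helper (not a Spark API): joins on the named columns with
  // null-safe equality and keeps a single copy of each join column.
  def joinNullSafe(left: DataFrame, right: DataFrame,
                   usingColumns: Seq[String], joinType: String): DataFrame = {
    require(usingColumns.nonEmpty, "at least one join column is required")

    // left.key1 <=> right.key1 AND left.key2 <=> right.key2 AND ...
    val condition: Column = usingColumns
      .map(c => left(c) <=> right(c))
      .reduce(_ && _)

    // In an outer join the key may be null on either side, so take whichever
    // side is non-null and expose it under the original column name.
    val keys = usingColumns.map(c => coalesce(left(c), right(c)).as(c))
    val leftRest = left.columns.filterNot(c => usingColumns.contains(c)).map(c => left(c))
    val rightRest = right.columns.filterNot(c => usingColumns.contains(c)).map(c => right(c))

    left.join(right, condition, joinType).select(keys ++ leftRest ++ rightRest: _*)
  }
}
```

A call such as `NullSafeJoin.joinNullSafe(df1, df2, Seq("key1", "key2"), "outer")` would then play the role of the proposed `nullSafe = true` option.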


was (Author: zwang):
It doesn't help in my case, because it doesn't support null-safe joins. It 
would be great if there is an interface like:

{code}
def join(right: DataFrame, usingColumns: Seq[String], joinType: String, 
nullSafe:Boolean): DataFrame
{code}

It works great if the joining tables don't contain null values: it 
automatically eliminates the null columns generated by outer joins. The 
general joining methods in your example support null-safe joins perfectly, but 
they cannot automatically eliminate the null columns that outer joins 
generate.

Sorry that it is a little bit complicated here. Please let me know if you need 
a concrete example.







[jira] [Comment Edited] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal

2016-03-01 Thread Zhong Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175154#comment-15175154
 ] 

Zhong Wang edited comment on SPARK-13337 at 3/2/16 6:50 AM:


Suppose we have two tables:
--
TableA
||key1||key2||value1||
|null|k1|v1|
|k2|k3|v2|

TableB
||key1||key2||value2||
|null|k1|v3|
|k4|k5|v4|

The result table I want is:
--
TableC
||key1||key2||value1||value2||
|null|k1|v1|v3|
|k2|k3|v2|null|
|k4|k5|null|v4|

We cannot use the current join-using-columns interface, because it doesn't 
support null-safe joins, and we have null values in the first row.

We cannot use a join followed by a select with an explicit "<=>" either, 
because the output table would look like:
--
||df1.key1||df1.key2||df2.key1||df2.key2||value1||value2||
|null|k1|null|k1|v1|v3|
|k2|k3|null|null|v2|null|
|null|null|k4|k5|null|v4|

It is difficult to get a result like TableC using a select clause, because the 
null values from the outer join (rows 2 and 3) can appear in both the df1.* 
columns and the df2.* columns.

Hope this makes sense to you. I'd like to submit a PR if this is a real use case.
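
For reference, one possible workaround with the existing API is to merge each pair of key columns with `coalesce` after the null-safe join. The sketch below rebuilds TableA and TableB from the tables above; the object `TableCExample`, the method `buildTableC`, and the use of an `SQLContext` passed in by the caller are assumptions for illustration, and the approach has to be spelled out per key column, which is the inconvenience this issue is about.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.coalesce

object TableCExample {
  // Hypothetical helper: builds TableA and TableB as DataFrames and derives TableC.
  def buildTableC(sqlContext: SQLContext): DataFrame = {
    import sqlContext.implicits._

    // TableA and TableB; a Scala null in a tuple becomes a SQL NULL.
    val df1 = Seq[(String, String, String)]((null, "k1", "v1"), ("k2", "k3", "v2"))
      .toDF("key1", "key2", "value1")
    val df2 = Seq[(String, String, String)]((null, "k1", "v3"), ("k4", "k5", "v4"))
      .toDF("key1", "key2", "value2")

    // Null-safe full outer join on both keys.
    val joined = df1.join(
      df2,
      (df1("key1") <=> df2("key1")) && (df1("key2") <=> df2("key2")),
      "outer")

    // For unmatched rows exactly one side carries the keys, so coalesce picks
    // the side that is present; for matched rows both sides agree.
    joined.select(
      coalesce(df1("key1"), df2("key1")).as("key1"),
      coalesce(df1("key2"), df2("key2")).as("key2"),
      df1("value1"),
      df2("value2"))
  }
}
```

The resulting DataFrame should contain the three rows of TableC, in no guaranteed order.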


was (Author: zwang):
Suppose we have two tables:
--
TableA
||key1||key2||value1||
|null|k1|v1|
|k2|k3|v2|

TableB
||key1||key2||value2||
|null|k1|v3|
|k4|k5|v4|

The result table I want is:
--
TableC
||key1||key2||value1||value2||
|null|k1|v1|v3|
|k2|k3|v2|null|
|k4|k5|null|v4|

We cannot use the current join-using-columns interface, because it doesn't 
support null-safe joins, and we have null values in the first row.

We cannot use a join followed by a select with an explicit "<=>" either, 
because the output table would look like:
--
||df1.key1||df1.key2||df2.key1||df2.key2||value1||value2||
|null|k1|null|k1|v1|v3|
|k2|k3|null|null|v2|null|
null|null|k4|k5|null|v4|

It is difficult to get a result like TableC using a select clause, because the 
null values from the outer join (rows 2 and 3) can appear in both the df1.* 
columns and the df2.* columns.

Hope this makes sense to you. I'd like to submit a PR if this is a real use case.







[jira] [Comment Edited] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal

2016-03-01 Thread Zhong Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175154#comment-15175154
 ] 

Zhong Wang edited comment on SPARK-13337 at 3/2/16 6:50 AM:


Suppose we are joining two tables:
--
TableA
||key1||key2||value1||
|null|k1|v1|
|k2|k3|v2|

TableB
||key1||key2||value2||
|null|k1|v3|
|k4|k5|v4|

The result table I want is:
--
TableC
||key1||key2||value1||value2||
|null|k1|v1|v3|
|k2|k3|v2|null|
|k4|k5|null|v4|

We cannot use the current join-using-columns interface, because it doesn't 
support null-safe joins, and we have null values in the first row.

We cannot use a join followed by a select with an explicit "<=>" either, 
because the output table would look like:
--
||df1.key1||df1.key2||df2.key1||df2.key2||value1||value2||
|null|k1|null|k1|v1|v3|
|k2|k3|null|null|v2|null|
|null|null|k4|k5|null|v4|

It is difficult to get a result like TableC using a select clause, because the 
null values from the outer join (rows 2 and 3) can appear in both the df1.* 
columns and the df2.* columns.

Hope this makes sense to you. I'd like to submit a PR if this is a real use case.


was (Author: zwang):
Suppose we have two tables:
--
TableA
||key1||key2||value1||
|null|k1|v1|
|k2|k3|v2|

TableB
||key1||key2||value2||
|null|k1|v3|
|k4|k5|v4|

The result table I want is:
--
TableC
||key1||key2||value1||value2||
|null|k1|v1|v3|
|k2|k3|v2|null|
|k4|k5|null|v4|

We cannot use the current join-using-columns interface, because it doesn't 
support null-safe joins, and we have null values in the first row.

We cannot use a join followed by a select with an explicit "<=>" either, 
because the output table would look like:
--
||df1.key1||df1.key2||df2.key1||df2.key2||value1||value2||
|null|k1|null|k1|v1|v3|
|k2|k3|null|null|v2|null|
|null|null|k4|k5|null|v4|

It is difficult to get a result like TableC using a select clause, because the 
null values from the outer join (rows 2 and 3) can appear in both the df1.* 
columns and the df2.* columns.

Hope this makes sense to you. I'd like to submit a PR if this is a real use case.



