[jira] [Commented] (SPARK-27785) Introduce .joinWith() overloads for typed inner joins of 3 or more tables

2019-06-07 Thread Hyukjin Kwon (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-27785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858402#comment-16858402 ]
Hyukjin Kwon commented on SPARK-27785:
--

I don't have much information about how commonly this typed API is used. If it 
is common enough and frequently requested, it might be worth doing. The problem 
sounds valid, but I'm not yet convinced of the importance of this API.

For instance, we probably wouldn't expose overloads for every arity from 1 to 
22 the way the UDF API does.

> Introduce .joinWith() overloads for typed inner joins of 3 or more tables
> -
>
> Key: SPARK-27785
> URL: https://issues.apache.org/jira/browse/SPARK-27785
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Priority: Major
>
> Today it's rather painful to do a typed dataset join of more than two tables: 
> {{Dataset[A].joinWith(Dataset[B])}} returns {{Dataset[(A, B)]}} so chaining 
> on a third inner join requires users to specify a complicated join condition 
> (referencing variables like {{_1}} or {{_2}} in the join condition, AFAIK), 
> resulting in a doubly-nested schema like {{Dataset[((A, B), C)]}}. Things become 
> even more painful if you want to layer on a fourth join. Using {{.map()}} to 
> flatten the data into {{Dataset[(A, B, C)]}} has a performance penalty, too.
> To simplify this use case, I propose to introduce a new set of overloads of 
> {{.joinWith}}, supporting joins of {{N > 2}} tables for {{N}} up to some 
> reasonable number (say, 6). For example:
> {code:scala}
> Dataset[T].joinWith[T1, T2](
>   ds1: Dataset[T1],
>   ds2: Dataset[T2],
>   condition: Column
> ): Dataset[(T, T1, T2)]
> Dataset[T].joinWith[T1, T2, T3](
>   ds1: Dataset[T1],
>   ds2: Dataset[T2],
>   ds3: Dataset[T3],
>   condition: Column
> ): Dataset[(T, T1, T2, T3)]{code}
> I propose to do this only for inner joins (consistent with the default join 
> type for {{joinWith}} when no join type is specified).
> I haven't thought about this too much yet and am not committed to the API 
> proposed above (it's just my initial idea), so I'm open to suggestions for 
> alternative typed APIs for this.
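To make the nesting concrete, here is a hypothetical plain-Scala sketch of the shapes described in the issue, using {{Seq}} in place of Spark's {{Dataset}} (the {{User}}/{{Order}}/{{Item}} types and all names are made up for illustration):

```scala
// Hypothetical sketch of the shapes described above, using plain Seq instead
// of Spark Datasets (User/Order/Item and all names here are made up).
object NestedJoinShape {
  case class User(id: Int, name: String)
  case class Order(id: Int, userId: Int)
  case class Item(id: Int, orderId: Int)

  // Chaining two-way joins yields the doubly nested ((A, B), C) shape,
  // analogous to ds.joinWith(ds2, cond).joinWith(ds3, cond2):
  def chained(users: Seq[User], orders: Seq[Order], items: Seq[Item]): Seq[((User, Order), Item)] =
    for {
      u <- users
      o <- orders if o.userId == u.id
      i <- items if i.orderId == o.id
    } yield ((u, o), i)

  // The extra map flattens it into the (A, B, C) shape that the proposed
  // three-way joinWith would return directly; in Spark, this map is the
  // step that carries the performance penalty mentioned in the issue.
  def flattened(users: Seq[User], orders: Seq[Order], items: Seq[Item]): Seq[(User, Order, Item)] =
    chained(users, orders, items).map { case ((u, o), i) => (u, o, i) }
}
```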



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27785) Introduce .joinWith() overloads for typed inner joins of 3 or more tables

2019-05-30 Thread Josh Rosen (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-27785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852089#comment-16852089 ]

Josh Rosen commented on SPARK-27785:


I think this might require a little bit more design work before beginning 
implementation, especially along the following dimensions:
 # What arity do we max out at? At work I have some use-cases involving joins 
of 10+ tables.
 # Do we put this into {{Dataset}}? Or factor it out into a separate 
{{Multijoin}} helper object? Is there precedent for this in other frameworks 
(Scalding, Flink, etc.) that we could mirror?
 # Instead of writing out each case by hand, can we write some Scala code to 
generate the signatures / code for us (similar to how the {{ScalaUDF}} 
overloads were defined)?
 # In addition to saving some typing / projection, does this new API let us 
solve SPARK-19468 for a limited subset of cases?
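As a rough illustration of point 3, the overload signatures could be emitted mechanically per arity rather than written by hand. This sketch is illustrative only: it is not how Spark actually generates its {{ScalaUDF}} overloads, and the formatting and names are made up.

```scala
// Hedged sketch: generate the proposed joinWith signatures per arity
// instead of writing each overload by hand. Illustrative only -- not
// the actual Spark codegen for UDF overloads.
object JoinWithSignatureGen {
  // For n extra datasets, emit the signature joining them onto Dataset[T].
  def signature(n: Int): String = {
    val typeParams = (1 to n).map(i => s"T$i").mkString(", ")
    val params = (1 to n).map(i => s"ds$i: Dataset[T$i]").mkString(", ")
    s"def joinWith[$typeParams]($params, condition: Column): Dataset[(T, $typeParams)]"
  }

  // Signatures for arities 2 through maxArity (e.g. 6, or higher if needed).
  def allSignatures(maxArity: Int): Seq[String] =
    (2 to maxArity).map(signature)
}
```

Whatever the chosen maximum arity, generating the declarations this way keeps them consistent and makes it cheap to raise the limit later.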




[jira] [Commented] (SPARK-27785) Introduce .joinWith() overloads for typed inner joins of 3 or more tables

2019-05-28 Thread Swapnil (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-27785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849880#comment-16849880 ]

Swapnil commented on SPARK-27785:
-

I like this idea. I can start working on it if it seems promising. 
[~hyukjin.kwon] - Do you have any initial thoughts/suggestions on this proposal?
