[ https://issues.apache.org/jira/browse/SPARK-26739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Skyler Lehan updated SPARK-26739:
---------------------------------
    Description: 
h3. *Q1.* What are you trying to do? Articulate your objectives using 
absolutely no jargon.

Currently, in the join functions on 
[DataFrames|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset],
 the join types are defined via a string parameter called joinType. In order 
for a developer to know which joins are possible, they must look up the API 
call for join. While this works, it makes it easy to introduce a typo, resulting in improper joins and/or unexpected errors that aren't evident at compile time. The objective of this improvement is to let developers use a common definition for join types (an enum or constants) called JoinType. 
This would contain the possible joins and remove the possibility of a typo. It 
would also allow Spark to alter the names of the joins in the future without 
impacting end-users.
h3. *Q2.* What problem is this proposal NOT designed to solve?

The problem this solves is extremely narrow; it would not solve anything other 
than providing a common definition for join types.
h3. *Q3.* How is it done today, and what are the limits of current practice?

Currently, developers must join two DataFrames like so:
{code:java}
val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), "left_outer")
{code}
Here the developer manually types the join type as a string literal. As stated above, this:
 * Requires developers to manually type in the join type
 * Makes typos possible
 * Restricts renaming of join types, as it's a literal string
 * Does not restrict or compile-time check the join type being used, leading to runtime errors (see the example after this list)
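
For illustration (the misspelled join type and the exact failure mode here are an example, not quoted from Spark), a typo compiles cleanly and only fails once the query is evaluated at runtime:
{code:java}
// "left_oter" is a typo for "left_outer". This compiles without complaint and
// only fails at runtime when Spark rejects the unknown join type string.
val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), "left_oter")
{code}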

h3. *Q4.* What is new in your approach and why do you think it will be 
successful?

The new approach would use constants, something like this:
{code:java}
val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), JoinType.LEFT_OUTER)
{code}
This would provide:
 * In-code references/definitions of the possible join types
 ** This subsequently allows Scaladoc to be added describing what each join type does and how it operates
 * Removal of the possibility of a typo in the join type
 * Compile-time checking of the join type (only if an enum is used)

To clarify: if JoinType is a set of string constants, it would just fill in the joinType 
string parameter for users. If an enum is used, it would restrict the domain of 
possible join types to whatever is defined in the future JoinType enum. The 
enum is preferred; however, it would take longer to implement.
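
As a rough sketch of the enum approach (the names, the sealed-class encoding, and the sql field below are assumptions for illustration, not a committed design), a sealed hierarchy in Scala would restrict the domain to exactly the defined join types and give compile-time checking:
{code:java}
// Hypothetical sketch: a sealed hierarchy means the compiler knows the complete
// set of join types, so an invalid value simply cannot be constructed.
sealed abstract class JoinType(val sql: String)

object JoinType {
  case object INNER extends JoinType("inner")
  case object LEFT_OUTER extends JoinType("left_outer")
  case object RIGHT_OUTER extends JoinType("right_outer")
  case object FULL_OUTER extends JoinType("full_outer")
  case object LEFT_SEMI extends JoinType("left_semi")
  case object LEFT_ANTI extends JoinType("left_anti")
  case object CROSS extends JoinType("cross")
}
{code}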
h3. *Q5.* Who cares? If you are successful, what difference will it make?

Developers using Apache Spark will care. This will make the join function 
easier to wield and lead to fewer runtime errors. It will save time by moving 
join type validation to compile time. It will also provide an in-code reference to 
the join types, which saves developers the time of looking up and navigating the 
multiple join functions to find the possible join types. In addition, the 
resulting constants/enum would carry documentation on how each join type works.
h3. *Q6.* What are the risks?

Users of Apache Spark who currently use strings to define their join types 
could be impacted if an enum is chosen as the common definition. This risk can 
be mitigated by using string constants, which would be exactly the same strings 
as the literals used today. For example:
{code:java}
final val INNER: String = "inner"  // i.e., JoinType.INNER == "inner"
{code}
If an enum is still the preferred way of defining the join types, new join 
functions could be added that take in these enums and the join calls that 
contain string parameters for joinType could be deprecated. This would give 
developers a chance to change over to the new join types.
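
To make the migration concrete (reusing the hypothetical JoinType sketched under Q4), existing string-based calls would keep compiling, only with a deprecation warning, while new code opts into the checked overload:
{code:java}
// Existing code keeps working, but would now emit a deprecation warning.
val oldStyle = leftDF.join(rightDF, col("ID") === col("RightID"), "left_outer")

// New code uses the enum overload and gets compile-time checking of the join type.
val newStyle = leftDF.join(rightDF, col("ID") === col("RightID"), JoinType.LEFT_OUTER)
{code}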
h3. *Q7.* How long will it take?

A few days for a seasoned Spark developer.
h3. *Q8.* What are the mid-term and final "exams" to check for success?

The mid-term exam would be the addition of a common definition of the join types 
and additional join functions that accept the join type enum/constant. The 
final exam would be working tests that verify the functionality of these new 
join functions, along with deprecation of the join functions that take a string 
for joinType.
h3. *Appendix A.* Proposed API Changes. Optional section defining API changes, 
if any. Backward and forward compatibility must be taken into account.

*If enums are used:*

The following join function signatures would be added to the Dataset API:
{code:java}
def join(right: Dataset[_], joinExprs: Column, joinType: JoinType): DataFrame
def join(right: Dataset[_], usingColumns: Seq[String], joinType: JoinType): DataFrame
{code}
The following functions would be deprecated:
{code:java}
def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
{code}
A new enum would be created called JoinType. Developers would be encouraged to 
adopt JoinType instead of the literal strings.
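
One plausible implementation (a sketch only, assuming the enum carries its string form as in the Q4 sketch) is for the new overloads in Dataset to delegate to the existing string-based join, leaving planner behavior untouched:
{code:java}
// Hypothetical delegation: the enum-typed overloads forward the enum's string
// representation to the existing string-based join.
def join(right: Dataset[_], joinExprs: Column, joinType: JoinType): DataFrame =
  join(right, joinExprs, joinType.sql)

def join(right: Dataset[_], usingColumns: Seq[String], joinType: JoinType): DataFrame =
  join(right, usingColumns, joinType.sql)
{code}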

*If string constants are used:*

No current API changes; however, a new Scala object with string constants would 
be defined like so:
{code:java}
object JoinType {
  final val INNER: String = "inner"
  final val LEFT_OUTER: String = "left_outer"
}
{code}
This approach would not allow for compile-time checking of the join types.
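
A brief usage sketch of the constant approach: the constant merely supplies the same string the API already accepts, so callers who bypass it can still introduce a typo that only surfaces at runtime.
{code:java}
// Using the constant avoids typos in practice...
val ok = leftDF.join(rightDF, col("ID") === col("RightID"), JoinType.LEFT_OUTER)

// ...but nothing prevents passing an arbitrary string, so this still compiles
// and only fails at runtime.
val typo = leftDF.join(rightDF, col("ID") === col("RightID"), "left_oter")
{code}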



> Standardized Join Types for DataFrames
> --------------------------------------
>
>                 Key: SPARK-26739
>                 URL: https://issues.apache.org/jira/browse/SPARK-26739
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Skyler Lehan
>            Priority: Minor
>   Original Estimate: 48h
>  Remaining Estimate: 48h


