[ https://issues.apache.org/jira/browse/SPARK-26739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754630#comment-16754630 ]
Hyukjin Kwon commented on SPARK-26739:
--------------------------------------

So, is it just to propose constant variables for join types?

Standardized Join Types for DataFrames
--------------------------------------

                 Key: SPARK-26739
                 URL: https://issues.apache.org/jira/browse/SPARK-26739
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Skyler Lehan
            Priority: Minor
   Original Estimate: 48h
  Remaining Estimate: 48h

h3. *Q1.* What are you trying to do? Articulate your objectives using absolutely no jargon.

Currently, in the join functions on [DataFrames|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset], the join type is specified via a string parameter called joinType. To know which join types are possible, a developer must look up the API documentation for join. While this works, it makes typos easy, and a typo results in an improper join and/or errors that are not evident at compile time. The objective of this improvement is to give developers a common definition for join types (via an enum or constants) called JoinType. This would enumerate the possible joins and remove the possibility of a typo. It would also allow Spark to rename join types in the future without impacting end users.

h3. *Q2.* What problem is this proposal NOT designed to solve?

The problem this solves is extremely narrow; it would not solve anything other than providing a common definition for join types.

h3. *Q3.* How is it done today, and what are the limits of current practice?

Currently, developers must join two DataFrames like so:
{code:java}
val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), "left_outer")
{code}
where they manually type the join type. As stated above, this:
* Requires developers to type the join type in by hand
* Creates the possibility of typos
* Prevents renaming join types later, since the name is a string literal in user code
* Provides no compile-time check of the join type, leading to runtime errors

h3. *Q4.* What is new in your approach and why do you think it will be successful?

The new approach would use constants, something like this:
{code:java}
val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), JoinType.LEFT_OUTER)
{code}
This would provide:
* In-code references/definitions of the possible join types
** This subsequently allows Scaladoc describing what each join type does and how it operates
* No possibility of a typo in the join type
* Compile-time checking of the join type (only if an enum is used)

To clarify: if JoinType is a set of constants, it would just fill in the joinType string parameter for users. If an enum is used, it would restrict the domain of possible join types to whatever is defined in the JoinType enum. The enum is preferred; however, it would take longer to implement. A sketch of the enum approach follows below.
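As a rough sketch of the enum option (the names and the sql field here are illustrative assumptions for discussion, not an existing Spark API), a sealed class with case objects would give compile-time checking while still mapping onto the join-type strings accepted today:
{code:java}
// Illustrative sketch of an enum-style JoinType; the names and the `sql`
// field are assumptions, not a committed design or existing Spark API.
sealed abstract class JoinType(val sql: String)

object JoinType {
  case object INNER       extends JoinType("inner")
  case object LEFT_OUTER  extends JoinType("left_outer")
  case object RIGHT_OUTER extends JoinType("right_outer")
  case object FULL_OUTER  extends JoinType("full_outer")
  case object LEFT_SEMI   extends JoinType("left_semi")
  case object LEFT_ANTI   extends JoinType("left_anti")
  case object CROSS       extends JoinType("cross")
}
{code}
Because the compiler only accepts values of type JoinType, a misspelled join type such as "left_outerr" becomes a compile error rather than a runtime failure, and each case object is a natural place to attach Scaladoc.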
h3. *Q5.* Who cares? If you are successful, what difference will it make?

Developers using Apache Spark will care. This change will make the join function easier to use correctly and lead to fewer runtime errors. It will save time by moving join type validation to compile time. It will also provide an in-code reference to the join types, saving developers from having to look up and navigate the multiple join functions to find the possible join types. In addition, the resulting constants/enum would carry documentation on how each join type works.

h3. *Q6.* What are the risks?

Users of Apache Spark who currently use strings to define their join types could be impacted if an enum is chosen as the common definition. This risk can be mitigated by using string constants. The string constants would be exactly the same strings as the literals used today, for example:
{code:java}
// JoinType.INNER would simply be the existing string literal:
final val INNER: String = "inner"
{code}
If an enum is still the preferred way of defining the join types, new join functions could be added that take these enums, and the join overloads that take a string joinType parameter could be deprecated. This would give developers a chance to migrate to the new join types.

h3. *Q7.* How long will it take?

A few days for a seasoned Spark developer.

h3. *Q8.* What are the mid-term and final "exams" to check for success?

The mid-term exam would be the addition of a common definition of the join types, plus new join functions that accept the join type enum/constant. The final exam would be passing tests that verify these new join functions, with the string-based join functions deprecated.

h3. *Appendix A.* Proposed API Changes. Optional section defining API changes, if any. Backward and forward compatibility must be taken into account.

*If enums are used:*

The following join function signatures would be added to the Dataset API:
{code:java}
def join(right: Dataset[_], joinExprs: Column, joinType: JoinType): DataFrame
def join(right: Dataset[_], usingColumns: Seq[String], joinType: JoinType): DataFrame
{code}
The following functions would be deprecated:
{code:java}
def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
{code}
A new enum called JoinType would be created, and developers would be encouraged to adopt it in place of the literal strings.

*If string constants are used:*

No current API changes; however, a new Scala object with string constants would be defined like so:
{code:java}
object JoinType {
  final val INNER: String = "inner"
  final val LEFT_OUTER: String = "left_outer"
}
{code}
This approach would not allow for compile-time checking of the join types.
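If the enum route is taken, a minimal sketch of how backward compatibility could be preserved, assuming the hypothetical JoinType enum sketched earlier with its sql field: the new overloads simply delegate to the existing string-based join, so runtime behavior is unchanged while the string overloads are phased out.
{code:java}
// Minimal sketch only (not a committed design): inside Dataset, the new
// enum-taking overloads delegate to the existing string-based join, so
// behavior stays identical and the string overloads can be deprecated later.
def join(right: Dataset[_], joinExprs: Column, joinType: JoinType): DataFrame =
  join(right, joinExprs, joinType.sql)

def join(right: Dataset[_], usingColumns: Seq[String], joinType: JoinType): DataFrame =
  join(right, usingColumns, joinType.sql)
{code}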