[GitHub] spark pull request #22889: [SPARK-25882][SQL] Added a function to join two d...

2018-11-05 Thread arman1371
Github user arman1371 closed the pull request at:

https://github.com/apache/spark/pull/22889


---




[GitHub] spark pull request #22889: [SPARK-25882][SQL] Added a function to join two d...

2018-11-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22889#discussion_r230614726
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -883,6 +883,31 @@ class Dataset[T] private[sql](
 join(right, Seq(usingColumn))
   }
 
+  /**
+* Equi-join with another `DataFrame` using the given column.
+*
+* Different from other join functions, the join column will only appear once in the output,
+* i.e. similar to SQL's `JOIN USING` syntax.
+*
+* {{{
+*   // Left join of df1 and df2 using the column "user_id"
+*   df1.join(df2, "user_id", "left")
+* }}}
+*
+* @param right Right side of the join operation.
+* @param usingColumn Name of the column to join on. This column must exist on both sides.
+* @param joinType Type of join to perform. Default `inner`. Must be one of:
+*                 `inner`, `cross`, `outer`, `full`, `full_outer`, `left`, `left_outer`,
+*                 `right`, `right_outer`, `left_semi`, `left_anti`.
+* @note If you perform a self-join using this function without aliasing the input
+* `DataFrame`s, you will NOT be able to reference any columns after the join, since
+* there is no way to disambiguate which side of the join you would like to reference.
+* @group untypedrel
+*/
+  def join(right: Dataset[_], usingColumn: String, joinType: String): DataFrame = {
--- End diff --

Normally, we do not add such an API. 


---
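For context, the quoted diff ends at the signature, so the PR's method body is not shown. A minimal sketch of what such sugar presumably does, namely wrapping the single column name in a `Seq` and forwarding to the existing general API; the wrapper object and the name `joinUsing` are invented for illustration so the snippet compiles outside `Dataset.scala`:

```scala
import org.apache.spark.sql.{DataFrame, Dataset}

object JoinSugar {
  // Hypothetical extension; the name `joinUsing` avoids clashing with the
  // real overloads already defined on Dataset.
  implicit class SingleColumnJoin[T](val ds: Dataset[T]) {
    def joinUsing(right: Dataset[_], usingColumn: String, joinType: String): DataFrame =
      // Presumably all the proposed overload does: wrap the one column
      // name in a Seq and delegate to the existing Seq[String] variant.
      ds.join(right, Seq(usingColumn), joinType)
  }
}

// Usage: import JoinSugar._ and then df1.joinUsing(df2, "user_id", "left")
```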




[GitHub] spark pull request #22889: [SPARK-25882][SQL] Added a function to join two d...

2018-11-03 Thread arman1371
Github user arman1371 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22889#discussion_r230575855
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
+  def join(right: Dataset[_], usingColumn: String, joinType: String): DataFrame = {
--- End diff --

Yes, for the Scala API it's just 5 characters, but in the Java API it is much more awkward to turn a `String` into a `Seq[String]`.


---




[GitHub] spark pull request #22889: [SPARK-25882][SQL] Added a function to join two d...

2018-11-03 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22889#discussion_r230571447
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
+  /**
+* Equi-join with another `DataFrame` using the given column.
--- End diff --

Please fix the indentation here.


---
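The nit being flagged: in the patch, the Scaladoc `*` lines sit flush left rather than aligned one column in from the opening `/**`. The conventional layout, elided to a few lines, looks like:

```scala
  /**
   * Equi-join with another `DataFrame` using the given column.
   * ...
   * @group untypedrel
   */
```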




[GitHub] spark pull request #22889: [SPARK-25882][SQL] Added a function to join two d...

2018-11-03 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22889#discussion_r230571437
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
+  def join(right: Dataset[_], usingColumn: String, joinType: String): DataFrame = {
--- End diff --

@arman1371 . We understand that this PR is trying to add syntactic sugar,
but you only need 5 extra characters, `Seq(` and `)`, to use the existing
general API. Personally, I agree with @wangyum; I would prefer not to add this.

Historically,
1. Spark 1.4 added the `Seq[String]` version to support PySpark (SPARK-7990).
2. Spark 1.6 added the `joinType` parameter to the `Seq[String]` version (SPARK-10446).

That was a long time ago. Given that, I guess the Apache Spark community
intentionally didn't add a `String` version of this in order to keep the
number of APIs on `Dataset` small. Anyway, since you need an answer, let's
ask for the general opinion again to make a decision.

Hi, @rxin, @cloud-fan, @gatorsmile. Did we explicitly decide not to add
this API? It seems that @arman1371 wants to add it for feature parity with
PySpark in Spark 3.0.0.


---
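As background for the `@note` in the proposed Scaladoc (quoted in full earlier in the thread): the standard workaround for self-joins is to alias both sides before joining, which keeps each side's columns addressable afterwards. A minimal sketch, with all data and names invented for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("self-join-demo").getOrCreate()
import spark.implicits._

val people = Seq((1, 2), (2, 2)).toDF("id", "manager_id")

// people.join(people, ...) without aliases leaves both sides' columns
// ambiguous after the join; with aliases each side stays referenceable.
val joined = people.as("e")
  .join(people.as("m"), col("e.manager_id") === col("m.id"), "left")
  .select(col("e.id").as("employee_id"), col("m.id").as("manager_id"))
joined.show()
```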




[GitHub] spark pull request #22889: [SPARK-25882][SQL] Added a function to join two d...

2018-11-03 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/22889#discussion_r230570164
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
+  def join(right: Dataset[_], usingColumn: String, joinType: String): DataFrame = {
--- End diff --

cc @dongjoon-hyun


---




[GitHub] spark pull request #22889: [SPARK-25882][SQL] Added a function to join two d...

2018-11-03 Thread arman1371
Github user arman1371 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22889#discussion_r230560687
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
+  def join(right: Dataset[_], usingColumn: String, joinType: String): DataFrame = {
--- End diff --

No, it's a simpler call. As I said, we already have both
`def join(right: Dataset[_], usingColumn: String)` and
`def join(right: Dataset[_], usingColumns: Seq[String])`;
by that reasoning, the first function should be removed as well.


---




[GitHub] spark pull request #22889: [SPARK-25882][SQL] Added a function to join two d...

2018-11-03 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/22889#discussion_r230559647
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
+  def join(right: Dataset[_], usingColumn: String, joinType: String): DataFrame = {
--- End diff --

Could we close this?


---




[GitHub] spark pull request #22889: [SPARK-25882][SQL] Added a function to join two d...

2018-11-03 Thread arman1371
Github user arman1371 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22889#discussion_r230559472
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
+  def join(right: Dataset[_], usingColumn: String, joinType: String): DataFrame = {
--- End diff --

The answer to both questions is yes.


---




[GitHub] spark pull request #22889: [SPARK-25882][SQL] Added a function to join two d...

2018-11-03 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/22889#discussion_r230557937
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
+  def join(right: Dataset[_], usingColumn: String, joinType: String): DataFrame = {
--- End diff --

@arman1371 What do you think? ```def join(right: Dataset[_], usingColumn: String, joinType: String)``` only supports one column, right?


---




[GitHub] spark pull request #22889: [SPARK-25882][SQL] Added a function to join two d...

2018-11-03 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/22889#discussion_r230555316
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
+  def join(right: Dataset[_], usingColumn: String, joinType: String): DataFrame = {
--- End diff --

So in your case, could you replace ```df1.join(df2, "user_id", "left")``` with ```df1.join(df2, Seq("user_id"), "left")```?


---
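To make the suggested replacement concrete, a self-contained sketch using the existing `Seq`-based overload; the data and names are invented for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("join-using-demo").getOrCreate()
import spark.implicits._

val df1 = Seq((1, "alice"), (2, "bob")).toDF("user_id", "name")
val df2 = Seq((1, 100)).toDF("user_id", "score")

// The existing general API already expresses the PR's use case: wrap the
// column name in Seq(...) and pass the join type. As with SQL's JOIN USING,
// "user_id" appears only once in the output.
val joined = df1.join(df2, Seq("user_id"), "left")
joined.show()
// user_id=1 -> name=alice, score=100; user_id=2 -> name=bob, score=null
```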




[GitHub] spark pull request #22889: [SPARK-25882][SQL] Added a function to join two d...

2018-11-03 Thread arman1371
Github user arman1371 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22889#discussion_r230554936
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
+  def join(right: Dataset[_], usingColumn: String, joinType: String): DataFrame = {
--- End diff --

It takes a `String` parameter for the join column instead of a `Seq[String]`.

It mirrors the existing pair `def join(right: Dataset[_], usingColumn: String)` and `def join(right: Dataset[_], usingColumns: Seq[String])`, which were implemented before.


---




[GitHub] spark pull request #22889: [SPARK-25882][SQL] Added a function to join two d...

2018-11-03 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/22889#discussion_r230554727
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
+  def join(right: Dataset[_], usingColumn: String, joinType: String): DataFrame = {
--- End diff --

What is the difference between this function and ```def join(right: Dataset[_], usingColumns: Seq[String], joinType: String)```?

https://github.com/apache/spark/blob/v2.4.0/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L932-L946


---




[GitHub] spark pull request #22889: [SPARK-25882][SQL] Added a function to join two d...

2018-10-30 Thread arman1371
GitHub user arman1371 opened a pull request:

https://github.com/apache/spark/pull/22889

[SPARK-25882][SQL] Added a function to join two datasets using one column 
with join type parameter

## What changes were proposed in this pull request?
Added a function to join two datasets using one column, with the join type as a parameter.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/arman1371/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22889.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22889


commit 16d07c6da03706b206f0d24317e2ff8b1e489cf3
Author: Arman 
Date:   2018-10-30T08:46:11Z

Added function to join two datasets using one column and get the join type 
as a parameter




---
