[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

2022-11-01 Thread GitBox


amaliujia commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1010097145


##
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##
@@ -217,3 +218,23 @@ message Sample {
 int64 seed = 1;
   }
 }
+
+// Relation of type [[Range]] that generates a sequence of integers.
+message Range {
+  // Optional. Default value = 0
+  int32 start = 1;
+  int32 end = 2;
+  // Optional. Default value = 1
+  Step step = 3;

Review Comment:
   Updating in https://github.com/apache/spark/pull/38460.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

2022-10-31 Thread GitBox


amaliujia commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1010090254


##
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##
@@ -217,3 +218,23 @@ message Sample {
 int64 seed = 1;
   }
 }
+
+// Relation of type [[Range]] that generates a sequence of integers.
+message Range {
+  // Optional. Default value = 0
+  int32 start = 1;
+  int32 end = 2;
+  // Optional. Default value = 1
+  Step step = 3;

Review Comment:
   Yes, let me follow up. I guess I was looking at the Python-side API, which is how I 
confused myself about the types.






[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

2022-10-31 Thread GitBox


amaliujia commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1009835612


##
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##
@@ -207,3 +208,23 @@ message Sample {
 int64 seed = 1;
   }
 }
+
+// Relation of type [[Range]] that generates a sequence of integers.
+message Range {
+  // Optional. Default value = 0
+  int32 start = 1;
+  int32 end = 2;
+  // Optional. Default value = 1
+  Step step = 3;
+  // Optional. Default value is assigned by 1) SQL conf "spark.sql.leafNodeDefaultParallelism" if
+  // it is set, or 2) spark default parallelism.
+  NumPartitions num_partitions = 4;

Review Comment:
   There are two dimensions of things in this area:
   1. Required versus optional.
   A required field must be set; an optional field may or may not be set.
   
   2. Whether the field has a default value.
   A field can fall back to a default value when it is not set.
   
   The second point builds on the first: when an optional field is not set, a default 
value can be used in its place.
   
   There are special cases where the proto default value is the same as the default 
value that Spark uses. In those cases we don't need to distinguish set from unset. 
Otherwise we need this approach to differentiate `set versus not set`, so we can 
adopt Spark's default values (unless we don't care about Spark's defaults).
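   As a sketch of the wrapper approach under discussion (the inner field name 
`value` is an assumption for illustration, not necessarily what the PR uses):
   
   ```protobuf
   // Sketch: wrapping a scalar in a single-field message gives it explicit
   // presence in proto3, since message-typed fields support has_... checks.
   message Step {
     int32 value = 1;
   }
   
   message Range {
     int32 start = 1;
     int32 end = 2;
     // Unset (has_step == false) means "fall back to Spark's default of 1",
     // which is distinguishable from a step explicitly set to 0.
     Step step = 3;
   }
   ```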
   
   
   






[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

2022-10-31 Thread GitBox


amaliujia commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1009836536


##
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##
@@ -207,3 +208,23 @@ message Sample {
 int64 seed = 1;
   }
 }
+
+// Relation of type [[Range]] that generates a sequence of integers.
+message Range {
+  // Optional. Default value = 0
+  int32 start = 1;
+  int32 end = 2;
+  // Optional. Default value = 1
+  Step step = 3;
+  // Optional. Default value is assigned by 1) SQL conf "spark.sql.leafNodeDefaultParallelism" if
+  // it is set, or 2) spark default parallelism.
+  NumPartitions num_partitions = 4;

Review Comment:
   To really answer your question: if we plan to respect Spark's default values for 
those optional fields whose proto default values differ from Spark's defaults, this 
wrapper approach is the only way to do so.






[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

2022-10-31 Thread GitBox


amaliujia commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1009828791


##
connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala:
##
@@ -175,6 +178,28 @@ package object dsl {
   }
 
   object plans { // scalastyle:ignore
+    implicit class DslMockRemoteSession(val session: MockRemoteSession) {
+      def range(
+          start: Option[Int],
+          end: Int,
+          step: Option[Int],
+          numPartitions: Option[Int]): Relation = {
+        val range = proto.Range.newBuilder()

Review Comment:
   It makes sense for `SparkConnectPlanner`, where Catalyst and proto are mixed 
together, and we are keeping the approach you are asking for there.
   
   However, this is the Connect DSL, which only deals with protos. No Catalyst is 
included in this package: 
https://github.com/apache/spark/blob/9fc3aa0b1c092ab1f13b26582e3ece7440fbfc3b/connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala#L17






[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

2022-10-31 Thread GitBox


amaliujia commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1009792973


##
connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala:
##
@@ -175,6 +178,28 @@ package object dsl {
   }
 
   object plans { // scalastyle:ignore
+    implicit class DslMockRemoteSession(val session: MockRemoteSession) {
+      def range(
+          start: Option[Int],
+          end: Int,
+          step: Option[Int],
+          numPartitions: Option[Int]): Relation = {
+        val range = proto.Range.newBuilder()

Review Comment:
   Note that I need to keep `proto.Range`, since `Range` itself is a built-in Scala 
class, so we need the `proto.` prefix to disambiguate in this special case.






[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

2022-10-31 Thread GitBox


amaliujia commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1009085869


##
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##
@@ -173,6 +175,15 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
 comparePlans(connectPlan2, sparkPlan2)
   }
 
+  test("Test Range") {
+    comparePlansDatasetLong(connect.range(None, 10, None, None), spark.range(10))

Review Comment:
   Oh, `.toDF()` just converts things into a DataFrame. 
   
   It has removed `comparePlansDatasetLong`.






[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

2022-10-31 Thread GitBox


amaliujia commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1009077783


##
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##
@@ -173,6 +175,15 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
 comparePlans(connectPlan2, sparkPlan2)
   }
 
+  test("Test Range") {
+    comparePlansDatasetLong(connect.range(None, 10, None, None), spark.range(10))

Review Comment:
   Let me try to see if it gives an exact plan.
   
   Another idea: we could just compare the results through `collect()`, so that we 
do not compare the plans in this case.






[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

2022-10-23 Thread GitBox


amaliujia commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1002851496


##
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##
@@ -207,3 +208,23 @@ message Sample {
 int64 seed = 1;
   }
 }
+
+// Relation of type [[Range]] that generates a sequence of integers.
+message Range {
+  // Optional. Default value = 0
+  int32 start = 1;
+  int32 end = 2;

Review Comment:
   Yeah, this becomes tricky. Ultimately we can wrap every such field in a message 
so we always know whether the field is set. However, that might complicate the 
entire proto too much. Let's have a discussion about that.
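   For what it's worth, newer protoc releases (3.15 and later) support a 
lighter-weight alternative: the `optional` label on a proto3 scalar gives it 
explicit presence without a wrapper message. A sketch, assuming the build's protoc 
is new enough to rely on this:
   
   ```protobuf
   // Sketch: proto3 explicit presence via the `optional` label (protoc 3.15+).
   // An unset optional scalar is distinguishable from one explicitly set to 0.
   message Range {
     optional int32 start = 1;  // has_start == false -> use Spark's default (0)
     int32 end = 2;             // always set by the client
     optional int32 step = 3;   // has_step == false -> use Spark's default (1)
   }
   ```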






[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

2022-10-23 Thread GitBox


amaliujia commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1002850714


##
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectSessionBasedSuite.scala:
##
@@ -32,7 +35,9 @@ trait SparkConnectPlanTestWithSparkSession extends SharedSparkSession with Spark
   override def getSession(): SparkSession = spark
 }
 
-class SparkConnectDeduplicateSuite extends SparkConnectPlanTestWithSparkSession {
+class SparkConnectSessionBasedSuite extends SparkConnectPlanTestWithSparkSession {

Review Comment:
   I am working on a refactoring PR, but it is blocked by some gaps between the 
current Connect proto and the DataFrame API. Let's have a discussion on how to 
unblock it.
   
   With the refactoring, I expect we will have only one suite that tests the 
Connect proto against the DataFrame API.
   
   Until that is done, we have two suites: one tests against the DataFrame API and 
the other against Catalyst.





