subject:"\[PR\] \[SPARK\-46812\]\[CONNECT\]\[PYTHON\] Make mapInPandas \/ mapInArrow support ResourceProfile \[spark\]"

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-04-01 Thread via GitHub



HyukjinKwon closed pull request #45232: [SPARK-46812][CONNECT][PYTHON] Make 
mapInPandas / mapInArrow support ResourceProfile
URL: https://github.com/apache/spark/pull/45232


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-04-01 Thread via GitHub



HyukjinKwon commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-2030829152

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-04-01 Thread via GitHub



wbo4958 commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-2030792790

   Hi @HyukjinKwon, Could you help merge it? Thx


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-04-01 Thread via GitHub



wbo4958 commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-2030753074

   > Please make sure that the follow work does not get lost.
   
   Sure, I will get it done.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-04-01 Thread via GitHub



wbo4958 commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-2029532177

   > In contrast to the regular spark API this implementation doesn't manage 
the lifecycle of of the resource request. Can you create a follow up Jira that 
removes the resource request from the spark context again?
   
   Hi @grundprinzip. Done, created a 
[JIRA](https://issues.apache.org/jira/browse/SPARK-47668) task.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-31 Thread via GitHub



wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1545872296


##
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##
@@ -375,6 +375,9 @@ message ExecutePlanResponse {
 // Response type informing if the stream is complete in reattachable 
execution.
 ResultComplete result_complete = 14;
 
+// Response for command that creates ResourceProfile.
+CreateResourceProfileCommandResult create_resource_profile_command_result 
= 20;

Review Comment:
   Hi @grundprinzip, Thx for reviewing, Changed it to 17. Could you help review 
it again?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-27 Thread via GitHub



grundprinzip commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1540499756


##
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##
@@ -375,6 +375,9 @@ message ExecutePlanResponse {
 // Response type informing if the stream is complete in reattachable 
execution.
 ResultComplete result_complete = 14;
 
+// Response for command that creates ResourceProfile.
+CreateResourceProfileCommandResult create_resource_profile_command_result 
= 20;

Review Comment:
   Isn't the next ID 17 here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-20 Thread via GitHub



wbo4958 commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-2009170881

   Hi @grundprinzip, @HyukjinKwon, @zhengruifeng, This PR has been there for a 
while, could you help review/merge it? Thx


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-13 Thread via GitHub



wbo4958 commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-1996052736

   Hi @grundprinzip, I would be grateful if you could kindly take another look 
at this PR, Thx.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-11 Thread via GitHub



wbo4958 commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-1989639850

   Hi @grundprinzip, Could you help review it again?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-11 Thread via GitHub



wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1519241310


##
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -892,6 +893,9 @@ message MapPartitions {
 
   // (Optional) Whether to use barrier mode execution or not.
   optional bool is_barrier = 3;
+
+  // (Optional) ResourceProfile id used for the stage level scheduling.
+  optional int32 profile_id = 4;

Review Comment:
   I'm not quite following you about "by uuids".
   
   So the basic implementation is
   
   1. The client creates ResourceProfile
   2. if the profile ID of ResourceProfile is accessed for the first time, then 
the client will ask to create ResourceProfile and add it to the 
ResourceProfileManager on the server side, and the server side will return the 
profile ID to the client which will set the id to the ResourceProfile on the 
client side.
   3. The internal mapInPandas/mapInArrow will just use the ResourceProfile id, 
and the server side can extract the ResourceProfile from ResourceProfileManager 
according to the id.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-11 Thread via GitHub



wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1519241310


##
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -892,6 +893,9 @@ message MapPartitions {
 
   // (Optional) Whether to use barrier mode execution or not.
   optional bool is_barrier = 3;
+
+  // (Optional) ResourceProfile id used for the stage level scheduling.
+  optional int32 profile_id = 4;

Review Comment:
   Hi @grundprinzip,
   
   I'm not quite following you about "by uuids".
   
   So the basic implementation is
   
   1. The client creates ResourceProfile
   2. if the profile ID of ResourceProfile is accessed for the first time, then 
the client will ask to create ResourceProfile and add it to the 
ResourceProfileManager on the server side, and the server side will return the 
profile ID to the client which will set the id to the ResourceProfile on the 
client side.
   3. The internal mapInPandas/mapInArrow will just use the ResourceProfile id, 
and the server side can extract the ResourceProfile from ResourceProfileManager 
according to the id.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-11 Thread via GitHub



wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1519235937


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectBuildResourceProfileHandler.scala:
##
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connect.service
+
+import scala.jdk.CollectionConverters.MapHasAsScala
+
+import io.grpc.stub.StreamObserver
+
+import org.apache.spark.connect.proto
+import org.apache.spark.internal.Logging
+import org.apache.spark.resource.{ExecutorResourceRequest, ResourceProfile, 
TaskResourceProfile, TaskResourceRequest}
+
+class SparkConnectBuildResourceProfileHandler(
+responseObserver: StreamObserver[proto.BuildResourceProfileResponse])
+extends Logging {
+
+  /**
+   * transform the spark connect ResourceProfile to spark ResourceProfile
+   * @param rp
+   *   Spark connect ResourceProfile
+   * @return
+   *   the Spark ResourceProfile
+   */
+  private def transformResourceProfile(rp: proto.ResourceProfile): 
ResourceProfile = {
+val ereqs = rp.getExecutorResourcesMap.asScala.map { case (name, res) =>
+  name -> new ExecutorResourceRequest(
+res.getResourceName,
+res.getAmount,
+res.getDiscoveryScript,
+res.getVendor)
+}.toMap
+val treqs = rp.getTaskResourcesMap.asScala.map { case (name, res) =>
+  name -> new TaskResourceRequest(res.getResourceName, res.getAmount)
+}.toMap
+
+if (ereqs.isEmpty) {
+  new TaskResourceProfile(treqs)
+} else {
+  new ResourceProfile(ereqs, treqs)
+}
+  }
+
+  def handle(request: proto.BuildResourceProfileRequest): Unit = {
+val holder = SparkConnectService
+  .getOrCreateIsolatedSession(request.getUserContext.getUserId, 
request.getSessionId)
+
+val rp = transformResourceProfile(request.getProfile)
+
+val session = holder.session
+session.sparkContext.resourceProfileManager.addResourceProfile(rp)

Review Comment:
   Yeah, Both ResourceProfile and ResourceProfileManager don't have the 
cleanup. If you think we need to cleanup, we can file another PR for it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-11 Thread via GitHub



wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1519232917


##
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##
@@ -1011,5 +1039,7 @@ service SparkConnectService {
 
   // FetchErrorDetails retrieves the matched exception with details based on a 
provided error id.
   rpc FetchErrorDetails(FetchErrorDetailsRequest) returns 
(FetchErrorDetailsResponse) {}
-}
 
+  // Build ResourceProfile and get the profile id
+  rpc BuildResourceProfile(BuildResourceProfileRequest) returns 
(BuildResourceProfileResponse) {}

Review Comment:
   Hi @grundprinzip, Really good suggestion, just made the newest commit to 
move it to the command.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-11 Thread via GitHub



wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1519232536


##
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -892,6 +893,9 @@ message MapPartitions {
 
   // (Optional) Whether to use barrier mode execution or not.
   optional bool is_barrier = 3;
+
+  // (Optional) ResourceProfile id used for the stage level scheduling.
+  optional int32 profile_id = 4;

Review Comment:
   Hi @grundprinzip, The user still needs to know the exact ResourceProfile id, 
if we attach resource profile in the call, seems we can't get id in this call.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-08 Thread via GitHub



grundprinzip commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1518491571


##
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -892,6 +893,9 @@ message MapPartitions {
 
   // (Optional) Whether to use barrier mode execution or not.
   optional bool is_barrier = 3;
+
+  // (Optional) ResourceProfile id used for the stage level scheduling.
+  optional int32 profile_id = 4;

Review Comment:
   Shouldn't these be uuids? Just to make sure that we have intentional 
difference and no off by one?



##
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##
@@ -1011,5 +1039,7 @@ service SparkConnectService {
 
   // FetchErrorDetails retrieves the matched exception with details based on a 
provided error id.
   rpc FetchErrorDetails(FetchErrorDetailsRequest) returns 
(FetchErrorDetailsResponse) {}
-}
 
+  // Build ResourceProfile and get the profile id
+  rpc BuildResourceProfile(BuildResourceProfileRequest) returns 
(BuildResourceProfileResponse) {}

Review Comment:
   Why does this need to be an extra RPC and not just a command?



##
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -892,6 +893,9 @@ message MapPartitions {
 
   // (Optional) Whether to use barrier mode execution or not.
   optional bool is_barrier = 3;
+
+  // (Optional) ResourceProfile id used for the stage level scheduling.
+  optional int32 profile_id = 4;

Review Comment:
   Why do we need an extra RPC for that? Can't you attach the resource profile 
directly to this call?



##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectBuildResourceProfileHandler.scala:
##
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connect.service
+
+import scala.jdk.CollectionConverters.MapHasAsScala
+
+import io.grpc.stub.StreamObserver
+
+import org.apache.spark.connect.proto
+import org.apache.spark.internal.Logging
+import org.apache.spark.resource.{ExecutorResourceRequest, ResourceProfile, 
TaskResourceProfile, TaskResourceRequest}
+
+class SparkConnectBuildResourceProfileHandler(
+responseObserver: StreamObserver[proto.BuildResourceProfileResponse])
+extends Logging {
+
+  /**
+   * transform the spark connect ResourceProfile to spark ResourceProfile
+   * @param rp
+   *   Spark connect ResourceProfile
+   * @return
+   *   the Spark ResourceProfile
+   */
+  private def transformResourceProfile(rp: proto.ResourceProfile): 
ResourceProfile = {
+val ereqs = rp.getExecutorResourcesMap.asScala.map { case (name, res) =>
+  name -> new ExecutorResourceRequest(
+res.getResourceName,
+res.getAmount,
+res.getDiscoveryScript,
+res.getVendor)
+}.toMap
+val treqs = rp.getTaskResourcesMap.asScala.map { case (name, res) =>
+  name -> new TaskResourceRequest(res.getResourceName, res.getAmount)
+}.toMap
+
+if (ereqs.isEmpty) {
+  new TaskResourceProfile(treqs)
+} else {
+  new ResourceProfile(ereqs, treqs)
+}
+  }
+
+  def handle(request: proto.BuildResourceProfileRequest): Unit = {
+val holder = SparkConnectService
+  .getOrCreateIsolatedSession(request.getUserContext.getUserId, 
request.getSessionId)
+
+val rp = transformResourceProfile(request.getProfile)
+
+val session = holder.session
+session.sparkContext.resourceProfileManager.addResourceProfile(rp)

Review Comment:
   How are these cleaned up?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-08 Thread via GitHub



tgravescs commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-1985925035

   changes look fine to me 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-07 Thread via GitHub



wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1517085186


##
python/pyspark/resource/tests/test_connect_resources.py:
##
@@ -0,0 +1,46 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import unittest
+
+from pyspark.resource import ResourceProfileBuilder, TaskResourceRequests
+from pyspark.sql import SparkSession
+
+
+class ResourceProfileTests(unittest.TestCase):
+def test_profile_before_sc_for_connect(self):
+rpb = ResourceProfileBuilder()
+treqs = TaskResourceRequests().cpus(2)
+# no exception for building ResourceProfile
+rp = rpb.require(treqs).build

Review Comment:
   Yes, Added the checking in the test.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-07 Thread via GitHub



wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1517085041


##
python/pyspark/resource/tests/test_connect_resources.py:
##
@@ -0,0 +1,46 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import unittest
+
+from pyspark.resource import ResourceProfileBuilder, TaskResourceRequests
+from pyspark.sql import SparkSession
+
+
+class ResourceProfileTests(unittest.TestCase):
+def test_profile_before_sc_for_connect(self):

Review Comment:
   Added the error checking.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-07 Thread via GitHub



wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1517084819


##
python/pyspark/resource/profile.py:
##
@@ -114,14 +122,23 @@ def id(self) -> int:
 int
 A unique id of this :class:`ResourceProfile`
 """
+with self._lock:
+if self._id is None:
+if self._java_resource_profile is not None:
+self._id = self._java_resource_profile.id()
+else:
+from pyspark.sql import is_remote
 
-if self._java_resource_profile is not None:
-return self._java_resource_profile.id()
-else:
-raise RuntimeError(
-"SparkContext must be created to get the id, get the id "
-"after adding the ResourceProfile to an RDD"
-)
+if is_remote():
+from pyspark.sql.connect.resource.profile import 
ResourceProfile
+
+rp = ResourceProfile(

Review Comment:
   Done to add a comment



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-07 Thread via GitHub



wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1517084721


##
python/pyspark/resource/profile.py:
##
@@ -114,14 +122,23 @@ def id(self) -> int:
 int
 A unique id of this :class:`ResourceProfile`
 """
+with self._lock:
+if self._id is None:
+if self._java_resource_profile is not None:
+self._id = self._java_resource_profile.id()
+else:
+from pyspark.sql import is_remote
 
-if self._java_resource_profile is not None:
-return self._java_resource_profile.id()
-else:
-raise RuntimeError(
-"SparkContext must be created to get the id, get the id "
-"after adding the ResourceProfile to an RDD"
-)
+if is_remote():
+from pyspark.sql.connect.resource.profile import 
ResourceProfile
+
+rp = ResourceProfile(

Review Comment:
   Yes, You're right, for the remote mode, the ResourceProfile is just like the 
proxy of the Spark ResourceProfile on the server side.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-07 Thread via GitHub



tgravescs commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1516375405


##
python/pyspark/resource/tests/test_connect_resources.py:
##
@@ -0,0 +1,46 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import unittest
+
+from pyspark.resource import ResourceProfileBuilder, TaskResourceRequests
+from pyspark.sql import SparkSession
+
+
+class ResourceProfileTests(unittest.TestCase):
+def test_profile_before_sc_for_connect(self):
+rpb = ResourceProfileBuilder()
+treqs = TaskResourceRequests().cpus(2)
+# no exception for building ResourceProfile
+rp = rpb.require(treqs).build

Review Comment:
   can the user properly see the executor/task resource requests back from the 
resource profile?  



##
python/pyspark/resource/tests/test_connect_resources.py:
##
@@ -0,0 +1,46 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import unittest
+
+from pyspark.resource import ResourceProfileBuilder, TaskResourceRequests
+from pyspark.sql import SparkSession
+
+
+class ResourceProfileTests(unittest.TestCase):
+def test_profile_before_sc_for_connect(self):

Review Comment:
   Any tests needed for error checking/failure cases and make sure that gets to 
user properly?



##
python/pyspark/resource/profile.py:
##
@@ -114,14 +122,23 @@ def id(self) -> int:
 int
 A unique id of this :class:`ResourceProfile`
 """
+with self._lock:
+if self._id is None:
+if self._java_resource_profile is not None:
+self._id = self._java_resource_profile.id()
+else:
+from pyspark.sql import is_remote
 
-if self._java_resource_profile is not None:
-return self._java_resource_profile.id()
-else:
-raise RuntimeError(
-"SparkContext must be created to get the id, get the id "
-"after adding the ResourceProfile to an RDD"
-)
+if is_remote():
+from pyspark.sql.connect.resource.profile import 
ResourceProfile
+
+rp = ResourceProfile(

Review Comment:
   I haven't reviewed much of the connect code so probaly missing something.  
we do this to get an id but then we throw it away?  I guess this is a spark 
connect ResourceProfile and when its actually created its stored on the spark 
context side so you just need the id to reference?  Adding a comment here would 
be good 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-06 Thread via GitHub



wbo4958 commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-1982322729

   > > Does this PR introduce any user-facing change?
   > > Yes, Users can pass ResourceProfile to mapInPandas/mapInArrow through 
the connect pysprark client.
   > 
   > I think you are adding the ResourceProfile api to spark connect for 
anything to use, correct? I guess since Spark Connect doesn't have SparkContext 
support its only usable by these apis? It would be nice to have more info in 
the description about what you are adding and if they match the current python 
api's exactly?
   
   There are no APIs changed/added for ResourceProfile, this PR just changed 
ResourceProfile internally to support Spark Connect. Like, 
[here](https://github.com/apache/spark/pull/45232/files/30531aeb7b6eb5f2235f1dabf7b568f3cf49d841#diff-245bd4d0d3aea466e32424806642d225ffa4f07e75ef1d45f7fd7a68fbb3e4c3R118-R141)
 
[here](https://github.com/apache/spark/pull/45232/files/30531aeb7b6eb5f2235f1dabf7b568f3cf49d841#diff-e2439904a861b6dd69efec067e87ff43070e3b673911ab286aca382897954c2bR167-R469)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-06 Thread via GitHub



wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1515490084


##
dev/sparktestsupport/modules.py:
##
@@ -554,6 +554,7 @@ def __hash__(self):
 "pyspark.resource.profile",
 # unittests
 "pyspark.resource.tests.test_resources",
+"pyspark.resource.tests.test_connect_resources",

Review Comment:
   This pull request already includes 
[pyspark.sql.tests.connect.test_resources](https://github.com/apache/spark/pull/45232/files/30531aeb7b6eb5f2235f1dabf7b568f3cf49d841#diff-1ebc67a99155114aa6bced912ce7942956ce784eeb59aadab75b82814684bd27R1031)
 to test the general mapInPandas/mapInArrow functionality with ResourceProfile. 
On the other hand, `pyspark.resource.tests.test_connect_resources` is 
specifically for testing special cases like creating a ResourceProfile before 
establishing a remote session. Therefore, it seems appropriate to keep the 
tests in their respective locations.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-06 Thread via GitHub



wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1515485716


##
python/pyspark/sql/connect/resource/profile.py:
##
@@ -0,0 +1,69 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from typing import Optional, Dict
+
+from pyspark.resource import ExecutorResourceRequest, TaskResourceRequest
+
+import pyspark.sql.connect.proto as pb2
+
+
+class ResourceProfile:

Review Comment:
   This ResourceProfile is not user-facing; it is primarily used internally to 
create the resource profile on the server side and retrieve the associated 
resource profile ID.
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-06 Thread via GitHub



zhengruifeng commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1515455169


##
python/pyspark/sql/connect/resource/profile.py:
##
@@ -0,0 +1,69 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from typing import Optional, Dict
+
+from pyspark.resource import ExecutorResourceRequest, TaskResourceRequest
+
+import pyspark.sql.connect.proto as pb2
+
+
+class ResourceProfile:

Review Comment:
   is this class user-facing?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-06 Thread via GitHub



zhengruifeng commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1515448887


##
dev/sparktestsupport/modules.py:
##
@@ -554,6 +554,7 @@ def __hash__(self):
 "pyspark.resource.profile",
 # unittests
 "pyspark.resource.tests.test_resources",
+"pyspark.resource.tests.test_connect_resources",

Review Comment:
   this test is for spark connect, I think we should move it to Module 
`pyspark_connect`?
   
   maybe we can move the test cases in it to 
`pyspark.sql.tests.connect.test_resources`?



##
dev/sparktestsupport/modules.py:
##
@@ -554,6 +554,7 @@ def __hash__(self):
 "pyspark.resource.profile",
 # unittests
 "pyspark.resource.tests.test_resources",
+"pyspark.resource.tests.test_connect_resources",

Review Comment:
   this test is for spark connect, I think we should move it to Module 
`pyspark_connect`?
   
   or maybe we can move the test cases in it to 
`pyspark.sql.tests.connect.test_resources`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-06 Thread via GitHub



tgravescs commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-1980935581

   >Does this PR introduce any user-facing change?
   > Yes, Users can pass ResourceProfile to mapInPandas/mapInArrow through the 
connect pysprark client.
   
   I think you are adding the ResourceProfile api to spark connect for anything 
to use, correct?   I guess since Spark Connect doesn't have SparkContext 
support its only usable by these apis? It would be nice to have more info 
in the description about what you are adding and if they match the current 
python api's exactly?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-06 Thread via GitHub



HyukjinKwon commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1514004559


##
python/pyspark/resource/profile.py:
##
@@ -114,14 +122,26 @@ def id(self) -> int:
 int
 A unique id of this :class:`ResourceProfile`
 """
-
-if self._java_resource_profile is not None:
-return self._java_resource_profile.id()
-else:
-raise RuntimeError(
-"SparkContext must be created to get the id, get the id "
-"after adding the ResourceProfile to an RDD"
-)
+with self._lock:
+if self._id is None:
+if self._java_resource_profile is not None:
+self._id = self._java_resource_profile.id()
+else:
+from pyspark.sql import is_remote
+
+if is_remote():
+from pyspark.sql.connect.resource.profile import 
_ResourceProfile
+
+rp = _ResourceProfile(
+self._executor_resource_requests, 
self._task_resource_requests
+)
+self._id = rp.id

Review Comment:
   Ah, okay. you're getting it from `SparkConnectBuildResourceProfileHandler`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-06 Thread via GitHub



wbo4958 commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-1980292725

   Hi @HyukjinKwon, Could you help review again, thx very much.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-06 Thread via GitHub



wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1513993480


##
python/pyspark/sql/connect/resource/profile.py:
##
@@ -0,0 +1,69 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from typing import Optional, Dict
+
+from pyspark.resource import ExecutorResourceRequest, TaskResourceRequest
+
+import pyspark.sql.connect.proto as pb2
+
+
+class _ResourceProfile:

Review Comment:
   Done. Thx for the info



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-06 Thread via GitHub



wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1513991688


##
python/pyspark/resource/profile.py:
##
@@ -114,14 +122,26 @@ def id(self) -> int:
 int
 A unique id of this :class:`ResourceProfile`
 """
-
-if self._java_resource_profile is not None:
-return self._java_resource_profile.id()
-else:
-raise RuntimeError(
-"SparkContext must be created to get the id, get the id "
-"after adding the ResourceProfile to an RDD"
-)
+with self._lock:
+if self._id is None:
+if self._java_resource_profile is not None:
+self._id = self._java_resource_profile.id()
+else:
+from pyspark.sql import is_remote
+
+if is_remote():
+from pyspark.sql.connect.resource.profile import 
_ResourceProfile
+
+rp = _ResourceProfile(
+self._executor_resource_requests, 
self._task_resource_requests
+)
+self._id = rp.id

Review Comment:
   the `self._id` sets to None by default, 
https://github.com/apache/spark/pull/45232/files/95a75af98715a750b78529c1a04867206a99dfe6#diff-c570ce2f67f7eef5e5f223c6c638903d2dd364bc2ab36c59a0e7132c742682ffR65
   
   while the `rp.id` is set from 
https://github.com/apache/spark/pull/45232/files/95a75af98715a750b78529c1a04867206a99dfe6#diff-c570ce2f67f7eef5e5f223c6c638903d2dd364bc2ab36c59a0e7132c742682ffR65



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-05 Thread via GitHub



wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1513987011


##
python/pyspark/resource/profile.py:
##
@@ -114,14 +122,26 @@ def id(self) -> int:
 int
 A unique id of this :class:`ResourceProfile`
 """
-
-if self._java_resource_profile is not None:
-return self._java_resource_profile.id()
-else:
-raise RuntimeError(
-"SparkContext must be created to get the id, get the id "
-"after adding the ResourceProfile to an RDD"
-)
+with self._lock:
+if self._id is None:
+if self._java_resource_profile is not None:
+self._id = self._java_resource_profile.id()
+else:
+from pyspark.sql import is_remote
+
+if is_remote():
+from pyspark.sql.connect.resource.profile import 
_ResourceProfile
+
+rp = _ResourceProfile(
+self._executor_resource_requests, 
self._task_resource_requests
+)
+self._id = rp.id
+else:
+raise RuntimeError(
+"SparkContext must be created to get the id, get 
the id "
+"after adding the ResourceProfile to an RDD"

Review Comment:
   The above 
[if](https://github.com/apache/spark/pull/45232/files/95a75af98715a750b78529c1a04867206a99dfe6#diff-245bd4d0d3aea466e32424806642d225ffa4f07e75ef1d45f7fd7a68fbb3e4c3R132)
 is for connect case, while this branch is for the normal RDD usage. Yeah, 
you're right, Sql also has supported ResourceProfile, so let me change it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-04 Thread via GitHub



HyukjinKwon commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-1976100589

   Looks fine in general


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-04 Thread via GitHub



HyukjinKwon commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1510840968


##
python/pyspark/resource/profile.py:
##
@@ -114,14 +122,26 @@ def id(self) -> int:
 int
 A unique id of this :class:`ResourceProfile`
 """
-
-if self._java_resource_profile is not None:
-return self._java_resource_profile.id()
-else:
-raise RuntimeError(
-"SparkContext must be created to get the id, get the id "
-"after adding the ResourceProfile to an RDD"
-)
+with self._lock:
+if self._id is None:
+if self._java_resource_profile is not None:
+self._id = self._java_resource_profile.id()
+else:
+from pyspark.sql import is_remote
+
+if is_remote():
+from pyspark.sql.connect.resource.profile import 
_ResourceProfile
+
+rp = _ResourceProfile(
+self._executor_resource_requests, 
self._task_resource_requests
+)
+self._id = rp.id

Review Comment:
   Sorry if I missed sth but where is this id being set?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-04 Thread via GitHub



HyukjinKwon commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1510824227


##
python/pyspark/sql/connect/resource/profile.py:
##
@@ -0,0 +1,69 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from typing import Optional, Dict
+
+from pyspark.resource import ExecutorResourceRequest, TaskResourceRequest
+
+import pyspark.sql.connect.proto as pb2
+
+
+class _ResourceProfile:

Review Comment:
   It's fine without leading `_`. The whole `connect` package isn't supposed to 
be external.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-04 Thread via GitHub



HyukjinKwon commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1510818206


##
python/pyspark/resource/profile.py:
##
@@ -114,14 +122,26 @@ def id(self) -> int:
 int
 A unique id of this :class:`ResourceProfile`
 """
-
-if self._java_resource_profile is not None:
-return self._java_resource_profile.id()
-else:
-raise RuntimeError(
-"SparkContext must be created to get the id, get the id "
-"after adding the ResourceProfile to an RDD"
-)
+with self._lock:
+if self._id is None:
+if self._java_resource_profile is not None:
+self._id = self._java_resource_profile.id()
+else:
+from pyspark.sql import is_remote
+
+if is_remote():
+from pyspark.sql.connect.resource.profile import 
_ResourceProfile
+
+rp = _ResourceProfile(
+self._executor_resource_requests, 
self._task_resource_requests
+)
+self._id = rp.id
+else:
+raise RuntimeError(
+"SparkContext must be created to get the id, get 
the id "
+"after adding the ResourceProfile to an RDD"

Review Comment:
   RDD is hidden in Spark Connect concept. Let's probably avoid this term.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-04 Thread via GitHub



HyukjinKwon commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1510814720


##
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##
@@ -967,6 +967,34 @@ message FetchErrorDetailsResponse {
   }
 }
 
+message BuildResourceProfileRequest {

Review Comment:
   This would need @grundprinzip and @hvanhovell 's review.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-02-28 Thread via GitHub



wbo4958 commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-1970509831

   Hi @tgravescs, This PR changed ResourceProfile a little bit to support 
connect, Could you help review it? Thx very much.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-02-25 Thread via GitHub



wbo4958 commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-1963270400

   Hi @tgravescs @WeichenXu123 @zhengruifeng @Ngone51, Could you also please 
help review it. Thx.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-02-25 Thread via GitHub



wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1502014261


##
python/pyspark/resource/profile.py:
##
@@ -99,6 +99,11 @@ def __init__(
 _exec_req: Optional[Dict[str, ExecutorResourceRequest]] = None,

Review Comment:
   Thx, Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-02-25 Thread via GitHub



wbo4958 commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-1963263080

   ## With dynamic allocation enabled.
   
   ``` bash
   start-connect-server.sh --master spark://192.168.0.106:7077 \
  --jars jars/spark-connect_2.13-4.0.0-SNAPSHOT.jar  \
 --conf spark.executor.cores=4 \
 --conf spark.task.cpus=1 \
 --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.maxExecutors=1 \
   ```
   
   The above command enables the dynamic allocation and the max executors 
required is set to 1 in order to test. And then launch the spark connect 
pyspark client by
   
   ``` bash
   pyspark --remote "sc://localhost"
   ```
   
   ### TaskResourceProfile without any specific executor request information
   
   Test code,
   
   ```python
   from pyspark.resource import ExecutorResourceRequests, TaskResourceRequests, 
ResourceProfileBuilder
   
   def filter_func(iterator):
   for pdf in iterator:
   yield pdf
   
   df = spark.range(0, 100, 1, 4)
   
   treqs = TaskResourceRequests().cpus(3)
   rp = ResourceProfileBuilder().require(treqs).build
   df.repartition(3).mapInArrow(lambda iter: iter, df.schema, False, 
rp).collect()
   ```
   
   The rp refers to the TaskResourceProfile without any specific executor 
request information, thus the executor information will utilize the default 
values from Default ResourceProfile (executor.cores=4).
   
   The above code will require an extra executor which will have the same 
`executor.cores/memory` as the default ResourceProfile.
   
   
![0](https://github.com/apache/spark/assets/1320706/abd5a1ad-8564-490f-abc5-a45621496040)
   
   
![1](https://github.com/apache/spark/assets/1320706/340954c5-0dd8-42f3-9368-5dccdd3d7b62)
   
   
![2](https://github.com/apache/spark/assets/1320706/7280b9da-6346-4f81-853a-81e37aaecb9d)
   
   
   
   ### Different executor request information 
   
   ```python
   from pyspark.resource import ExecutorResourceRequests, TaskResourceRequests, 
ResourceProfileBuilder
   
   def filter_func(iterator):
   for pdf in iterator:
   yield pdf
   
   df = spark.range(0, 100, 1, 4)
   
   ereqs = ExecutorResourceRequests().cores(6)
   treqs = TaskResourceRequests().cpus(5)
   rp = ResourceProfileBuilder().require(treqs).require(ereqs).build
   df.repartition(3).mapInArrow(lambda iter: iter, df.schema, False, 
rp).collect()
   ```
   
   
![5](https://github.com/apache/spark/assets/1320706/eb104194-c156-4319-ac48-f1d199304501)
   
   
![6](https://github.com/apache/spark/assets/1320706/a514f1a7-4ffd-497d-897f-6b5f786ea86b)
   
   
![7](https://github.com/apache/spark/assets/1320706/7ade7aec-992e-44ed-ae3f-70b25c208243)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-02-25 Thread via GitHub



wbo4958 commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-1963261901

   # Manual tests
   
   
   The manual tests were conducted on a spark Standalone cluster with only 1 
worker which has 6 cpu cores.
   
   ## With dynamic allocation disabled.
   
   ``` bash
   start-connect-server.sh --master spark://192.168.0.106:7077 \
  --jars jars/spark-connect_2.13-4.0.0-SNAPSHOT.jar  \
 --conf spark.executor.cores=4 \
 --conf spark.task.cpus=1 \
 --conf spark.dynamicAllocation.enabled=false
   ```
   
   The above command starts the connect server and it requires 1 executor with 
4 CPU cores, and the default `task.cpus = 1`, so the default tasks parallelism 
is 4 at a time.
   
   And then launch the spark connect pyspark client by
   
   ``` bash
   pyspark --remote "sc://localhost"
   ```
   
   1. `task.cores=1`
   
   Test code:
   
   ``` python
   from pyspark.resource import ExecutorResourceRequests, TaskResourceRequests, 
ResourceProfileBuilder
   
   def filter_func(iterator):
   for pdf in iterator:
   yield pdf
   
   df = spark.range(0, 100, 1, 6)
   from pyspark.resource import ExecutorResourceRequests, TaskResourceRequests, 
ResourceProfileBuilder
   treqs = TaskResourceRequests().cpus(1)
   rp = ResourceProfileBuilder().require(treqs).build
   df.repartition(3).mapInArrow(lambda iter: iter, df.schema, False, 
rp).collect()
   ```
   
   When the required `task.cpus=1`, `executor.cores=4` (No executor resource 
specified, use the default one), there will be 4 tasks running for rp at the 
same time.
   
   The entire Spark application consists of a single Spark job that will be 
divided into two stages. The first shuffle stage comprises 6 tasks, the first 4 
tasks will be executed simultaneously, then the last 2 tasks.
   
   
![1](https://github.com/apache/spark/assets/1320706/720fef7b-3a72-456f-9c60-01b86011ec84)
   
   The second ResultStage comprises 3 tasks, all of which will be executed 
simultaneously since the required `task.cpus` is  1.
   
   
![2](https://github.com/apache/spark/assets/1320706/0804d2af-e25d-4f54-906e-cc367a6aa2eb)
   
   2. `task.cores=2`
   
   Test code,
   
   ``` python
   from pyspark.resource import ExecutorResourceRequests, TaskResourceRequests, 
ResourceProfileBuilder
   
   def filter_func(iterator):
   for pdf in iterator:
   yield pdf
   
   df = spark.range(0, 100, 1, 6)
   from pyspark.resource import ExecutorResourceRequests, TaskResourceRequests, 
ResourceProfileBuilder
   treqs = TaskResourceRequests().cpus(2)
   rp = ResourceProfileBuilder().require(treqs).build
   df.repartition(3).mapInArrow(lambda iter: iter, df.schema, False, 
rp).collect()
   ```
   
   When the required `task.cpus=2`, `executor.cores=4` (No executor resource 
specified, use the default one), there will be 2 tasks running for rp.
   
   The first shuffle stage behaves the same as the first one. 
   
   The second ResultStage comprises 3 tasks, so the first 2 tasks will be 
running at a time, and then execute the last task.
   
   
![3](https://github.com/apache/spark/assets/1320706/870fedb2-52f0-4a54-9b23-f37b9d2a2228)
   
   3. `task.cores=3`
   
   Test code,
   
   ``` python
   from pyspark.resource import ExecutorResourceRequests, TaskResourceRequests, 
ResourceProfileBuilder
   
   def filter_func(iterator):
   for pdf in iterator:
   yield pdf
   
   df = spark.range(0, 100, 1, 6)
   from pyspark.resource import ExecutorResourceRequests, TaskResourceRequests, 
ResourceProfileBuilder
   treqs = TaskResourceRequests().cpus(3)
   rp = ResourceProfileBuilder().require(treqs).build
   df.repartition(3).mapInArrow(lambda iter: iter, df.schema, False, 
rp).collect()
   ```
   
   When the required `task.cpus=3`, `executor.cores=4` (No executor resource 
specified, use the default one), there will be 1 task running for rp.
   
   The first shuffle stage behaves the same as the first one. 
   
   The second ResultStage comprises 3 tasks, all of which will be running 
serially.
   
   
![4](https://github.com/apache/spark/assets/1320706/a6b730ab-99d5-4563-a853-0682fcd3a10d)
   
   
   4. `task.cores=5`
   
   ``` python
   from pyspark.resource import ExecutorResourceRequests, TaskResourceRequests, 
ResourceProfileBuilder
   
   def filter_func(iterator):
   for pdf in iterator:
   yield pdf
   
   df = spark.range(0, 100, 1, 6)
   from pyspark.resource import ExecutorResourceRequests, TaskResourceRequests, 
ResourceProfileBuilder
   treqs = TaskResourceRequests().cpus(5)
   rp = ResourceProfileBuilder().require(treqs).build
   df.repartition(3).mapInArrow(lambda iter: iter, df.schema, False, 
rp).collect()
   ```
   
   exception happened.
   ``` console
   Traceback (most recent call last):
 File "", line 1, in 
 File 
"/home/bobwang/github/mytools/spark.home/spark-4.0.0-SNAPSHOT-bin-wbo4958-spark/python/pyspark/sql/connect/dataframe.py",
 line 1763, in collect

Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-02-25 Thread via GitHub



HyukjinKwon commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1501943441


##
python/pyspark/resource/profile.py:
##
@@ -99,6 +99,11 @@ def __init__(
 _exec_req: Optional[Dict[str, ExecutorResourceRequest]] = None,

Review Comment:
   Let's add `versionchanged` directive at `ResourceProfile` docstring. e,g.,:
   
   ```
   .. versionchanged:: 4.0.0
   Supports Spark Connect.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-02-23 Thread via GitHub



wbo4958 opened a new pull request, #45232:
URL: https://github.com/apache/spark/pull/45232

   ### What changes were proposed in this pull request?
   
   Support stage-level scheduling for PySpark connect DataFrame APIs 
(mapInPandas and mapInArrow).
   
   
   ### Why are the changes needed?
   
   This is the follow-up of https://github.com/apache/spark/pull/44852.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, Users can pass ResourceProfile to mapInPandas/mapInArrow through the 
connect pysprark client.
   
   
   ### How was this patch tested?
   
   Pass the CIs and manual tests.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

45 matches

Mail list logo