Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

2024-03-07 Thread via GitHub


xinrong-meng commented on PR #45378:
URL: https://github.com/apache/spark/pull/45378#issuecomment-1984523232

   Merged to master, thank you all!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

2024-03-07 Thread via GitHub


xinrong-meng closed pull request #45378: [SPARK-47276][PYTHON][CONNECT] 
Introduce `spark.profile.clear` for SparkSession-based profiling
URL: https://github.com/apache/spark/pull/45378





Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

2024-03-07 Thread via GitHub


xinrong-meng commented on code in PR #45378:
URL: https://github.com/apache/spark/pull/45378#discussion_r1516752307


##
python/pyspark/sql/tests/test_session.py:
##
@@ -531,6 +531,33 @@ def test_dump_invalid_type(self):
 },
 )
 
+def test_clear_memory_type(self):

Review Comment:
   Good idea!
   
   For now, all logic tested by SparkSessionProfileTests is directly imported 
in Spark Connect with no modification. But I do agree separating it later will 
improve readability and ensure future parity. I'll refactor later. Thanks!






Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

2024-03-07 Thread via GitHub


ueshin commented on code in PR #45378:
URL: https://github.com/apache/spark/pull/45378#discussion_r1516750441


##
python/pyspark/sql/profiler.py:
##
@@ -224,6 +224,54 @@ def dump(id: int) -> None:
 for id in sorted(code_map.keys()):
 dump(id)
 
+    def clear_perf_profiles(self, id: Optional[int] = None) -> None:
+        """
+        Clear the perf profile results.
+
+        .. versionadded:: 4.0.0

Review Comment:
   Actually, this one is not. The user-facing API should be `clear` in `Profile`.






Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

2024-03-07 Thread via GitHub


xinrong-meng commented on code in PR #45378:
URL: https://github.com/apache/spark/pull/45378#discussion_r1516752307


##
python/pyspark/sql/tests/test_session.py:
##
@@ -531,6 +531,33 @@ def test_dump_invalid_type(self):
 },
 )
 
+def test_clear_memory_type(self):

Review Comment:
   Good idea!
   
   For now, all logic tested by SparkSessionProfileTests is directly imported 
in Spark Connect with no modification. But I do agree separating it later will 
improve readability. I'll refactor later. Thanks!






Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

2024-03-06 Thread via GitHub


zhengruifeng commented on code in PR #45378:
URL: https://github.com/apache/spark/pull/45378#discussion_r1515475836


##
python/pyspark/sql/tests/test_session.py:
##
@@ -531,6 +531,33 @@ def test_dump_invalid_type(self):
 },
 )
 
+def test_clear_memory_type(self):

Review Comment:
   nit: it seems we don't have a parity test for `test_session`. Does it make 
sense to move `SparkSessionProfileTests` out of `test_session` and add a parity 
test for it?






Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

2024-03-06 Thread via GitHub


xinrong-meng commented on code in PR #45378:
URL: https://github.com/apache/spark/pull/45378#discussion_r1515363804


##
python/pyspark/sql/profiler.py:
##
@@ -236,18 +236,22 @@ def clear_perf_profiles(self, id: Optional[int] = None) -> None:
             The UDF ID whose profiling results should be cleared.
             If not specified, all the results will be cleared.
         """
-        ids_to_remove = [
-            result_id
-            for result_id, (perf, _, *_) in self._profile_results.items()
-            if perf is not None
-        ]
         with self._lock:
             if id is not None:
-                if id in ids_to_remove:
-                    self._profile_results.pop(id, None)
+                if id in self._profile_results:
+                    perf, mem, *rest = self._profile_results[id]
+                    self._profile_results[id] = (None, mem, *rest)
+                    if mem is None:
+                        self._profile_results.pop(id, None)
             else:
-                for id_to_remove in ids_to_remove:
-                    self._profile_results.pop(id_to_remove, None)
+                ids_to_remove = []
+                for id, (perf, mem, *rest) in list(self._profile_results.items()):
+                    self._profile_results[id] = (None, mem, *rest)
+                    if mem is None:
+                        ids_to_remove.append(id)

Review Comment:
   Good idea! Adjusted.






Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

2024-03-06 Thread via GitHub


ueshin commented on code in PR #45378:
URL: https://github.com/apache/spark/pull/45378#discussion_r1515240180


##
python/pyspark/sql/profiler.py:
##
@@ -236,18 +236,22 @@ def clear_perf_profiles(self, id: Optional[int] = None) -> None:
             The UDF ID whose profiling results should be cleared.
             If not specified, all the results will be cleared.
         """
-        ids_to_remove = [
-            result_id
-            for result_id, (perf, _, *_) in self._profile_results.items()
-            if perf is not None
-        ]
         with self._lock:
             if id is not None:
-                if id in ids_to_remove:
-                    self._profile_results.pop(id, None)
+                if id in self._profile_results:
+                    perf, mem, *rest = self._profile_results[id]
+                    self._profile_results[id] = (None, mem, *rest)
+                    if mem is None:
+                        self._profile_results.pop(id, None)
             else:
-                for id_to_remove in ids_to_remove:
-                    self._profile_results.pop(id_to_remove, None)
+                ids_to_remove = []
+                for id, (perf, mem, *rest) in list(self._profile_results.items()):
+                    self._profile_results[id] = (None, mem, *rest)
+                    if mem is None:
+                        ids_to_remove.append(id)

Review Comment:
   nit: Can't we pop it here?



##
python/pyspark/sql/profiler.py:
##
@@ -262,15 +266,21 @@ def clear_memory_profiles(self, id: Optional[int] = None) -> None:
             If not specified, all the results will be cleared.
         """
         with self._lock:
-            ids_to_remove = [
-                id for id, (_, mem, *_) in self._profile_results.items() if mem is not None
-            ]
             if id is not None:
-                if id in ids_to_remove:
-                    self._profile_results.pop(id, None)
+                if id in self._profile_results:
+                    perf, mem, *rest = self._profile_results[id]
+                    self._profile_results[id] = (perf, None, *rest)
+                    if perf is None:
+                        self._profile_results.pop(id, None)
             else:
-                for id_to_remove in ids_to_remove:
-                    self._profile_results.pop(id_to_remove, None)
+                ids_to_remove = []
+                for id, (perf, mem, *rest) in list(self._profile_results.items()):
+                    self._profile_results[id] = (perf, None, *rest)
+                    if perf is None:
+                        ids_to_remove.append(id)

Review Comment:
   ditto.
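
For illustration only, here is a minimal standalone sketch of what popping inside the loop could look like. It is not the code that was merged. The names `Results` and `clear_perf` are hypothetical, the result map is simplified to `(perf, mem)` pairs of strings, and the lock and the `*rest` tuple fields from the hunks above are omitted.

```python
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for the profiler's result map: UDF id -> (perf, mem).
Results = Dict[int, Tuple[Optional[str], Optional[str]]]


def clear_perf(results: Results, udf_id: Optional[int] = None) -> None:
    """Clear perf results; drop an entry only when no memory result remains."""
    # Snapshot the keys so that popping inside the loop is safe.
    targets = [udf_id] if udf_id is not None else list(results.keys())
    for rid in targets:
        if rid not in results:
            continue
        _, mem = results[rid]
        if mem is None:
            # Nothing left for this UDF: pop right here, no ids_to_remove list.
            results.pop(rid)
        else:
            # Keep the memory result and blank out only the perf slot.
            results[rid] = (None, mem)


results: Results = {1: ("perf-1", None), 2: ("perf-2", "mem-2")}
clear_perf(results)
print(results)  # {2: (None, 'mem-2')}
```

Iterating over a snapshot of the keys is what makes the in-loop `pop` safe; mutating a dict while iterating over it directly would raise a `RuntimeError`.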






Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

2024-03-06 Thread via GitHub


xinrong-meng commented on code in PR #45378:
URL: https://github.com/apache/spark/pull/45378#discussion_r1515212345


##
python/pyspark/sql/profiler.py:
##
@@ -224,6 +224,54 @@ def dump(id: int) -> None:
 for id in sorted(code_map.keys()):
 dump(id)
 
+    def clear_perf_profiles(self, id: Optional[int] = None) -> None:
+        """
+        Clear the perf profile results.
+
+        .. versionadded:: 4.0.0
+
+        Parameters
+        ----------
+        id : int, optional
+            The UDF ID whose profiling results should be cleared.
+            If not specified, all the results will be cleared.
+        """
+        ids_to_remove = [
+            result_id
+            for result_id, (perf, _, *_) in self._profile_results.items()
+            if perf is not None
+        ]
+        with self._lock:
+            if id is not None:
+                if id in ids_to_remove:
+                    self._profile_results.pop(id, None)

Review Comment:
   Good catch! Thanks for the example. I adjusted the code and added tests.






Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

2024-03-06 Thread via GitHub


ueshin commented on code in PR #45378:
URL: https://github.com/apache/spark/pull/45378#discussion_r1515098224


##
python/pyspark/sql/profiler.py:
##
@@ -224,6 +224,54 @@ def dump(id: int) -> None:
 for id in sorted(code_map.keys()):
 dump(id)
 
+    def clear_perf_profiles(self, id: Optional[int] = None) -> None:
+        """
+        Clear the perf profile results.
+
+        .. versionadded:: 4.0.0
+
+        Parameters
+        ----------
+        id : int, optional
+            The UDF ID whose profiling results should be cleared.
+            If not specified, all the results will be cleared.
+        """
+        ids_to_remove = [
+            result_id
+            for result_id, (perf, _, *_) in self._profile_results.items()
+            if perf is not None
+        ]
+        with self._lock:
+            if id is not None:
+                if id in ids_to_remove:
+                    self._profile_results.pop(id, None)

Review Comment:
   On Jupyter:
   
   ```py
   from pyspark.sql.functions import pandas_udf
   df = spark.range(3)
   
   @pandas_udf("long")
   def add1(x):
       return x + 1
   
   added = df.select(add1("id"))
   
   spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")
   added.show()
   
   spark.conf.set("spark.sql.pyspark.udf.profiler", "memory")
   added.show()
   
   spark.profile.show()
   ...
   
   spark.profile.clear(type="memory")
   
   spark.profile.show()  # should still show the perf results?
   ```






Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

2024-03-05 Thread via GitHub


ueshin commented on code in PR #45378:
URL: https://github.com/apache/spark/pull/45378#discussion_r1513759920


##
python/pyspark/sql/profiler.py:
##
@@ -224,6 +224,54 @@ def dump(id: int) -> None:
 for id in sorted(code_map.keys()):
 dump(id)
 
+    def clear_perf_profiles(self, id: Optional[int] = None) -> None:
+        """
+        Clear the perf profile results.
+
+        .. versionadded:: 4.0.0
+
+        Parameters
+        ----------
+        id : int, optional
+            The UDF ID whose profiling results should be cleared.
+            If not specified, all the results will be cleared.
+        """
+        ids_to_remove = [
+            result_id
+            for result_id, (perf, _, *_) in self._profile_results.items()
+            if perf is not None
+        ]
+        with self._lock:
+            if id is not None:
+                if id in ids_to_remove:
+                    self._profile_results.pop(id, None)

Review Comment:
   Seems to be removing `memory` as well?
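
To make the concern concrete, here is a small standalone sketch that mimics the logic of the quoted hunk outside of Spark. `Results` and `clear_perf_by_pop` are hypothetical names, and the result map is simplified to `(perf, mem)` pairs of strings, so this only approximates what the real `_profile_results` holds.

```python
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for the profiler's result map: UDF id -> (perf, mem).
Results = Dict[int, Tuple[Optional[str], Optional[str]]]


def clear_perf_by_pop(results: Results, udf_id: int) -> None:
    """Mimic the hunk above: pop the whole entry when a perf result exists."""
    ids_to_remove = [rid for rid, (perf, _) in results.items() if perf is not None]
    if udf_id in ids_to_remove:
        results.pop(udf_id, None)


# UDF 2 was profiled with both the perf and the memory profiler.
results: Results = {2: ("perf-stats", "memory-stats")}
clear_perf_by_pop(results, 2)

memory_survived = 2 in results and results[2][1] is not None
print(memory_survived)  # False: the memory result was dropped with the perf one
```

Because both results share one entry keyed by the UDF id, popping the entry cannot clear one profile type without the other.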






Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

2024-03-05 Thread via GitHub


xinrong-meng commented on code in PR #45378:
URL: https://github.com/apache/spark/pull/45378#discussion_r1513714435


##
python/pyspark/sql/profiler.py:
##
@@ -224,6 +224,54 @@ def dump(id: int) -> None:
 for id in sorted(code_map.keys()):
 dump(id)
 
+    def clear_perf_profiles(self, id: Optional[int] = None) -> None:
+        """
+        Clear the perf profile results.
+
+        .. versionadded:: 4.0.0

Review Comment:
   It is a user-facing API, along with `profile.show` and `profile.dump`. We 
will also add it to the API docs.






Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

2024-03-05 Thread via GitHub


HyukjinKwon commented on code in PR #45378:
URL: https://github.com/apache/spark/pull/45378#discussion_r1513693941


##
python/pyspark/sql/profiler.py:
##
@@ -224,6 +224,54 @@ def dump(id: int) -> None:
 for id in sorted(code_map.keys()):
 dump(id)
 
+    def clear_perf_profiles(self, id: Optional[int] = None) -> None:
+        """
+        Clear the perf profile results.
+
+        .. versionadded:: 4.0.0

Review Comment:
   Is this a user-facing API? If not, we don't need this version directive.






Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

2024-03-05 Thread via GitHub


xinrong-meng commented on PR #45378:
URL: https://github.com/apache/spark/pull/45378#issuecomment-1979697434

   The failed tests are unrelated to the changes proposed in this PR. Rerunning 
the failed tests: https://github.com/xinrong-meng/spark/actions/runs/8162084262

