This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
     new d1b88a71d93  [SPARK-41150][PYTHON][DOCS] Document debugging with PySpark memory profiler
d1b88a71d93 is described below

commit d1b88a71d937e5b64e429ba0a35aeeb5d0bba6c4
Author: Xinrong Meng <xinr...@apache.org>
AuthorDate: Thu Nov 17 10:46:21 2022 +0900

    [SPARK-41150][PYTHON][DOCS] Document debugging with PySpark memory profiler

    ### What changes were proposed in this pull request?
    Document how to debug Python/Pandas UDFs with the PySpark memory profiler.

    ### Why are the changes needed?
    That's a sub-task of [SPARK-40281](https://issues.apache.org/jira/browse/SPARK-40281) Memory Profiler on Executors.
    Since the PySpark memory profiler has been implemented, we should document how to debug Python/Pandas UDFs with it.

    ### Does this PR introduce _any_ user-facing change?
    No. Documentation changes only.

    ### How was this patch tested?
    Existing tests.

    Closes #38677 from xinrong-meng/debug_doc.

    Authored-by: Xinrong Meng <xinr...@apache.org>
    Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
---
 python/docs/source/development/debugging.rst | 62 +++++++++++++++++++++++++++-
 1 file changed, 61 insertions(+), 1 deletion(-)

diff --git a/python/docs/source/development/debugging.rst b/python/docs/source/development/debugging.rst
index 05c47ae4bf7..ba656294ef4 100644
--- a/python/docs/source/development/debugging.rst
+++ b/python/docs/source/development/debugging.rst
@@ -172,7 +172,10 @@ Profiling Memory Usage (Memory Profiler)
 ----------------------------------------
 
 `memory_profiler <https://github.com/pythonprofilers/memory_profiler>`_ is one of the profilers that allow you to
-check the memory usage line by line. This method documented here *only works for the driver side*.
+check the memory usage line by line.
+
+Driver Side
+~~~~~~~~~~~
 
 Unless you are running your driver program in another machine (e.g., YARN cluster mode), this useful tool can be used
 to debug the memory usage on driver side easily. Suppose your PySpark script name is ``profile_memory.py``.
@@ -208,6 +211,63 @@ You can profile it as below.
      8     51.5 MiB      0.0 MiB           df = session.range(10000)
      9     54.4 MiB      2.8 MiB           return df.collect()
 
+Python/Pandas UDF
+~~~~~~~~~~~~~~~~~
+
+PySpark provides a remote `memory_profiler <https://github.com/pythonprofilers/memory_profiler>`_ for
+Python/Pandas UDFs, which can be enabled by setting the ``spark.python.profile.memory`` configuration to ``true``. It
+is best used in editors that show line numbers, such as Jupyter notebooks. An example from a Jupyter notebook is shown below.
+
+.. code-block:: bash
+
+    pyspark --conf spark.python.profile.memory=true
+
+
+.. code-block:: python

+
+    from pyspark.sql.functions import pandas_udf
+    df = spark.range(10)
+
+    @pandas_udf("long")
+    def add1(x):
+        return x + 1
+
+    added = df.select(add1("id"))
+    added.show()
+    sc.show_profiles()
+
+
+The resulting profile is shown below.
+
+.. code-block:: text
+
+    ============================================================
+    Profile of UDF<id=2>
+    ============================================================
+    Filename: ...
+
+    Line #    Mem usage    Increment  Occurrences   Line Contents
+    =============================================================
+         4    974.0 MiB    974.0 MiB          10   @pandas_udf("long")
+         5                                         def add1(x):
+         6    974.4 MiB      0.4 MiB          10       return x + 1
+
+The UDF IDs can be seen in the query plan, for example, ``add1(...)#2L`` in ``ArrowEvalPython`` as shown below.
+
+.. code-block:: python
+
+    added.explain()
+
+
+.. code-block:: text
+
+    == Physical Plan ==
+    *(2) Project [pythonUDF0#11L AS add1(id)#3L]
+    +- ArrowEvalPython [add1(id#0L)#2L], [pythonUDF0#11L], 200
+       +- *(1) Range (0, 10, step=1, splits=16)
+
+This feature is not supported with registered UDFs or UDFs with iterators as inputs/outputs.
+
 Identifying Hot Loops (Python Profilers)
 ----------------------------------------
 
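The ``profile_memory.py`` script itself is elided by the first hunk; only lines 8 and 9 of its profile output appear as context. A minimal sketch consistent with that output, assuming the usual ``memory_profiler`` ``@profile`` decorator workflow (the function name ``my_func`` is illustrative, not taken from the patch), could look like:

.. code-block:: python

    from pyspark.sql import SparkSession
    from memory_profiler import profile  # pip install memory_profiler


    @profile  # memory_profiler reports per-line memory for the decorated function
    def my_func():
        session = SparkSession.builder.getOrCreate()
        df = session.range(10000)  # line 8 in the hunk context above
        return df.collect()        # line 9 in the hunk context above


    if __name__ == "__main__":
        my_func()

Running it with ``python -m memory_profiler profile_memory.py`` prints a line-by-line table of the form shown in the hunk context.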
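The closing note of the patch rules out registered UDFs and iterator-style UDFs. A minimal sketch of the iterator variant that the profiler therefore skips, assuming a ``pyspark`` shell where ``spark`` is defined (the name ``add1_iter`` is illustrative):

.. code-block:: python

    from typing import Iterator

    import pandas as pd
    from pyspark.sql.functions import pandas_udf


    # An Iterator[pd.Series] -> Iterator[pd.Series] Pandas UDF; per the note
    # above, this variant produces no entry in sc.show_profiles() even with
    # spark.python.profile.memory=true.
    @pandas_udf("long")
    def add1_iter(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
        for batch in batches:
            yield batch + 1

    spark.range(10).select(add1_iter("id")).show()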