This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
     new d1b88a71d93  [SPARK-41150][PYTHON][DOCS] Document debugging with PySpark memory profiler
d1b88a71d93 is described below

commit d1b88a71d937e5b64e429ba0a35aeeb5d0bba6c4
Author: Xinrong Meng <xinr...@apache.org>
AuthorDate: Thu Nov 17 10:46:21 2022 +0900

    [SPARK-41150][PYTHON][DOCS] Document debugging with PySpark memory profiler

    ### What changes were proposed in this pull request?
    Document how to debug Python/Pandas UDFs with the PySpark memory profiler.

    ### Why are the changes needed?
    That's a sub-task of [SPARK-40281](https://issues.apache.org/jira/browse/SPARK-40281) Memory Profiler on Executors.
    Since the PySpark memory profiler has been implemented, we should document how to debug Python/Pandas UDFs with it.

    ### Does this PR introduce _any_ user-facing change?
    No. Documentation changes only.

    ### How was this patch tested?
    Existing tests.

    Closes #38677 from xinrong-meng/debug_doc.

    Authored-by: Xinrong Meng <xinr...@apache.org>
    Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
---
 python/docs/source/development/debugging.rst | 62 +++++++++++++++++++++++++++-
 1 file changed, 61 insertions(+), 1 deletion(-)

diff --git a/python/docs/source/development/debugging.rst b/python/docs/source/development/debugging.rst
index 05c47ae4bf7..ba656294ef4 100644
--- a/python/docs/source/development/debugging.rst
+++ b/python/docs/source/development/debugging.rst
@@ -172,7 +172,10 @@ Profiling Memory Usage (Memory Profiler)
 ----------------------------------------
 
 `memory_profiler <https://github.com/pythonprofilers/memory_profiler>`_ is one of the profilers that allow you to
-check the memory usage line by line. This method documented here *only works for the driver side*.
+check the memory usage line by line.
+
+Driver Side
+~~~~~~~~~~~
 
 Unless you are running your driver program in another machine (e.g., YARN cluster mode), this useful tool can be used
 to debug the memory usage on driver side easily. Suppose your PySpark script name is ``profile_memory.py``.
@@ -208,6 +211,63 @@ You can profile it as below.
      8     51.5 MiB      0.0 MiB           df = session.range(10000)
      9     54.4 MiB      2.8 MiB           return df.collect()
 
+Python/Pandas UDF
+~~~~~~~~~~~~~~~~~
+
+PySpark provides a remote `memory_profiler <https://github.com/pythonprofilers/memory_profiler>`_ for
+Python/Pandas UDFs, which can be enabled by setting the ``spark.python.profile.memory`` configuration to ``true``. It
+is best used in editors that show line numbers, such as Jupyter notebooks. An example from a Jupyter notebook is shown below.
+
+.. code-block:: bash
+
+    pyspark --conf spark.python.profile.memory=true
+
+
+.. code-block:: python

+
+    from pyspark.sql.functions import pandas_udf
+    df = spark.range(10)
+
+    @pandas_udf("long")
+    def add1(x):
+        return x + 1
+
+    added = df.select(add1("id"))
+    added.show()
+    sc.show_profiles()
+
+
+The resulting profile is shown below.
+
+.. code-block:: text
+
+    ============================================================
+    Profile of UDF<id=2>
+    ============================================================
+    Filename: ...
+
+    Line #    Mem usage    Increment  Occurrences   Line Contents
+    =============================================================
+         4    974.0 MiB    974.0 MiB          10   @pandas_udf("long")
+         5                                         def add1(x):
+         6    974.4 MiB      0.4 MiB          10       return x + 1
+
+The UDF IDs can be seen in the query plan, for example, ``add1(...)#2L`` in ``ArrowEvalPython`` as shown below.
+
+.. code-block:: python
+
+    added.explain()
+
+
+.. code-block:: text
+
+    == Physical Plan ==
+    *(2) Project [pythonUDF0#11L AS add1(id)#3L]
+    +- ArrowEvalPython [add1(id#0L)#2L], [pythonUDF0#11L], 200
+       +- *(1) Range (0, 10, step=1, splits=16)
+
+This feature is not supported with registered UDFs or UDFs with iterators as inputs/outputs.
+
 Identifying Hot Loops (Python Profilers)
 ----------------------------------------
 
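The ``profile_memory.py`` script itself is elided by the first hunk; only lines 8 and 9 of its profile output appear as context. A minimal sketch consistent with that output, assuming the usual ``memory_profiler`` ``@profile`` decorator workflow (the function name ``my_func`` is illustrative, not taken from the patch), could look like:

.. code-block:: python

    from pyspark.sql import SparkSession
    from memory_profiler import profile  # pip install memory_profiler


    @profile  # memory_profiler reports per-line memory for the decorated function
    def my_func():
        session = SparkSession.builder.getOrCreate()
        df = session.range(10000)  # line 8 in the hunk context above
        return df.collect()        # line 9 in the hunk context above


    if __name__ == "__main__":
        my_func()

Running it with ``python -m memory_profiler profile_memory.py`` prints a line-by-line table of the form shown in the hunk context.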
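The closing note of the patch rules out registered UDFs and iterator-style UDFs. A minimal sketch of the iterator variant that the profiler therefore skips, assuming a ``pyspark`` shell where ``spark`` is defined (the name ``add1_iter`` is illustrative):

.. code-block:: python

    from typing import Iterator

    import pandas as pd
    from pyspark.sql.functions import pandas_udf


    # An Iterator[pd.Series] -> Iterator[pd.Series] Pandas UDF; per the note
    # above, this variant produces no entry in sc.show_profiles() even with
    # spark.python.profile.memory=true.
    @pandas_udf("long")
    def add1_iter(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
        for batch in batches:
            yield batch + 1

    spark.range(10).select(add1_iter("id")).show()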