[ https://issues.apache.org/jira/browse/SPARK-40281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-40281:
---------------------------------
    Description: 
Profiling is critical to performance engineering. Memory consumption is a key 
indicator of how efficient a PySpark program is. An existing tool for memory 
profiling of Python programs is Memory Profiler 
([https://pypi.org/project/memory-profiler/]).

PySpark applications run as independent sets of processes on a cluster, 
coordinated by the SparkContext object in the driver program. On the driver 
side, PySpark is a regular Python process, so it can be profiled like any 
other Python program with Memory Profiler.
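For example, a minimal driver-side sketch (assuming the memory-profiler package is installed; the function and dataset below are purely illustrative):

{code:python}
from memory_profiler import profile  # pip install memory-profiler
from pyspark.sql import SparkSession

@profile  # Memory Profiler prints line-by-line memory usage of this driver-side function
def collect_squares():
    spark = SparkSession.builder.appName("driver-memory-demo").getOrCreate()
    df = spark.range(1_000_000).selectExpr("id * id AS square")
    rows = df.collect()  # results are materialized in the driver process
    spark.stop()
    return len(rows)

if __name__ == "__main__":
    collect_squares()
{code}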

However, on the executor side, such a memory profiler is missing. Since 
executors are distributed across different nodes in the cluster, profiles 
need to be aggregated. Furthermore, Python worker processes are spawned per 
executor to execute Python/Pandas UDFs, which makes memory profiling more 
intricate.
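To make the target concrete, the memory of interest is what code like the following consumes inside the per-executor Python workers (a minimal, illustrative Pandas UDF; the names are made up):

{code:python}
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("udf-memory-demo").getOrCreate()

@pandas_udf("double")
def normalize(s: pd.Series) -> pd.Series:
    # This body runs inside a Python worker process on each executor,
    # so its memory usage is invisible to a driver-side Memory Profiler run.
    return (s - s.mean()) / s.std()

df = spark.range(1_000_000).selectExpr("CAST(id AS double) AS x")
df.select(normalize("x")).show(5)
{code}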

This ticket proposes implementing a Memory Profiler on Executors.
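One possible usage shape, modeled on the existing executor-side (CPU) UDF profiler ({{spark.python.profile}} / {{show_profiles()}}); the configuration key and output described below are assumptions for illustration, not a committed API:

{code:python}
# Hypothetical sketch only: the config key mirrors the existing
# "spark.python.profile" pattern and is an assumption, not a final API.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("udf-memory-profile-demo")
    .config("spark.python.profile.memory", "true")  # assumed flag
    .getOrCreate()
)

# ... run Python/Pandas UDFs as usual ...

# The existing CPU profiler aggregates per-UDF stats across executors and
# exposes them via show_profiles(); an executor memory profiler could surface
# line-by-line memory results collected from the Python workers the same way.
spark.sparkContext.show_profiles()
{code}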

 

For more details, see the 
[design doc|https://docs.google.com/document/d/e/2PACX-1vQLphItWY-WYO32ZQwtBpYbagqfep_Hk-cL_-UV8r6tiYFMp1QDJPGNmBEi-xBp_vlkcCMCW0hDBI6j/pub].

> Memory Profiler on Executors
> ----------------------------
>
>                 Key: SPARK-40281
>                 URL: https://issues.apache.org/jira/browse/SPARK-40281
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>    Affects Versions: 3.4.0
>            Reporter: Xinrong Meng
>            Priority: Major
>


