Xinrong Meng created SPARK-40281:
------------------------------------

             Summary: Memory Profiler on Executors
                 Key: SPARK-40281
                 URL: https://issues.apache.org/jira/browse/SPARK-40281
             Project: Spark
          Issue Type: Umbrella
          Components: PySpark
    Affects Versions: 3.4.0
            Reporter: Xinrong Meng

Profiling is critical to performance engineering, and memory consumption is a key indicator of how efficient a PySpark program is. There is an existing tool for memory profiling of Python programs, Memory Profiler (https://pypi.org/project/memory-profiler/).

PySpark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in the driver program. On the driver side, PySpark is a regular Python process, so we can profile it as a normal Python program using Memory Profiler.

On the executor side, however, no such memory profiler exists. Since executors are distributed across different nodes in the cluster, we need to aggregate their profiles. Furthermore, Python worker processes are spawned per executor for Python/Pandas UDF execution, which makes memory profiling more intricate.

This umbrella proposes implementing a Memory Profiler on Executors.
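
As a rough illustration of the aggregation problem described above, the sketch below measures peak memory per call in each simulated worker and then merges the per-worker results into one summary. It is only a sketch under stated assumptions: it uses the standard-library tracemalloc module as a stand-in for Memory Profiler, it runs in a single process rather than on real executors, and the names profile_call, merge_profiles, and build_list are hypothetical and not part of PySpark or of this proposal.

```python
import tracemalloc
from collections import defaultdict


def profile_call(func, *args, **kwargs):
    """Run func and return (result, peak_bytes) as measured by tracemalloc.

    Assumption: tracemalloc stands in for Memory Profiler here.
    """
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def merge_profiles(profiles):
    """Merge per-worker {udf_name: [peak_bytes, ...]} dicts into one summary."""
    merged = defaultdict(list)
    for worker_profile in profiles:
        for udf_name, peaks in worker_profile.items():
            merged[udf_name].extend(peaks)
    return {
        name: {"calls": len(peaks), "max_peak_bytes": max(peaks)}
        for name, peaks in merged.items()
    }


def build_list(n):
    # Hypothetical stand-in for the body of a Python UDF.
    return [i * 2 for i in range(n)]


# Simulate two Python worker processes, each profiling the same UDF.
worker_a = {"build_list": [profile_call(build_list, 10_000)[1]]}
worker_b = {
    "build_list": [
        profile_call(build_list, 50_000)[1],
        profile_call(build_list, 1_000)[1],
    ]
}

# The driver-side step: combine what each worker reported.
summary = merge_profiles([worker_a, worker_b])
print(summary)
```

In a real cluster the per-worker dictionaries would have to be shipped back to the driver by some mechanism, which this sketch does not attempt to model.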