GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2556
[SPARK-3478] [PySpark] Profile the Python tasks

This patch adds profiling support for PySpark. It shows the profiling results before the driver exits. Here is one example:

```
============================================================
Profile of RDD<id=3>
============================================================
        5146507 function calls (5146487 primitive calls) in 71.094 seconds

   Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  5144576   68.331    0.000   68.331    0.000 statcounter.py:44(merge)
       20    2.735    0.137   71.071    3.554 statcounter.py:33(__init__)
       20    0.017    0.001    0.017    0.001 {cPickle.dumps}
     1024    0.003    0.000    0.003    0.000 t.py:16(<lambda>)
       20    0.001    0.000    0.001    0.000 {reduce}
       21    0.001    0.000    0.001    0.000 {cPickle.loads}
       20    0.001    0.000    0.001    0.000 copy_reg.py:95(_slotnames)
       41    0.001    0.000    0.001    0.000 serializers.py:461(read_int)
       40    0.001    0.000    0.002    0.000 serializers.py:179(_batched)
       62    0.000    0.000    0.000    0.000 {method 'read' of 'file' objects}
       20    0.000    0.000   71.072    3.554 rdd.py:863(<lambda>)
       20    0.000    0.000    0.001    0.000 serializers.py:198(load_stream)
    40/20    0.000    0.000   71.072    3.554 rdd.py:2093(pipeline_func)
       41    0.000    0.000    0.002    0.000 serializers.py:130(load_stream)
       40    0.000    0.000   71.072    1.777 rdd.py:304(func)
       20    0.000    0.000   71.094    3.555 worker.py:82(process)
```

Also, users can show the profile results manually with `sc.show_profiles()` or dump them to disk with `sc.dump_profiles(path)`, for example:

```python
>>> sc._conf.set("spark.python.profile", "true")
>>> rdd = sc.parallelize(range(100)).map(str)
>>> rdd.count()
100
>>> sc.show_profiles()
============================================================
Profile of RDD<id=1>
============================================================
        284 function calls (276 primitive calls) in 0.001 seconds

   Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        4    0.000    0.000    0.000    0.000 serializers.py:198(load_stream)
        4    0.000    0.000    0.000    0.000 {reduce}
     12/4    0.000    0.000    0.001    0.000 rdd.py:2092(pipeline_func)
        4    0.000    0.000    0.000    0.000 {cPickle.loads}
        4    0.000    0.000    0.000    0.000 {cPickle.dumps}
      104    0.000    0.000    0.000    0.000 rdd.py:852(<genexpr>)
        8    0.000    0.000    0.000    0.000 serializers.py:461(read_int)
       12    0.000    0.000    0.000    0.000 rdd.py:303(func)
```

Profiling is disabled by default and can be enabled with "spark.python.profile=true". Users can also have the results dumped to disk automatically for later analysis, via "spark.python.profile.dump=path_to_dump".

This is a bug fix of #2351.

cc @JoshRosen

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark profiler

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2556.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2556

----

commit 4b20494ce4e5e287a09fee5df5e0684711258627
Author: Davies Liu <davies....@gmail.com>
Date:   2014-09-11T00:51:28Z

    add profile for python

commit 0a5b6ebcd38f13fa15721c56a9d96bd9000529f5
Author: Davies Liu <davies....@gmail.com>
Date:   2014-09-11T03:25:23Z

    fix Python UDF

commit 4f8309d7d8df18fb5f4da1d9f150d7606bf650c9
Author: Davies Liu <davies....@gmail.com>
Date:   2014-09-13T03:14:34Z

    address comment, add tests

commit dadee1a228b20d24e4a6b0a7d081f1b30f773988
Author: Davies Liu <davies....@gmail.com>
Date:   2014-09-13T04:51:33Z

    add docs string and clear profiles after show or dump

commit 15d6f18fd97422ff7bebf343383b7eca9ef433bc
Author: Davies Liu <davies....@gmail.com>
Date:   2014-09-13T05:09:06Z

    add docs for two configs

commit c23865c6307963f97420d9213d6fb26ab0163f0d
Author: Davies Liu <davies....@gmail.com>
Date:   2014-09-13T05:14:19Z

    Merge branch 'master' into profiler

commit 09d02c33496598533336a24e0c4ee84e3b6c5317
Author: Davies Liu <davies....@gmail.com>
Date:   2014-09-14T04:23:19Z

    Merge branch 'master' into profiler

    Conflicts:
        docs/configuration.md
commit 116d52a1251140282a2cd5c49ad928b219c759b5
Author: Davies Liu <davies....@gmail.com>
Date:   2014-09-17T17:14:53Z

    Merge branch 'master' of github.com:apache/spark into profiler

    Conflicts:
        python/pyspark/worker.py

commit fb9565b2afdd7fbaa1cc6cf4b1971fba2d9919b0
Author: Davies Liu <davies....@gmail.com>
Date:   2014-09-23T22:16:56Z

    Merge branch 'master' of github.com:apache/spark into profiler

    Conflicts:
        python/pyspark/worker.py

commit cba94639fa6e5c4b2cb26f3152ea80bffaf65cce
Author: Davies Liu <davies....@gmail.com>
Date:   2014-09-24T23:05:06Z

    move show_profiles and dump_profiles to SparkContext

commit 7a56c2420dd087cbe311d34fa81b5b9d22024b53
Author: Davies Liu <davies....@gmail.com>
Date:   2014-09-24T23:12:11Z

    bugfix

commit 2b0daf207384b7cbf15a180bb05985fb596e8281
Author: Davies Liu <davies....@gmail.com>
Date:   2014-09-24T23:13:25Z

    fix docs

commit 7ef2aa05cf07b2648cb73cd05f2ece93a44d9b9a
Author: Davies Liu <davies....@gmail.com>
Date:   2014-09-25T21:47:49Z

    bugfix, add tests for show_profiles and dump_profiles()

commit 858e74caf5063e43fe7621716bc3e2048321ea00
Author: Davies Liu <davies....@gmail.com>
Date:   2014-09-27T04:29:40Z

    compatitable with python 2.6

commit e68df5a2ada0044f76d748f4e5dd250a1928812b
Author: Davies Liu <davies....@gmail.com>
Date:   2014-09-27T04:30:11Z

    Merge branch 'master' of github.com:apache/spark into profiler

----
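The reports above are produced with Python's standard cProfile/pstats machinery, so stats files written by `sc.dump_profiles(path)` can be inspected offline with the stdlib `pstats` module. A minimal standalone sketch (no Spark required; the `rdd_1.pstats` file name is only an illustrative assumption about the dump layout, and `merge_heavy` is a made-up stand-in for a hot function):

```python
import cProfile
import os
import pstats
import tempfile

def merge_heavy(n):
    # Stand-in for a hot function like statcounter.py:merge in the report above.
    total = 0
    for i in range(n):
        total += i * i
    return total

# Collect a profile, the same way the PySpark worker wraps task execution.
profiler = cProfile.Profile()
profiler.enable()
merge_heavy(100_000)
profiler.disable()

# Dump the stats to disk, mimicking sc.dump_profiles(path).
path = os.path.join(tempfile.mkdtemp(), "rdd_1.pstats")  # assumed file name
profiler.dump_stats(path)

# Load the dumped file later and sort the way the PySpark reports do:
# internal time first, then cumulative time.
stats = pstats.Stats(path)
stats.sort_stats("time", "cumulative").print_stats(5)
```

Because the dump format is plain cProfile output, the same `pstats` tooling (or any cProfile viewer) works on files produced by "spark.python.profile.dump=path_to_dump" without any Spark-specific code.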