[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/2556
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2556#issuecomment-57407523 I've merged this. Thanks for the fix!
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2556#issuecomment-57060226 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/169/consoleFull) for PR 2556 at commit [`e68df5a`](https://github.com/apache/spark/commit/e68df5a2ada0044f76d748f4e5dd250a1928812b). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2556#issuecomment-57058156 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/169/consoleFull) for PR 2556 at commit [`e68df5a`](https://github.com/apache/spark/commit/e68df5a2ada0044f76d748f4e5dd250a1928812b). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2556#issuecomment-57042267 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20903/
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2556 [SPARK-3478] [PySpark] Profile the Python tasks This patch adds profiling support for PySpark; it will show the profiling results before the driver exits. Here is one example:
```
Profile of RDD
         5146507 function calls (5146487 primitive calls) in 71.094 seconds

   Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  5144576   68.331    0.000   68.331    0.000 statcounter.py:44(merge)
       20    2.735    0.137   71.071    3.554 statcounter.py:33(__init__)
       20    0.017    0.001    0.017    0.001 {cPickle.dumps}
     1024    0.003    0.000    0.003    0.000 t.py:16()
       20    0.001    0.000    0.001    0.000 {reduce}
       21    0.001    0.000    0.001    0.000 {cPickle.loads}
       20    0.001    0.000    0.001    0.000 copy_reg.py:95(_slotnames)
       41    0.001    0.000    0.001    0.000 serializers.py:461(read_int)
       40    0.001    0.000    0.002    0.000 serializers.py:179(_batched)
       62    0.000    0.000    0.000    0.000 {method 'read' of 'file' objects}
       20    0.000    0.000   71.072    3.554 rdd.py:863()
       20    0.000    0.000    0.001    0.000 serializers.py:198(load_stream)
    40/20    0.000    0.000   71.072    3.554 rdd.py:2093(pipeline_func)
       41    0.000    0.000    0.002    0.000 serializers.py:130(load_stream)
       40    0.000    0.000   71.072    1.777 rdd.py:304(func)
       20    0.000    0.000   71.094    3.555 worker.py:82(process)
```
Also, users can show the profile results manually with `sc.show_profiles()` or dump them to disk with `sc.dump_profiles(path)`, for example:
```python
>>> sc._conf.set("spark.python.profile", "true")
>>> rdd = sc.parallelize(range(100)).map(str)
>>> rdd.count()
100
>>> sc.show_profiles()
Profile of RDD
         284 function calls (276 primitive calls) in 0.001 seconds

   Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        4    0.000    0.000    0.000    0.000 serializers.py:198(load_stream)
        4    0.000    0.000    0.000    0.000 {reduce}
     12/4    0.000    0.000    0.001    0.000 rdd.py:2092(pipeline_func)
        4    0.000    0.000    0.000    0.000 {cPickle.loads}
        4    0.000    0.000    0.000    0.000 {cPickle.dumps}
      104    0.000    0.000    0.000    0.000 rdd.py:852()
        8    0.000    0.000    0.000    0.000 serializers.py:461(read_int)
       12    0.000    0.000    0.000    0.000 rdd.py:303(func)
```
Profiling is disabled by default and can be enabled by "spark.python.profile=true".
Also, users can have the results dumped to disk automatically for future analysis by setting "spark.python.profile.dump=path_to_dump". This is a bugfix of #2351. cc @JoshRosen

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark profiler

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2556.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2556

commit 4b20494ce4e5e287a09fee5df5e0684711258627
Author: Davies Liu
Date: 2014-09-11T00:51:28Z
add profile for python

commit 0a5b6ebcd38f13fa15721c56a9d96bd9000529f5
Author: Davies Liu
Date: 2014-09-11T03:25:23Z
fix Python UDF

commit 4f8309d7d8df18fb5f4da1d9f150d7606bf650c9
Author: Davies Liu
Date: 2014-09-13T03:14:34Z
address comment, add tests

commit dadee1a228b20d24e4a6b0a7d081f1b30f773988
Author: Davies Liu
Date: 2014-09-13T04:51:33Z
add docs string and clear profiles after show or dump

commit 15d6f18fd97422ff7bebf343383b7eca9ef433bc
Author: Davies Liu
Date: 2014-09-13T05:09:06Z
add docs for two configs

commit c23865c6307963f97420d9213d6fb26ab0163f0d
Author: Davies Liu
Date: 2014-09-13T05:14:19Z
Merge branch 'master' into profiler

commit 09d02c3349659856a24e0c4ee84e3b6c5317
Author: Davies Liu
Date: 2014-09-14T04:23:19Z
Merge branch 'master' into profiler
Conflicts: docs/configuration.md

commit 116d52a1251140282a2cd5c49ad928b219c759b5
Author: Davies Liu
Date: 2014-09-17T17:14:53Z
Merge branch 'master' of github.com:apache/spark into profiler
Conflicts: python/pyspark/worker.py

commit fb9565b2afdd7fbaa1cc6cf4b1971fba2d99
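For offline analysis of the dumped results, here is a minimal sketch of loading them back with Python's standard `pstats` module; the dump directory and glob pattern below are illustrative assumptions, not part of the PR:

```python
import glob
import pstats

# Load every profile file dumped under spark.python.profile.dump and print
# the ten most expensive functions from each. The directory used here is an
# assumed example; point the glob at whatever dump path you configured.
for path in glob.glob("/tmp/pyspark_profile/*"):
    stats = pstats.Stats(path)
    stats.sort_stats("tottime", "cumtime").print_stats(10)
```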
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-57024838 Whoops, looks like this failed unit tests and caused a build-break. I'm going to revert it to un-break the build while we investigate.
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-56988133 Thanks for reviewing this; your comments made it much better.
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/2351
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2351#discussion_r18098324 --- Diff: docs/configuration.md --- @@ -207,6 +207,25 @@ Apart from these, the following properties are also available, and may be useful + spark.python.profile + false + +Enable profiling in Python worker, the profile result will show up by `sc.show_profiles()`, +or it will be displayed before the driver exiting. It also can be dumped into disk by +`sc.dump_profiles(path)`. If some of the profile results had been displayed maually, +they will not be displayed automatically before driver exiting. --- End diff -- Ah, right. If it's been manually dumped, then it won't be dumped again when exiting. If it's been manually dumped _or_ displayed, then it won't be displayed when exiting. This makes sense; sorry for the confusion.
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2351#discussion_r18071567 --- Diff: docs/configuration.md --- @@ -207,6 +207,25 @@ Apart from these, the following properties are also available, and may be useful + spark.python.profile + false + +Enable profiling in Python worker, the profile result will show up by `sc.show_profiles()`, +or it will be displayed before the driver exiting. It also can be dumped into disk by +`sc.dump_profiles(path)`. If some of the profile results had been displayed maually, +they will not be displayed automatically before driver exiting. --- End diff -- if `showed` is true, it will not be displayed again, but will be dumped.
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2351#discussion_r18070270 --- Diff: docs/configuration.md --- @@ -207,6 +207,25 @@ Apart from these, the following properties are also available, and may be useful + spark.python.profile + false + +Enable profiling in Python worker, the profile result will show up by `sc.show_profiles()`, +or it will be displayed before the driver exiting. It also can be dumped into disk by +`sc.dump_profiles(path)`. If some of the profile results had been displayed maually, +they will not be displayed automatically before driver exiting. --- End diff -- It looks like we clear `_profile_stats` when we perform manual `dump_profiles()` calls, but not when we call `show_profiles()`, so it seems like this is half-true (unless I've overlooked something).
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2351#discussion_r18067596 --- Diff: docs/configuration.md --- @@ -207,6 +207,25 @@ Apart from these, the following properties are also available, and may be useful + spark.python.profile + false + +Enable profiling in Python worker, the profile result will show up by `sc.show_profiles()`, +or it will be displayed before the driver exiting. It also can be dumped into disk by +`sc.dump_profiles(path)`. If some of the profile results had been displayed maually, +they will not be displayed automatically before driver exiting. --- End diff -- I think it's true.
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2351#discussion_r18067340 --- Diff: docs/configuration.md --- @@ -207,6 +207,25 @@ Apart from these, the following properties are also available, and may be useful + spark.python.profile + false + +Enable profiling in Python worker, the profile result will show up by `sc.show_profiles()`, +or it will be displayed before the driver exiting. It also can be dumped into disk by +`sc.dump_profiles(path)`. If some of the profile results had been displayed maually, +they will not be displayed automatically before driver exiting. --- End diff -- Is this still true? It looks like we now use a `showed` flag to detect whether they've been printed instead of clearing the profiles array.
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-56897188 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/156/consoleFull) for PR 2351 at commit [`7ef2aa0`](https://github.com/apache/spark/commit/7ef2aa05cf07b2648cb73cd05f2ece93a44d9b9a). * This patch **fails** unit tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class PStatsParam(AccumulatorParam):`
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-56891432 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/156/consoleFull) for PR 2351 at commit [`7ef2aa0`](https://github.com/apache/spark/commit/7ef2aa05cf07b2648cb73cd05f2ece93a44d9b9a). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-56890919 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20822/
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-56889367 @JoshRosen Sorry for this mistake; fixed.
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-56863846 I noticed that we don't have any automated tests for `show_profiles()`, so I tested it manually and found a problem when running this file through `spark-submit`:
```python
from pyspark import SparkContext, SparkConf
conf = SparkConf()
conf.set("spark.python.profile", "true")
sc = SparkContext(appName="test", conf=conf)
count = sc.parallelize(range(1)).count()
sc.show_profiles()
```
This results in:
```
Traceback (most recent call last):
  File "/Users/joshrosen/Documents/spark/test.py", line 6, in
    sc.show_profiles()
  File "/Users/joshrosen/Documents/Spark/python/pyspark/context.py", line 811, in show_profiles
    for i, (id, acc, showed) in self._profile_stats:
ValueError: too many values to unpack
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/Users/joshrosen/anaconda/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/Users/joshrosen/Documents/Spark/python/pyspark/context.py", line 811, in show_profiles
    for i, (id, acc, showed) in self._profile_stats:
ValueError: too many values to unpack
Error in sys.exitfunc:
Traceback (most recent call last):
  File "/Users/joshrosen/anaconda/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/Users/joshrosen/Documents/Spark/python/pyspark/context.py", line 811, in show_profiles
    for i, (id, acc, showed) in self._profile_stats:
ValueError: too many values to unpack
```
Can we add a test for this, too?
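The `ValueError` above comes from unpacking each three-element entry into the two names `i, (id, acc, showed)`; a minimal standalone sketch of the likely fix (an assumption, not necessarily the committed patch) is to supply the index with `enumerate()`:

```python
# Each entry mirrors the PR's [rdd_id, accumulator, showed] triples; None
# stands in for the accumulator so the sketch stays self-contained.
profile_stats = [[0, None, False], [1, None, False]]

# Buggy form: "for i, (rdd_id, acc, showed) in profile_stats" tries to unpack
# a three-element entry into two names and raises ValueError.

# Fixed form: enumerate() supplies the index and each entry unpacks cleanly.
for i, (rdd_id, acc, showed) in enumerate(profile_stats):
    if not showed:
        profile_stats[i][2] = True  # mark this profile as already shown
```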
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-56859664 This looks good to me. Thanks!
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-56763368 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20771/consoleFull) for PR 2351 at commit [`2b0daf2`](https://github.com/apache/spark/commit/2b0daf207384b7cbf15a180bb05985fb596e8281). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class PStatsParam(AccumulatorParam):`
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-56763374 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20771/
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-56759309 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20771/consoleFull) for PR 2351 at commit [`2b0daf2`](https://github.com/apache/spark/commit/2b0daf207384b7cbf15a180bb05985fb596e8281). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user shaneknapp commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-56758906 jenkins, retest this please
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-56758832 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20769/
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-56758830 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20769/consoleFull) for PR 2351 at commit [`2b0daf2`](https://github.com/apache/spark/commit/2b0daf207384b7cbf15a180bb05985fb596e8281). * This patch **fails** unit tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class PStatsParam(AccumulatorParam):`
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-56758654 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/147/consoleFull) for PR 2351 at commit [`2b0daf2`](https://github.com/apache/spark/commit/2b0daf207384b7cbf15a180bb05985fb596e8281). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class PStatsParam(AccumulatorParam):`
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-56753858 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20769/consoleFull) for PR 2351 at commit [`2b0daf2`](https://github.com/apache/spark/commit/2b0daf207384b7cbf15a180bb05985fb596e8281). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-56753750 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/147/consoleFull) for PR 2351 at commit [`2b0daf2`](https://github.com/apache/spark/commit/2b0daf207384b7cbf15a180bb05985fb596e8281). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-56753598 (I killed the test here so that I could re-run it with the newer commits).
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-56753544 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20767/
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2351#discussion_r18006502 --- Diff: docs/configuration.md --- @@ -207,6 +207,25 @@ Apart from these, the following properties are also available, and may be useful + spark.python.profile + false + +Enable profiling in Python worker, the profile result will show up by `sc.show_profiles()`, +or it will be showed up before the driver exiting. It also can be dumped into disk by +`sc.dump_profiles(path)`. If some of the profile results had been showed up maually, +they will not be showed up automatically before driver exiting. + + + + spark.python.profile.dump + (none) + +The directory which is used to dump the profile result before driver exiting. +The results will be dumped as separated file for each RDD. They can be loaded +by ptats.Stats(). If this is specified, the profile result will not be showed up --- End diff -- Instead of "showed up", how about "displayed"?
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2351#discussion_r18006482 --- Diff: python/pyspark/context.py --- @@ -793,6 +796,40 @@ def runJob(self, rdd, partitionFunc, partitions=None, allowLocal=False): it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal) return list(mappedRDD._collect_iterator_through_file(it)) +def _add_profile(self, id, profileAcc): +if not self._profile_stats: +dump_path = self._conf.get("spark.python.profile.dump") +if dump_path: +atexit.register(self.dump_profiles, dump_path) +else: +atexit.register(self.show_profiles) + +self._profile_stats.append([id, profileAcc, False]) + +def show_profiles(self): +""" Print the profile stats to stdout """ +for i, (id, acc, showed) in self._profile_stats: +stats = acc.value +if not showed and stats: +print "=" * 60 +print "Profile of RDD" % id +print "=" * 60 +stats.sort_stats("tottime", "cumtime").print_stats() +# mark it as showed +self._profile_stats[i][2] = True +def dump_profiles(self, path): +""" Dump the profile stats into directory `path` +""" +if not os.path.exists(path): +os.makedirs(path) +for id, acc, _ in self._created_profiles: --- End diff -- This should probably be `self._profile_stats`
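A minimal sketch of how the corrected loop could look once it iterates over `self._profile_stats` (an illustrative assumption; the `rdd_<id>.pstats` file name is hypothetical and the real patch may differ):

```python
import os


def dump_profiles(profile_stats, path):
    # Write one stats file per RDD entry under `path`, iterating over the
    # same list that show_profiles() uses. The "rdd_<id>.pstats" naming is
    # a hypothetical choice for this sketch only.
    if not os.path.exists(path):
        os.makedirs(path)
    for rdd_id, acc, _ in profile_stats:
        stats = acc.value
        if stats:
            stats.dump_stats(os.path.join(path, "rdd_%d.pstats" % rdd_id))
```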
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2351#discussion_r18006453 --- Diff: python/pyspark/rdd.py --- @@ -2025,6 +2025,7 @@ class PipelinedRDD(RDD): >>> rdd.flatMap(lambda x: [x, x]).reduce(add) 20 """ +_created_profiles = [] --- End diff -- Now that you've moved the other functions, I think you need to move this to SparkContext.
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-56752977 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20767/consoleFull) for PR 2351 at commit [`cba9463`](https://github.com/apache/spark/commit/cba94639fa6e5c4b2cb26f3152ea80bffaf65cce). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-56752778 @JoshRosen I have addressed your comments; please take another look, thanks!
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2351#discussion_r18005737 --- Diff: python/pyspark/rdd.py --- @@ -2081,8 +2085,44 @@ def _jrdd(self): self.ctx.pythonExec, broadcast_vars, self.ctx._javaAccumulator) self._jrdd_val = python_rdd.asJavaRDD() + +if enable_profile: +self._id = self._jrdd_val.id() +if not self._created_profiles: +dump_path = self.ctx._conf.get("spark.python.profile.dump") +if dump_path: +atexit.register(PipelinedRDD.dump_profile, dump_path) +else: +atexit.register(PipelinedRDD.show_profile) +self._created_profiles.append((self._id, profileStats)) + return self._jrdd_val +@classmethod +def show_profile(cls): --- End diff -- To avoid potential confusion, what do you think about having `show_profile` and `dump_profile` throw exceptions if `spark.python.profile` is false?
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2351#discussion_r18005654 --- Diff: python/pyspark/rdd.py --- @@ -2081,8 +2085,44 @@ def _jrdd(self): self.ctx.pythonExec, broadcast_vars, self.ctx._javaAccumulator) self._jrdd_val = python_rdd.asJavaRDD() + +if enable_profile: +self._id = self._jrdd_val.id() +if not self._created_profiles: +dump_path = self.ctx._conf.get("spark.python.profile.dump") +if dump_path: +atexit.register(PipelinedRDD.dump_profile, dump_path) +else: +atexit.register(PipelinedRDD.show_profile) +self._created_profiles.append((self._id, profileStats)) + return self._jrdd_val +@classmethod +def show_profile(cls): --- End diff -- That seems fine to me.
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2351#discussion_r18002615 --- Diff: python/pyspark/rdd.py --- @@ -2081,8 +2085,44 @@ def _jrdd(self): self.ctx.pythonExec, broadcast_vars, self.ctx._javaAccumulator) self._jrdd_val = python_rdd.asJavaRDD() + +if enable_profile: +self._id = self._jrdd_val.id() +if not self._created_profiles: +dump_path = self.ctx._conf.get("spark.python.profile.dump") +if dump_path: +atexit.register(PipelinedRDD.dump_profile, dump_path) +else: +atexit.register(PipelinedRDD.show_profile) +self._created_profiles.append((self._id, profileStats)) + return self._jrdd_val +@classmethod +def show_profile(cls): --- End diff -- How about putting it in SparkContext?
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2351#discussion_r17993979 --- Diff: python/pyspark/rdd.py --- @@ -2081,8 +2085,44 @@ def _jrdd(self): self.ctx.pythonExec, broadcast_vars, self.ctx._javaAccumulator) self._jrdd_val = python_rdd.asJavaRDD() + +if enable_profile: +self._id = self._jrdd_val.id() +if not self._created_profiles: +dump_path = self.ctx._conf.get("spark.python.profile.dump") +if dump_path: +atexit.register(PipelinedRDD.dump_profile, dump_path) +else: +atexit.register(PipelinedRDD.show_profile) +self._created_profiles.append((self._id, profileStats)) + return self._jrdd_val +@classmethod +def show_profile(cls): +""" Print the profile stats to stdout """ +for id, acc in cls._created_profiles: +stats = acc.value +if stats: +print "=" * 60 +print "Profile of RDD" % id +print "=" * 60 +stats.sort_stats("tottime", "cumtime").print_stats() +cls._created_profiles = [] --- End diff -- Should we document that this clears the created profiles? I guess the intended usage here is to run a bunch of code interactively then print the profiling data for everything that's run since the last time I called `show_profile`.
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-56723672 I really like this approach to profiling; it's a very clever use of accumulators. My only feedback concerns UX / UI issues (see a few comments above RE: configuration options and docs).
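To make the accumulator idea concrete, here is a minimal illustrative sketch of an accumulator parameter that merges per-task profiling stats; the class name and structure are assumptions for illustration, not the PR's actual `PStatsParam`:

```python
import cProfile
import pstats


class PStatsParamSketch(object):
    """Merge pstats.Stats objects collected on workers into one aggregate
    on the driver (an illustrative stand-in, not the PR's PStatsParam)."""

    @staticmethod
    def zero(value):
        # The empty aggregate: no stats collected yet.
        return None

    @staticmethod
    def addInPlace(value1, value2):
        # Merge two partial aggregates; either side may still be None.
        if value1 is None:
            return value2
        if value2 is not None:
            value1.add(value2)
        return value1


# Worker-side sketch: profile one task's work and convert it to Stats.
profiler = cProfile.Profile()
profiler.runcall(sum, range(1000))
task_stats = pstats.Stats(profiler)

# Driver-side sketch: fold the task's stats into the running aggregate.
merged = PStatsParamSketch.addInPlace(PStatsParamSketch.zero(None), task_stats)
merged.sort_stats("tottime", "cumtime").print_stats(5)
```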
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2351#discussion_r17993687 --- Diff: docs/configuration.md --- @@ -207,6 +207,22 @@ Apart from these, the following properties are also available, and may be useful + spark.python.profile + false + +Enable profiling in Python worker, the profile result will show up by `rdd.show_profile()`, +or it will show up before the driver exit. It also can be dumped into disk by +`rdd.dump_profile(path)`. + + + + spark.python.profile.dump + (none) + +The directory which is used to dump the profile result. The results will be dumped +as sepereted file for each RDD. They can be loaded by ptats.Stats(). --- End diff -- Maybe the docs here could be a little more explicit about how setting this option enables automatic profile dumping when the driver exits and disables automatic printing.
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2351#discussion_r17993579 --- Diff: docs/configuration.md --- @@ -207,6 +207,22 @@ Apart from these, the following properties are also available, and may be useful + spark.python.profile + false + +Enable profiling in Python worker, the profile result will show up by `rdd.show_profile()`, +or it will show up before the driver exit. It also can be dumped into disk by +`rdd.dump_profile(path)`. + + + + spark.python.profile.dump + (none) + +The directory which is used to dump the profile result. The results will be dumped +as sepereted file for each RDD. They can be loaded by ptats.Stats(). --- End diff -- Typo: "sepereted" -> "a separate". It looks like this `spark.python.profile.dump` is only used for dumping files when the job exits?
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2351#discussion_r17992966 --- Diff: python/pyspark/rdd.py --- @@ -2081,8 +2085,44 @@ def _jrdd(self): self.ctx.pythonExec, broadcast_vars, self.ctx._javaAccumulator) self._jrdd_val = python_rdd.asJavaRDD() + +if enable_profile: +self._id = self._jrdd_val.id() +if not self._created_profiles: +dump_path = self.ctx._conf.get("spark.python.profile.dump") +if dump_path: +atexit.register(PipelinedRDD.dump_profile, dump_path) +else: +atexit.register(PipelinedRDD.show_profile) +self._created_profiles.append((self._id, profileStats)) + return self._jrdd_val +@classmethod +def show_profile(cls): +""" Print the profile stats to stdout """ +for id, acc in cls._created_profiles: +stats = acc.value +if stats: +print "=" * 60 +print "Profile of RDD" % id +print "=" * 60 +stats.sort_stats("tottime", "cumtime").print_stats() +cls._created_profiles = [] + +@classmethod +def dump_profile(cls, dump_path): --- End diff -- Maybe this should be `dump_profiles` plural or `dump_profiling_data`?
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2351#discussion_r17992850 --- Diff: python/pyspark/rdd.py --- @@ -2081,8 +2085,44 @@ def _jrdd(self): self.ctx.pythonExec, broadcast_vars, self.ctx._javaAccumulator) self._jrdd_val = python_rdd.asJavaRDD() + +if enable_profile: +self._id = self._jrdd_val.id() +if not self._created_profiles: +dump_path = self.ctx._conf.get("spark.python.profile.dump") +if dump_path: +atexit.register(PipelinedRDD.dump_profile, dump_path) +else: +atexit.register(PipelinedRDD.show_profile) +self._created_profiles.append((self._id, profileStats)) + return self._jrdd_val +@classmethod +def show_profile(cls): --- End diff -- What do you think about introducing a new class to hold the PySpark profiling methods? We could call it something like `Profiling` and put it in `profiling.py`. It seems a little weird to allow `show_profile()` to be called on a particular RDD instance when the method prints all created profiles.
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-56605482 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20730/consoleFull) for PR 2351 at commit [`fb9565b`](https://github.com/apache/spark/commit/fb9565b2afdd7fbaa1cc6cf4b1971fba2d9919b0). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class PStatsParam(AccumulatorParam):`
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-56605488 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20730/
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-56599152 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20730/consoleFull) for PR 2351 at commit [`fb9565b`](https://github.com/apache/spark/commit/fb9565b2afdd7fbaa1cc6cf4b1971fba2d9919b0). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-55938151 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20469/consoleFull) for PR 2351 at commit [`116d52a`](https://github.com/apache/spark/commit/116d52a1251140282a2cd5c49ad928b219c759b5). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class PStatsParam(AccumulatorParam):`
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-55928227 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20469/consoleFull) for PR 2351 at commit [`116d52a`](https://github.com/apache/spark/commit/116d52a1251140282a2cd5c49ad928b219c759b5). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-55516221 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20297/consoleFull) for PR 2351 at commit [`09d02c3`](https://github.com/apache/spark/commit/09d02c3349659856a24e0c4ee84e3b6c5317). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class PStatsParam(AccumulatorParam):`
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-55515166 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20297/consoleFull) for PR 2351 at commit [`09d02c3`](https://github.com/apache/spark/commit/09d02c3349659856a24e0c4ee84e3b6c5317). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-55483390

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20252/consoleFull) for PR 2351 at commit [`15d6f18`](https://github.com/apache/spark/commit/15d6f18fd97422ff7bebf343383b7eca9ef433bc).
* This patch **passes** unit tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class TaskCompletionListenerException(errorMessages: Seq[String]) extends Exception`
  * `class PStatsParam(AccumulatorParam):`
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-55482333

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20253/consoleFull) for PR 2351 at commit [`c23865c`](https://github.com/apache/spark/commit/c23865c6307963f97420d9213d6fb26ab0163f0d).
* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class TaskCompletionListenerException(errorMessages: Seq[String]) extends Exception`
  * `class PStatsParam(AccumulatorParam):`
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-55482315

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20253/consoleFull) for PR 2351 at commit [`c23865c`](https://github.com/apache/spark/commit/c23865c6307963f97420d9213d6fb26ab0163f0d).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-55482240

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20252/consoleFull) for PR 2351 at commit [`15d6f18`](https://github.com/apache/spark/commit/15d6f18fd97422ff7bebf343383b7eca9ef433bc).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-55482196

@JoshRosen I have addressed your comment, and also added docs for the configs along with tests. I realized that the profiling results can also be shown interactively, via rdd.show_profile(); I have updated the PR description accordingly.
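A minimal sketch of the interactive usage described in the comment above, assuming the `spark.python.profile` setting and the `rdd.show_profile()` method from this PR; the app name, RDD, and lambda are illustrative, and the API shown here may differ from what was ultimately merged:

```
from pyspark import SparkConf, SparkContext

# Enable Python-side profiling (off by default, per this PR).
conf = SparkConf().set("spark.python.profile", "true")
sc = SparkContext(appName="profile-demo", conf=conf)

# Run a job so the executors collect cProfile statistics for this RDD.
rdd = sc.parallelize(range(100000), 4).map(lambda x: x * x)
rdd.count()

# Show the accumulated profile interactively, as described in the comment above.
rdd.show_profile()
```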
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-55481451

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20250/consoleFull) for PR 2351 at commit [`4f8309d`](https://github.com/apache/spark/commit/4f8309d7d8df18fb5f4da1d9f150d7606bf650c9).
* This patch **passes** unit tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class PStatsParam(AccumulatorParam):`
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-55480246

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20250/consoleFull) for PR 2351 at commit [`4f8309d`](https://github.com/apache/spark/commit/4f8309d7d8df18fb5f4da1d9f150d7606bf650c9).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2351#discussion_r17509328

--- Diff: python/pyspark/accumulators.py ---
@@ -215,6 +215,21 @@ def addInPlace(self, value1, value2):
 COMPLEX_ACCUMULATOR_PARAM = AddingAccumulatorParam(0.0j)
+class StatsParam(AccumulatorParam):
--- End diff --

Do you think it would be clearer to name this `ProfilingStatsParam` or `PStatsParam`?
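For context on what such an accumulator param has to do, here is a rough sketch (not the PR's actual implementation) of an `AccumulatorParam` that merges per-task `pstats.Stats` objects; merging via `pstats.Stats.add` is an assumption about one way this could be written:

```
import pstats
from pyspark.accumulators import AccumulatorParam

class PStatsParam(AccumulatorParam):
    """Sketch of an AccumulatorParam that merges per-task cProfile results."""

    def zero(self, value):
        # Start from an empty value; the first task's Stats becomes the base.
        return None

    def addInPlace(self, value1, value2):
        # Merge two pstats.Stats values, tolerating a missing side.
        if value1 is None:
            return value2
        if value2 is not None:
            value1.add(value2)  # pstats.Stats.add merges another Stats object
        return value1
```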
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-55224983

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/66/consoleFull) for PR 2351 at commit [`0a5b6eb`](https://github.com/apache/spark/commit/0a5b6ebcd38f13fa15721c56a9d96bd9000529f5).
* This patch **passes** unit tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class StatsParam(AccumulatorParam):`
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-55220913

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/66/consoleFull) for PR 2351 at commit [`0a5b6eb`](https://github.com/apache/spark/commit/0a5b6ebcd38f13fa15721c56a9d96bd9000529f5).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-55219072

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20135/consoleFull) for PR 2351 at commit [`0a5b6eb`](https://github.com/apache/spark/commit/0a5b6ebcd38f13fa15721c56a9d96bd9000529f5).
* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class StatsParam(AccumulatorParam):`
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-55216348

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20135/consoleFull) for PR 2351 at commit [`0a5b6eb`](https://github.com/apache/spark/commit/0a5b6ebcd38f13fa15721c56a9d96bd9000529f5).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-55212763

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20128/consoleFull) for PR 2351 at commit [`4b20494`](https://github.com/apache/spark/commit/4b20494ce4e5e287a09fee5df5e0684711258627).
* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class StatsParam(AccumulatorParam):`
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-55211000

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/48/consoleFull) for PR 2351 at commit [`4b20494`](https://github.com/apache/spark/commit/4b20494ce4e5e287a09fee5df5e0684711258627).
* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class StatsParam(AccumulatorParam):`
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-55209533

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20128/consoleFull) for PR 2351 at commit [`4b20494`](https://github.com/apache/spark/commit/4b20494ce4e5e287a09fee5df5e0684711258627).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-55207154

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/48/consoleFull) for PR 2351 at commit [`4b20494`](https://github.com/apache/spark/commit/4b20494ce4e5e287a09fee5df5e0684711258627).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2351

[SPARK-3478] [PySpark] Profile the Python tasks

This patch adds profiling support for PySpark. It will show the profiling results before the driver exits; here is one example:

```
Profile of RDD

         5146507 function calls (5146487 primitive calls) in 71.094 seconds

   Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  5144576   68.331    0.000   68.331    0.000 statcounter.py:44(merge)
       20    2.735    0.137   71.071    3.554 statcounter.py:33(__init__)
       20    0.017    0.001    0.017    0.001 {cPickle.dumps}
     1024    0.003    0.000    0.003    0.000 t.py:16()
       20    0.001    0.000    0.001    0.000 {reduce}
       21    0.001    0.000    0.001    0.000 {cPickle.loads}
       20    0.001    0.000    0.001    0.000 copy_reg.py:95(_slotnames)
       41    0.001    0.000    0.001    0.000 serializers.py:461(read_int)
       40    0.001    0.000    0.002    0.000 serializers.py:179(_batched)
       62    0.000    0.000    0.000    0.000 {method 'read' of 'file' objects}
       20    0.000    0.000   71.072    3.554 rdd.py:863()
       20    0.000    0.000    0.001    0.000 serializers.py:198(load_stream)
    40/20    0.000    0.000   71.072    3.554 rdd.py:2093(pipeline_func)
       41    0.000    0.000    0.002    0.000 serializers.py:130(load_stream)
       40    0.000    0.000   71.072    1.777 rdd.py:304(func)
       20    0.000    0.000   71.094    3.555 worker.py:82(process)
       40    0.000    0.000    0.001    0.000 rdd.py:741(func)
       20    0.000    0.000    0.018    0.001 serializers.py:137(_write_with_length)
       20    0.000    0.000    0.020    0.001 serializers.py:195(dump_stream)
       20    0.000    0.000    0.000    0.000 serializers.py:201(_load_stream_without_unbatching)
       20    0.000    0.000    0.000    0.000 {hasattr}
       41    0.000    0.000    0.002    0.000 serializers.py:145(_read_with_length)
       40    0.000    0.000    0.000    0.000 {built-in method from_iterable}
       20    0.000    0.000    0.000    0.000 serializers.py:468(write_int)
       20    0.000    0.000    0.018    0.001 serializers.py:355(dumps)
       20    0.000    0.000    0.020    0.001 serializers.py:126(dump_stream)
       20    0.000    0.000    0.000    0.000 {method 'get' of 'dictproxy' objects}
       20    0.000    0.000    0.000    0.000 rdd.py:291(func)
       40    0.000    0.000    0.000    0.000 {method 'write' of 'file' objects}
       20    0.000    0.000    0.000    0.000 {_struct.pack}
       21    0.000    0.000    0.000    0.000 {_struct.unpack}
       20    0.000    0.000    0.000    0.000 {iter}
       20    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
       20    0.000    0.000    0.000    0.000 {len}
       20    0.000    0.000    0.000    0.000 {next}
       20    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
```

Profiling is disabled by default; it can be enabled by setting "spark.python.profile=true". Users can also dump the results to disk for later analysis by setting "spark.python.profile.dump=path_to_dump".

PS: will update the docs later.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/davies/spark profiler

Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2351.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #2351

commit 4b20494ce4e5e287a09fee5df5e0684711258627
Author: Davies Liu
Date: 2014-09-11T00:51:28Z

    add profile for python
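A hedged sketch of the dump-to-disk workflow described above: the two configuration keys come from the PR description, while the dump directory, the file name inside it, and the offline analysis via the standard `pstats` module are illustrative assumptions:

```
import pstats
from pyspark import SparkConf, SparkContext

# Enable profiling and ask for the results to be dumped for later analysis.
conf = (SparkConf()
        .set("spark.python.profile", "true")
        .set("spark.python.profile.dump", "/tmp/pyspark_profile"))  # path is illustrative
sc = SparkContext(appName="profile-dump-demo", conf=conf)

sc.parallelize(range(100000), 8).map(lambda x: x % 7).countByValue()
sc.stop()

# Offline analysis of a dumped profile with the standard library;
# the exact file name produced by the dump option is an assumption.
stats = pstats.Stats("/tmp/pyspark_profile/rdd_1.pstats")
stats.sort_stats("cumulative").print_stats(10)
```

Reading the dumped stats back with `pstats` keeps the analysis independent of Spark, so a job's hot spots can be inspected long after the driver has exited.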