[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/2556


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-30 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2556#issuecomment-57407523
  
I've merged this.  Thanks for the fix!





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2556#issuecomment-57060226
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/169/consoleFull)
 for   PR 2556 at commit 
[`e68df5a`](https://github.com/apache/spark/commit/e68df5a2ada0044f76d748f4e5dd250a1928812b).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2556#issuecomment-57058156
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/169/consoleFull)
 for   PR 2556 at commit 
[`e68df5a`](https://github.com/apache/spark/commit/e68df5a2ada0044f76d748f4e5dd250a1928812b).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2556#issuecomment-57042267
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20903/





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-26 Thread davies
GitHub user davies opened a pull request:

https://github.com/apache/spark/pull/2556

[SPARK-3478] [PySpark] Profile the Python tasks

This patch adds profiling support for PySpark. It shows the profiling results
before the driver exits. Here is one example:

```

Profile of RDD

         5146507 function calls (5146487 primitive calls) in 71.094 seconds

   Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  5144576   68.331    0.000   68.331    0.000 statcounter.py:44(merge)
       20    2.735    0.137   71.071    3.554 statcounter.py:33(__init__)
       20    0.017    0.001    0.017    0.001 {cPickle.dumps}
     1024    0.003    0.000    0.003    0.000 t.py:16()
       20    0.001    0.000    0.001    0.000 {reduce}
       21    0.001    0.000    0.001    0.000 {cPickle.loads}
       20    0.001    0.000    0.001    0.000 copy_reg.py:95(_slotnames)
       41    0.001    0.000    0.001    0.000 serializers.py:461(read_int)
       40    0.001    0.000    0.002    0.000 serializers.py:179(_batched)
       62    0.000    0.000    0.000    0.000 {method 'read' of 'file' objects}
       20    0.000    0.000   71.072    3.554 rdd.py:863()
       20    0.000    0.000    0.001    0.000 serializers.py:198(load_stream)
    40/20    0.000    0.000   71.072    3.554 rdd.py:2093(pipeline_func)
       41    0.000    0.000    0.002    0.000 serializers.py:130(load_stream)
       40    0.000    0.000   71.072    1.777 rdd.py:304(func)
       20    0.000    0.000   71.094    3.555 worker.py:82(process)
```
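
For readers unfamiliar with this report layout: it is the standard output of Python's `cProfile` and `pstats` modules. The sketch below is not the code from this patch. It only shows, with a hypothetical `busy_sum` function and a hypothetical `profile_report` helper, how a report ordered by internal time and then cumulative time can be produced (written for Python 3).

```python
import cProfile
import io
import pstats


def busy_sum(n):
    # Hypothetical workload standing in for a Python task.
    return sum(i * i for i in range(n))


def profile_report(func, *args):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # "time" is internal time (tottime); "cumulative" is cumtime.
    stats.sort_stats("time", "cumulative").print_stats()
    return result, buf.getvalue()


result, report = profile_report(busy_sum, 1000)
print(report)
```

The two sort keys correspond to the "Ordered by: internal time, cumulative time" line in the report.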

Also, users can show the profile results manually with `sc.show_profiles()`
or dump them to disk with `sc.dump_profiles(path)`, for example:

```python
>>> sc._conf.set("spark.python.profile", "true")
>>> rdd = sc.parallelize(range(100)).map(str)
>>> rdd.count()
100
>>> sc.show_profiles()

Profile of RDD

 284 function calls (276 primitive calls) in 0.001 seconds

   Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        4    0.000    0.000    0.000    0.000 serializers.py:198(load_stream)
        4    0.000    0.000    0.000    0.000 {reduce}
     12/4    0.000    0.000    0.001    0.000 rdd.py:2092(pipeline_func)
        4    0.000    0.000    0.000    0.000 {cPickle.loads}
        4    0.000    0.000    0.000    0.000 {cPickle.dumps}
      104    0.000    0.000    0.000    0.000 rdd.py:852()
        8    0.000    0.000    0.000    0.000 serializers.py:461(read_int)
       12    0.000    0.000    0.000    0.000 rdd.py:303(func)
```
Profiling is disabled by default and can be enabled by setting
"spark.python.profile=true".

Users can also dump the results to disk automatically for later
analysis by setting "spark.python.profile.dump=path_to_dump".
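
Dumped results can be inspected later with the standard library. The sketch below assumes the dump is an ordinary `pstats`-compatible file, which is an assumption since the thread does not state the on-disk format. It creates its own file with `cProfile` under a hypothetical name rather than reading one written by Spark.

```python
import cProfile
import os
import pstats
import tempfile


def work():
    # Hypothetical function to profile.
    return sorted(range(1000), reverse=True)[0]


# Write a profile to disk, as a dump kept for later analysis would be.
dump_dir = tempfile.mkdtemp()
dump_path = os.path.join(dump_dir, "example.pstats")  # hypothetical file name
profiler = cProfile.Profile()
profiler.runcall(work)
profiler.dump_stats(dump_path)

# Later: load the file and print the five most expensive entries.
loaded = pstats.Stats(dump_path)
loaded.sort_stats("time", "cumulative").print_stats(5)
```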

This is a bugfix of #2351. cc @JoshRosen

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/davies/spark profiler

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2556.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2556


commit 4b20494ce4e5e287a09fee5df5e0684711258627
Author: Davies Liu 
Date:   2014-09-11T00:51:28Z

add profile for python

commit 0a5b6ebcd38f13fa15721c56a9d96bd9000529f5
Author: Davies Liu 
Date:   2014-09-11T03:25:23Z

fix Python UDF

commit 4f8309d7d8df18fb5f4da1d9f150d7606bf650c9
Author: Davies Liu 
Date:   2014-09-13T03:14:34Z

address comment, add tests

commit dadee1a228b20d24e4a6b0a7d081f1b30f773988
Author: Davies Liu 
Date:   2014-09-13T04:51:33Z

add docs string and clear profiles after show or dump

commit 15d6f18fd97422ff7bebf343383b7eca9ef433bc
Author: Davies Liu 
Date:   2014-09-13T05:09:06Z

add docs for two configs

commit c23865c6307963f97420d9213d6fb26ab0163f0d
Author: Davies Liu 
Date:   2014-09-13T05:14:19Z

Merge branch 'master' into profiler

commit 09d02c3349659856a24e0c4ee84e3b6c5317
Author: Davies Liu 
Date:   2014-09-14T04:23:19Z

Merge branch 'master' into profiler

Conflicts:
docs/configuration.md

commit 116d52a1251140282a2cd5c49ad928b219c759b5
Author: Davies Liu 
Date:   2014-09-17T17:14:53Z

Merge branch 'master' of github.com:apache/spark into profiler

Conflicts:
python/pyspark/worker.py

commit fb9565b2afdd7fbaa1cc6cf4b1971fba2d99

[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-26 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-57024838
  
Whoops, looks like this failed unit tests and caused a build-break.  I'm 
going to revert it to un-break the build while we investigate.





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-26 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-56988133
  
Thanks for reviewing this; your comments made it much better.





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-26 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/2351





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-26 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2351#discussion_r18098324
  
--- Diff: docs/configuration.md ---
@@ -207,6 +207,25 @@ Apart from these, the following properties are also 
available, and may be useful
   
 
 
+  spark.python.profile
+  false
+  
+Enable profiling in Python worker, the profile result will show up by 
`sc.show_profiles()`,
+or it will be displayed before the driver exiting. It also can be 
dumped into disk by
+`sc.dump_profiles(path)`. If some of the profile results had been 
displayed maually,
+they will not be displayed automatically before driver exiting.
--- End diff --

Ah, right.  If it's been manually dumped, then it won't be dumped again when 
exiting.  If it's been manually dumped _or_ displayed, then it won't be 
displayed when exiting.

This makes sense; sorry for the confusion.
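
The behaviour agreed on in this exchange can be modelled in a few lines. This is a simplified sketch, not the code in `context.py`: the `ProfileRegistry` class and its method bodies are hypothetical, and only the `showed` flag and the rule about clearing on dump come from the discussion.

```python
class ProfileRegistry:
    """Toy model of the show/dump bookkeeping discussed above."""

    def __init__(self):
        self._profile_stats = []  # entries are [rdd_id, stats, showed]

    def add(self, rdd_id, stats):
        self._profile_stats.append([rdd_id, stats, False])

    def show_profiles(self):
        """Return the ids displayed now; each profile is shown at most once."""
        shown = []
        for entry in self._profile_stats:
            rdd_id, _stats, showed = entry
            if not showed:
                shown.append(rdd_id)
                entry[2] = True  # mark as displayed, but keep it for dumping
        return shown

    def dump_profiles(self):
        """Return the ids dumped now, then clear them so they are not dumped again."""
        dumped = [rdd_id for rdd_id, _stats, _showed in self._profile_stats]
        self._profile_stats = []
        return dumped


registry = ProfileRegistry()
registry.add(1, "stats-1")
registry.add(2, "stats-2")
first_show = registry.show_profiles()    # manual display
second_show = registry.show_profiles()   # e.g. at exit: nothing left to display
dumped = registry.dump_profiles()        # displayed profiles are still dumped
dumped_again = registry.dump_profiles()  # nothing left to dump
```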





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-25 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/2351#discussion_r18071567
  
--- Diff: docs/configuration.md ---
@@ -207,6 +207,25 @@ Apart from these, the following properties are also 
available, and may be useful
   
 
 
+  spark.python.profile
+  false
+  
+Enable profiling in Python worker, the profile result will show up by 
`sc.show_profiles()`,
+or it will be displayed before the driver exiting. It also can be 
dumped into disk by
+`sc.dump_profiles(path)`. If some of the profile results had been 
displayed maually,
+they will not be displayed automatically before driver exiting.
--- End diff --

If `showed` is true, it will not be displayed again, but it will still be dumped.





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-25 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2351#discussion_r18070270
  
--- Diff: docs/configuration.md ---
@@ -207,6 +207,25 @@ Apart from these, the following properties are also 
available, and may be useful
   
 
 
+  spark.python.profile
+  false
+  
+Enable profiling in Python worker, the profile result will show up by 
`sc.show_profiles()`,
+or it will be displayed before the driver exiting. It also can be 
dumped into disk by
+`sc.dump_profiles(path)`. If some of the profile results had been 
displayed maually,
+they will not be displayed automatically before driver exiting.
--- End diff --

It looks like we clear `_profile_stats` when we perform manual 
`dump_profiles()` calls, but not when we call `show_profiles()`, so it seems 
like this is half-true (unless I've overlooked something).





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-25 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/2351#discussion_r18067596
  
--- Diff: docs/configuration.md ---
@@ -207,6 +207,25 @@ Apart from these, the following properties are also 
available, and may be useful
   
 
 
+  spark.python.profile
+  false
+  
+Enable profiling in Python worker, the profile result will show up by 
`sc.show_profiles()`,
+or it will be displayed before the driver exiting. It also can be 
dumped into disk by
+`sc.dump_profiles(path)`. If some of the profile results had been 
displayed maually,
+they will not be displayed automatically before driver exiting.
--- End diff --

I think it's true.





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-25 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2351#discussion_r18067340
  
--- Diff: docs/configuration.md ---
@@ -207,6 +207,25 @@ Apart from these, the following properties are also 
available, and may be useful
   
 
 
+  spark.python.profile
+  false
+  
+Enable profiling in Python worker, the profile result will show up by 
`sc.show_profiles()`,
+or it will be displayed before the driver exiting. It also can be 
dumped into disk by
+`sc.dump_profiles(path)`. If some of the profile results had been 
displayed maually,
+they will not be displayed automatically before driver exiting.
--- End diff --

Is this still true?  It looks like we now use a `showed` flag to detect 
whether they've been printed instead of clearing the profiles array.





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-56897188
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/156/consoleFull)
 for   PR 2351 at commit 
[`7ef2aa0`](https://github.com/apache/spark/commit/7ef2aa05cf07b2648cb73cd05f2ece93a44d9b9a).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class PStatsParam(AccumulatorParam):`






[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-56891432
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/156/consoleFull)
 for   PR 2351 at commit 
[`7ef2aa0`](https://github.com/apache/spark/commit/7ef2aa05cf07b2648cb73cd05f2ece93a44d9b9a).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-56890919
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20822/





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-25 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-56889367
  
@JoshRosen sorry for this mistake, fixed.





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-25 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-56863846
  
I noticed that we don't have any automated tests for `show_profiles()`, so 
I tested it manually and found a problem when running this file through 
`spark-submit`:

```python
from pyspark import SparkContext, SparkConf
conf = SparkConf()
conf.set("spark.python.profile", "true")
sc = SparkContext(appName="test", conf=conf)
count = sc.parallelize(range(1)).count()
sc.show_profiles()
```

This results in:

```
Traceback (most recent call last):
  File "/Users/joshrosen/Documents/spark/test.py", line 6, in
    sc.show_profiles()
  File "/Users/joshrosen/Documents/Spark/python/pyspark/context.py", line 811, in show_profiles
    for i, (id, acc, showed) in self._profile_stats:
ValueError: too many values to unpack
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/Users/joshrosen/anaconda/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/Users/joshrosen/Documents/Spark/python/pyspark/context.py", line 811, in show_profiles
    for i, (id, acc, showed) in self._profile_stats:
ValueError: too many values to unpack
Error in sys.exitfunc:
Traceback (most recent call last):
  File "/Users/joshrosen/anaconda/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/Users/joshrosen/Documents/Spark/python/pyspark/context.py", line 811, in show_profiles
    for i, (id, acc, showed) in self._profile_stats:
```

Can we add a test for this, too?
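
The traceback is consistent with a loop header that unpacks an index from a list it never enumerates. The sketch below reproduces that failure mode with made-up data and shows one possible fix using `enumerate`. The thread does not say how the bug was actually fixed, so treat the fix as an assumption; the function names are hypothetical.

```python
# Hypothetical stand-in for self._profile_stats: entries of (id, acc, showed).
profile_stats = [(1, "acc-1", False), (2, "acc-2", False)]


def broken_iteration(entries):
    # Each entry has three items, so unpacking it into (i, (...)) fails.
    for i, (id_, acc, showed) in entries:
        pass


def fixed_iteration(entries):
    # enumerate() supplies the index that the loop header expects.
    return [(i, id_) for i, (id_, acc, showed) in enumerate(entries)]


try:
    broken_iteration(profile_stats)
    error = None
except ValueError as exc:
    error = exc

indexed = fixed_iteration(profile_stats)
```

A regression test along these lines would only need to call the display path once with profiling enabled and check that no exception is raised.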





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-25 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-56859664
  
This looks good to me.  Thanks!





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-56763368
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20771/consoleFull)
 for   PR 2351 at commit 
[`2b0daf2`](https://github.com/apache/spark/commit/2b0daf207384b7cbf15a180bb05985fb596e8281).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class PStatsParam(AccumulatorParam):`






[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-56763374
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20771/





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-56759309
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20771/consoleFull)
 for   PR 2351 at commit 
[`2b0daf2`](https://github.com/apache/spark/commit/2b0daf207384b7cbf15a180bb05985fb596e8281).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-24 Thread shaneknapp
Github user shaneknapp commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-56758906
  
jenkins, retest this please





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-56758832
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20769/





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-56758830
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20769/consoleFull)
 for   PR 2351 at commit 
[`2b0daf2`](https://github.com/apache/spark/commit/2b0daf207384b7cbf15a180bb05985fb596e8281).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class PStatsParam(AccumulatorParam):`






[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-56758654
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/147/consoleFull)
 for   PR 2351 at commit 
[`2b0daf2`](https://github.com/apache/spark/commit/2b0daf207384b7cbf15a180bb05985fb596e8281).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class PStatsParam(AccumulatorParam):`






[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-56753858
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20769/consoleFull)
 for   PR 2351 at commit 
[`2b0daf2`](https://github.com/apache/spark/commit/2b0daf207384b7cbf15a180bb05985fb596e8281).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-56753750
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/147/consoleFull)
 for   PR 2351 at commit 
[`2b0daf2`](https://github.com/apache/spark/commit/2b0daf207384b7cbf15a180bb05985fb596e8281).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-24 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-56753598
  
(I killed the test here so that I could re-run it with the newer commits).





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-56753544
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20767/





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-24 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2351#discussion_r18006502
  
--- Diff: docs/configuration.md ---
@@ -207,6 +207,25 @@ Apart from these, the following properties are also 
available, and may be useful
   
 
 
+  spark.python.profile
+  false
+  
+Enable profiling in Python worker, the profile result will show up by 
`sc.show_profiles()`,
+or it will be showed up before the driver exiting. It also can be 
dumped into disk by
+`sc.dump_profiles(path)`. If some of the profile results had been 
showed up maually,
+they will not be showed up automatically before driver exiting.
+  
+
+
+  spark.python.profile.dump
+  (none)
+  
+The directory which is used to dump the profile result before driver 
exiting. 
+The results will be dumped as separated file for each RDD. They can be 
loaded
+by ptats.Stats(). If this is specified, the profile result will not be 
showed up
--- End diff --

Instead of "showed up", how about "displayed"?





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-24 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2351#discussion_r18006482
  
--- Diff: python/pyspark/context.py ---
@@ -793,6 +796,40 @@ def runJob(self, rdd, partitionFunc, partitions=None, 
allowLocal=False):
 it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
javaPartitions, allowLocal)
 return list(mappedRDD._collect_iterator_through_file(it))
 
+def _add_profile(self, id, profileAcc):
+if not self._profile_stats:
+dump_path = self._conf.get("spark.python.profile.dump")
+if dump_path:
+atexit.register(self.dump_profiles, dump_path)
+else:
+atexit.register(self.show_profiles)
+
+self._profile_stats.append([id, profileAcc, False])
+
+def show_profiles(self):
+""" Print the profile stats to stdout """
+for i, (id, acc, showed) in self._profile_stats:
+stats = acc.value
+if not showed and stats:
+print "=" * 60
+print "Profile of RDD" % id
+print "=" * 60
+stats.sort_stats("tottime", "cumtime").print_stats()
+# mark it as showed
+self._profile_stats[i][2] = True
+
+def dump_profiles(self, path):
+""" Dump the profile stats into directory `path`
+"""
+if not os.path.exists(path):
+os.makedirs(path)
+for id, acc, _ in self._created_profiles:
--- End diff --

This should probably be `self._profile_stats`





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-24 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2351#discussion_r18006453
  
--- Diff: python/pyspark/rdd.py ---
@@ -2025,6 +2025,7 @@ class PipelinedRDD(RDD):
 >>> rdd.flatMap(lambda x: [x, x]).reduce(add)
 20
 """
+_created_profiles = []
--- End diff --

Now that you've moved the other functions, I think you need to move this to 
SparkContext.





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-56752977
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20767/consoleFull)
 for   PR 2351 at commit 
[`cba9463`](https://github.com/apache/spark/commit/cba94639fa6e5c4b2cb26f3152ea80bffaf65cce).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-24 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-56752778
  
@JoshRosen I've addressed your comments; please take another look. Thanks!





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-24 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2351#discussion_r18005737
  
--- Diff: python/pyspark/rdd.py ---
@@ -2081,8 +2085,44 @@ def _jrdd(self):
  self.ctx.pythonExec,
  broadcast_vars, 
self.ctx._javaAccumulator)
 self._jrdd_val = python_rdd.asJavaRDD()
+
+if enable_profile:
+self._id = self._jrdd_val.id()
+if not self._created_profiles:
+dump_path = self.ctx._conf.get("spark.python.profile.dump")
+if dump_path:
+atexit.register(PipelinedRDD.dump_profile, dump_path)
+else:
+atexit.register(PipelinedRDD.show_profile)
+self._created_profiles.append((self._id, profileStats))
+
 return self._jrdd_val
 
+@classmethod
+def show_profile(cls):
--- End diff --

To avoid potential confusion, what do you think about having `show_profile` 
and `dump_profile` throw exceptions if `spark.python.profile` is false?
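
A minimal sketch of that guard, written in Python 3 for runnability (the code under review is Python 2). It assumes the configuration is a plain dict-like object; `ProfilingDisabledError`, `require_profiling`, and this standalone `show_profiles` are hypothetical names for illustration, not PySpark API.

```python
class ProfilingDisabledError(RuntimeError):
    """Hypothetical exception type; not part of PySpark."""


def require_profiling(conf):
    """Raise unless `spark.python.profile` is set to true in a dict-like conf."""
    enabled = str(conf.get("spark.python.profile", "false")).lower() == "true"
    if not enabled:
        raise ProfilingDisabledError(
            "spark.python.profile is false; enable it before showing "
            "or dumping profiles")


def show_profiles(conf, reports):
    """Print each report, failing loudly if profiling was never enabled."""
    require_profiling(conf)
    for report in reports:
        print(report)
    return len(reports)
```

Failing loudly here would tell the user why nothing was printed, rather than silently showing an empty result.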





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-24 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2351#discussion_r18005654
  
--- Diff: python/pyspark/rdd.py ---
@@ -2081,8 +2085,44 @@ def _jrdd(self):
  self.ctx.pythonExec,
  broadcast_vars, 
self.ctx._javaAccumulator)
 self._jrdd_val = python_rdd.asJavaRDD()
+
+if enable_profile:
+self._id = self._jrdd_val.id()
+if not self._created_profiles:
+dump_path = self.ctx._conf.get("spark.python.profile.dump")
+if dump_path:
+atexit.register(PipelinedRDD.dump_profile, dump_path)
+else:
+atexit.register(PipelinedRDD.show_profile)
+self._created_profiles.append((self._id, profileStats))
+
 return self._jrdd_val
 
+@classmethod
+def show_profile(cls):
--- End diff --

That seems fine to me.





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-24 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/2351#discussion_r18002615
  
--- Diff: python/pyspark/rdd.py ---
@@ -2081,8 +2085,44 @@ def _jrdd(self):
  self.ctx.pythonExec,
  broadcast_vars, 
self.ctx._javaAccumulator)
 self._jrdd_val = python_rdd.asJavaRDD()
+
+if enable_profile:
+self._id = self._jrdd_val.id()
+if not self._created_profiles:
+dump_path = self.ctx._conf.get("spark.python.profile.dump")
+if dump_path:
+atexit.register(PipelinedRDD.dump_profile, dump_path)
+else:
+atexit.register(PipelinedRDD.show_profile)
+self._created_profiles.append((self._id, profileStats))
+
 return self._jrdd_val
 
+@classmethod
+def show_profile(cls):
--- End diff --

How about putting it in SparkContext?





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-24 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2351#discussion_r17993979
  
--- Diff: python/pyspark/rdd.py ---
@@ -2081,8 +2085,44 @@ def _jrdd(self):
  self.ctx.pythonExec,
  broadcast_vars, 
self.ctx._javaAccumulator)
 self._jrdd_val = python_rdd.asJavaRDD()
+
+if enable_profile:
+self._id = self._jrdd_val.id()
+if not self._created_profiles:
+dump_path = self.ctx._conf.get("spark.python.profile.dump")
+if dump_path:
+atexit.register(PipelinedRDD.dump_profile, dump_path)
+else:
+atexit.register(PipelinedRDD.show_profile)
+self._created_profiles.append((self._id, profileStats))
+
 return self._jrdd_val
 
+@classmethod
+def show_profile(cls):
+""" Print the profile stats to stdout """
+for id, acc in cls._created_profiles:
+stats = acc.value
+if stats:
+print "=" * 60
+print "Profile of RDD" % id
+print "=" * 60
+stats.sort_stats("tottime", "cumtime").print_stats()
+cls._created_profiles = []
--- End diff --

Should we document that this clears the created profiles?  I guess the 
intended usage here is to run a bunch of code interactively then print the 
profiling data for everything that's run since the last time I called 
`show_profile`.
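
One way to get that "only what's new since the last call" behavior without discarding the collected entries is to keep a per-entry flag. The sketch below is Python 3 and illustrative only; `ProfileRegistry` and its methods are hypothetical names, and the reports are plain strings standing in for real profile stats.

```python
class ProfileRegistry(object):
    """Hypothetical holder for per-RDD profile reports (not a PySpark class)."""

    def __init__(self):
        # Each entry is [rdd_id, report, shown]; a list so the flag can be flipped.
        self._entries = []

    def add(self, rdd_id, report):
        self._entries.append([rdd_id, report, False])

    def show_new(self):
        """Print reports not shown before and return their RDD ids."""
        shown_ids = []
        for entry in self._entries:
            rdd_id, report, shown = entry
            if not shown and report:
                print("=" * 60)
                print("Profile of RDD<id=%s>" % rdd_id)
                print("=" * 60)
                print(report)
                entry[2] = True  # mark it as shown
                shown_ids.append(rdd_id)
        return shown_ids
```

Calling `show_new()` twice in a row prints nothing the second time, which is the behavior the docstring would need to spell out.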





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-24 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-56723672
  
I really like this approach to profiling; it's a very clever use of 
accumulators.  My only feedback concerns UX / UI issues (see a few comments 
above RE: configuration options and docs).
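
For readers unfamiliar with the idea, here is a rough standalone sketch of how per-task profile stats can be merged the way an accumulator merges values: a zero value plus an in-place add. It uses only the standard library (`cProfile`, `pstats`) and Python 3; `StatsMerger`, `profile_call`, and `busy` are hypothetical names and are not the PR's `PStatsParam` implementation.

```python
import cProfile
import pstats


def profile_call(func, *args):
    """Run func under cProfile and return (result, pstats.Stats)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    return result, pstats.Stats(profiler)


class StatsMerger(object):
    """Hypothetical stand-in for an accumulator param: zero value + in-place add."""

    @staticmethod
    def zero():
        return None  # "no stats collected yet"

    @staticmethod
    def add_in_place(left, right):
        if left is None:
            return right
        if right is not None:
            left.add(right)  # pstats.Stats.add merges call counts and timings
        return left


def busy(n):
    return sum(i * i for i in range(n))


# Simulate three tasks, each profiled separately, then merged on the "driver".
merged = StatsMerger.zero()
for size in (1000, 2000, 3000):
    _, stats = profile_call(busy, size)
    merged = StatsMerger.add_in_place(merged, stats)
```

After the loop, `merged.sort_stats("tottime", "cumtime").print_stats()` would print a single combined report for all three runs.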





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-24 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2351#discussion_r17993687
  
--- Diff: docs/configuration.md ---
@@ -207,6 +207,22 @@ Apart from these, the following properties are also 
available, and may be useful
   
 
 
+  spark.python.profile
+  false
+  
+Enable profiling in Python worker, the profile result will show up by 
`rdd.show_profile()`,
+or it will show up before the driver exit. It also can be dumped into 
disk by
+`rdd.dump_profile(path)`.
+  
+
+
+  spark.python.profile.dump
+  (none)
+  
+The directory which is used to dump the profile result. The results 
will be dumped
+as sepereted file for each RDD. They can be loaded by ptats.Stats().
--- End diff --

Maybe the docs here could be a little more explicit about how setting this 
option enables automatic profile dumping when the driver exits and disables 
automatic printing.
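
To make the intended behavior concrete, the choice between the two exit hooks can be sketched as below (Python 3). This assumes a dict-like conf and treats `spark.python.profile` as a "true"/"false" string; `pick_exit_hook` is a hypothetical helper, and the caller would pass the result to `atexit.register`.

```python
def pick_exit_hook(conf, show, dump):
    """Return (callable, args) to run at driver exit, or None if profiling is off.

    If `spark.python.profile.dump` is set, dumping replaces automatic printing.
    """
    if str(conf.get("spark.python.profile", "false")).lower() != "true":
        return None
    dump_path = conf.get("spark.python.profile.dump")
    if dump_path:
        return dump, (dump_path,)
    return show, ()


def example_show():
    return "shown"


def example_dump(path):
    return "dumped to " + path


# With a dump directory configured, the dump hook wins over printing.
hook = pick_exit_hook(
    {"spark.python.profile": "true", "spark.python.profile.dump": "/tmp/profiles"},
    example_show, example_dump)
```

Here `hook` is `(example_dump, ("/tmp/profiles",))`; a real driver would call `atexit.register(hook[0], *hook[1])`.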





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-24 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2351#discussion_r17993579
  
--- Diff: docs/configuration.md ---
@@ -207,6 +207,22 @@ Apart from these, the following properties are also 
available, and may be useful
   
 
 
+  spark.python.profile
+  false
+  
+Enable profiling in Python worker, the profile result will show up by 
`rdd.show_profile()`,
+or it will show up before the driver exit. It also can be dumped into 
disk by
+`rdd.dump_profile(path)`.
+  
+
+
+  spark.python.profile.dump
+  (none)
+  
+The directory which is used to dump the profile result. The results 
will be dumped
+as sepereted file for each RDD. They can be loaded by ptats.Stats().
--- End diff --

Typo: "sepereted" -> "a separate".

It looks like this `spark.python.profile.dump` is only used for dumping 
files when the job exits?
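
For reference, the "can be loaded by" part of the docs presumably refers to Python's `pstats.Stats` (the quoted text spells it `ptats.Stats()`). A small standalone round trip, in Python 3 and using only the standard library, might look like this; the file name `rdd_1.pstats` and the helper `dump_profile_to` are made up for illustration and are not the naming the PR uses.

```python
import cProfile
import os
import pstats
import tempfile


def dump_profile_to(func, directory, name):
    """Profile func() and write the raw stats to directory/name."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    path = os.path.join(directory, name)
    profiler.dump_stats(path)
    return path


def work():
    return sorted(range(1000), reverse=True)


out_dir = tempfile.mkdtemp()
path = dump_profile_to(work, out_dir, "rdd_1.pstats")

# Load the dumped file later, e.g. in a separate analysis session.
loaded = pstats.Stats(path)
loaded.sort_stats("tottime", "cumtime")
```

Calling `loaded.print_stats()` afterwards prints the same kind of table that the automatic display would have shown.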





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-24 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2351#discussion_r17992966
  
--- Diff: python/pyspark/rdd.py ---
@@ -2081,8 +2085,44 @@ def _jrdd(self):
  self.ctx.pythonExec,
  broadcast_vars, 
self.ctx._javaAccumulator)
 self._jrdd_val = python_rdd.asJavaRDD()
+
+if enable_profile:
+self._id = self._jrdd_val.id()
+if not self._created_profiles:
+dump_path = self.ctx._conf.get("spark.python.profile.dump")
+if dump_path:
+atexit.register(PipelinedRDD.dump_profile, dump_path)
+else:
+atexit.register(PipelinedRDD.show_profile)
+self._created_profiles.append((self._id, profileStats))
+
 return self._jrdd_val
 
+@classmethod
+def show_profile(cls):
+""" Print the profile stats to stdout """
+for id, acc in cls._created_profiles:
+stats = acc.value
+if stats:
+print "=" * 60
+print "Profile of RDD" % id
+print "=" * 60
+stats.sort_stats("tottime", "cumtime").print_stats()
+cls._created_profiles = []
+
+@classmethod
+def dump_profile(cls, dump_path):
--- End diff --

Maybe this should be `dump_profiles` plural or `dump_profiling_data`?





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-24 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2351#discussion_r17992850
  
--- Diff: python/pyspark/rdd.py ---
@@ -2081,8 +2085,44 @@ def _jrdd(self):
  self.ctx.pythonExec,
  broadcast_vars, 
self.ctx._javaAccumulator)
 self._jrdd_val = python_rdd.asJavaRDD()
+
+if enable_profile:
+self._id = self._jrdd_val.id()
+if not self._created_profiles:
+dump_path = self.ctx._conf.get("spark.python.profile.dump")
+if dump_path:
+atexit.register(PipelinedRDD.dump_profile, dump_path)
+else:
+atexit.register(PipelinedRDD.show_profile)
+self._created_profiles.append((self._id, profileStats))
+
 return self._jrdd_val
 
+@classmethod
+def show_profile(cls):
--- End diff --

What do you think about introducing a new class to hold the PySpark 
profiling methods?  We could call it something like `Profiling` and put it in 
`profiling.py`.

It seems a little weird to allow `show_profile()` to be called on a 
particular RDD instance when the method prints all created profiles.





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-56605482
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20730/consoleFull)
 for   PR 2351 at commit 
[`fb9565b`](https://github.com/apache/spark/commit/fb9565b2afdd7fbaa1cc6cf4b1971fba2d9919b0).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class PStatsParam(AccumulatorParam):`






[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-56605488
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20730/





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-56599152
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20730/consoleFull)
 for   PR 2351 at commit 
[`fb9565b`](https://github.com/apache/spark/commit/fb9565b2afdd7fbaa1cc6cf4b1971fba2d9919b0).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-55938151
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20469/consoleFull)
 for   PR 2351 at commit 
[`116d52a`](https://github.com/apache/spark/commit/116d52a1251140282a2cd5c49ad928b219c759b5).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class PStatsParam(AccumulatorParam):`






[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-55928227
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20469/consoleFull)
 for   PR 2351 at commit 
[`116d52a`](https://github.com/apache/spark/commit/116d52a1251140282a2cd5c49ad928b219c759b5).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-55516221
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20297/consoleFull)
 for   PR 2351 at commit 
[`09d02c3`](https://github.com/apache/spark/commit/09d02c3349659856a24e0c4ee84e3b6c5317).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class PStatsParam(AccumulatorParam):`






[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-55515166
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20297/consoleFull)
 for   PR 2351 at commit 
[`09d02c3`](https://github.com/apache/spark/commit/09d02c3349659856a24e0c4ee84e3b6c5317).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-12 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-55483390
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20252/consoleFull)
 for   PR 2351 at commit 
[`15d6f18`](https://github.com/apache/spark/commit/15d6f18fd97422ff7bebf343383b7eca9ef433bc).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class TaskCompletionListenerException(errorMessages: Seq[String]) 
extends Exception `
  * `class PStatsParam(AccumulatorParam):`






[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-12 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-55482333
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20253/consoleFull)
 for   PR 2351 at commit 
[`c23865c`](https://github.com/apache/spark/commit/c23865c6307963f97420d9213d6fb26ab0163f0d).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class TaskCompletionListenerException(errorMessages: Seq[String]) 
extends Exception `
  * `class PStatsParam(AccumulatorParam):`






[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-12 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-55482315
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20253/consoleFull)
 for   PR 2351 at commit 
[`c23865c`](https://github.com/apache/spark/commit/c23865c6307963f97420d9213d6fb26ab0163f0d).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-12 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-55482240
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20252/consoleFull)
 for   PR 2351 at commit 
[`15d6f18`](https://github.com/apache/spark/commit/15d6f18fd97422ff7bebf343383b7eca9ef433bc).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-12 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-55482196
  
@JoshRosen I have addressed your comment, and also added docs for the configs
and tests.

I realized that the profile results can also be shown interactively with
rdd.show_profile(); I have updated the PR description to cover this.





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-12 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-55481451
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20250/consoleFull) for PR 2351 at commit [`4f8309d`](https://github.com/apache/spark/commit/4f8309d7d8df18fb5f4da1d9f150d7606bf650c9).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class PStatsParam(AccumulatorParam):`






[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-12 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-55480246
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20250/consoleFull) for PR 2351 at commit [`4f8309d`](https://github.com/apache/spark/commit/4f8309d7d8df18fb5f4da1d9f150d7606bf650c9).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-12 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2351#discussion_r17509328
  
--- Diff: python/pyspark/accumulators.py ---
@@ -215,6 +215,21 @@ def addInPlace(self, value1, value2):
 COMPLEX_ACCUMULATOR_PARAM = AddingAccumulatorParam(0.0j)
 
 
+class StatsParam(AccumulatorParam):
--- End diff --

Do you think it would be clearer to name this `ProfilingStatsParam` or 
`PStatsParam`?





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-55224983
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/66/consoleFull) for PR 2351 at commit [`0a5b6eb`](https://github.com/apache/spark/commit/0a5b6ebcd38f13fa15721c56a9d96bd9000529f5).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class StatsParam(AccumulatorParam):`






[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-55220913
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/66/consoleFull) for PR 2351 at commit [`0a5b6eb`](https://github.com/apache/spark/commit/0a5b6ebcd38f13fa15721c56a9d96bd9000529f5).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-55219072
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20135/consoleFull) for PR 2351 at commit [`0a5b6eb`](https://github.com/apache/spark/commit/0a5b6ebcd38f13fa15721c56a9d96bd9000529f5).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class StatsParam(AccumulatorParam):`






[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-55216348
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20135/consoleFull) for PR 2351 at commit [`0a5b6eb`](https://github.com/apache/spark/commit/0a5b6ebcd38f13fa15721c56a9d96bd9000529f5).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-55212763
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20128/consoleFull) for PR 2351 at commit [`4b20494`](https://github.com/apache/spark/commit/4b20494ce4e5e287a09fee5df5e0684711258627).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class StatsParam(AccumulatorParam):`






[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-55211000
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/48/consoleFull) for PR 2351 at commit [`4b20494`](https://github.com/apache/spark/commit/4b20494ce4e5e287a09fee5df5e0684711258627).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class StatsParam(AccumulatorParam):`






[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-55209533
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20128/consoleFull) for PR 2351 at commit [`4b20494`](https://github.com/apache/spark/commit/4b20494ce4e5e287a09fee5df5e0684711258627).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2351#issuecomment-55207154
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/48/consoleFull) for PR 2351 at commit [`4b20494`](https://github.com/apache/spark/commit/4b20494ce4e5e287a09fee5df5e0684711258627).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-10 Thread davies
GitHub user davies opened a pull request:

https://github.com/apache/spark/pull/2351

[SPARK-3478] [PySpark] Profile the Python tasks

This patch adds profiling support for PySpark. It shows the profiling results
before the driver exits. Here is one example:

```

Profile of RDD

         5146507 function calls (5146487 primitive calls) in 71.094 seconds

   Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  5144576   68.331    0.000   68.331    0.000 statcounter.py:44(merge)
       20    2.735    0.137   71.071    3.554 statcounter.py:33(__init__)
       20    0.017    0.001    0.017    0.001 {cPickle.dumps}
     1024    0.003    0.000    0.003    0.000 t.py:16()
       20    0.001    0.000    0.001    0.000 {reduce}
       21    0.001    0.000    0.001    0.000 {cPickle.loads}
       20    0.001    0.000    0.001    0.000 copy_reg.py:95(_slotnames)
       41    0.001    0.000    0.001    0.000 serializers.py:461(read_int)
       40    0.001    0.000    0.002    0.000 serializers.py:179(_batched)
       62    0.000    0.000    0.000    0.000 {method 'read' of 'file' objects}
       20    0.000    0.000   71.072    3.554 rdd.py:863()
       20    0.000    0.000    0.001    0.000 serializers.py:198(load_stream)
    40/20    0.000    0.000   71.072    3.554 rdd.py:2093(pipeline_func)
       41    0.000    0.000    0.002    0.000 serializers.py:130(load_stream)
       40    0.000    0.000   71.072    1.777 rdd.py:304(func)
       20    0.000    0.000   71.094    3.555 worker.py:82(process)
       40    0.000    0.000    0.001    0.000 rdd.py:741(func)
       20    0.000    0.000    0.018    0.001 serializers.py:137(_write_with_length)
       20    0.000    0.000    0.020    0.001 serializers.py:195(dump_stream)
       20    0.000    0.000    0.000    0.000 serializers.py:201(_load_stream_without_unbatching)
       20    0.000    0.000    0.000    0.000 {hasattr}
       41    0.000    0.000    0.002    0.000 serializers.py:145(_read_with_length)
       40    0.000    0.000    0.000    0.000 {built-in method from_iterable}
       20    0.000    0.000    0.000    0.000 serializers.py:468(write_int)
       20    0.000    0.000    0.018    0.001 serializers.py:355(dumps)
       20    0.000    0.000    0.020    0.001 serializers.py:126(dump_stream)
       20    0.000    0.000    0.000    0.000 {method 'get' of 'dictproxy' objects}
       20    0.000    0.000    0.000    0.000 rdd.py:291(func)
       40    0.000    0.000    0.000    0.000 {method 'write' of 'file' objects}
       20    0.000    0.000    0.000    0.000 {_struct.pack}
       21    0.000    0.000    0.000    0.000 {_struct.unpack}
       20    0.000    0.000    0.000    0.000 {iter}
       20    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
       20    0.000    0.000    0.000    0.000 {len}
       20    0.000    0.000    0.000    0.000 {next}
       20    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
```

Profiling is disabled by default; it can be enabled by setting
"spark.python.profile=true".

Users can also dump the results to disk for later analysis by setting
"spark.python.profile.dump=path_to_dump".

PS: will update the docs later.
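
For readers who want to see how per-task profiles like the one above can be
combined, dumped and analysed later, here is a minimal sketch that uses only
the standard library (cProfile and pstats). It is not the code from this
patch: the helper names profile_task, merge_stats and busy, and the dump file
name, are hypothetical, and it assumes the dumped files are ordinary pstats
dumps.

```python
import cProfile
import io
import os
import pstats
import tempfile


def profile_task(func, *args):
    """Run func under cProfile and return its statistics as a pstats.Stats."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    return pstats.Stats(profiler)


def merge_stats(total, new):
    """Fold one task's statistics into the running total (None means empty)."""
    if total is None:
        return new
    total.add(new)
    return total


def busy(n):
    # Hypothetical stand-in for the work done by one Python task.
    return sum(i * i for i in range(n))


# Profile a few "tasks" and merge their statistics into one result.
merged = None
for size in (1000, 2000, 3000):
    merged = merge_stats(merged, profile_task(busy, size))

# Dump the merged result to disk, then reload it for later analysis.
dump_path = os.path.join(tempfile.mkdtemp(), "rdd_profile.pstats")
merged.dump_stats(dump_path)

report = io.StringIO()
loaded = pstats.Stats(dump_path, stream=report)
# Same ordering as the example output: internal time, then cumulative time.
loaded.sort_stats("time", "cumulative").print_stats(5)
print(report.getvalue())
```

Merging with Stats.add keeps a single table per job rather than one per task,
which is why ncalls in a merged report counts calls across all tasks.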

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/davies/spark profiler

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2351.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2351


commit 4b20494ce4e5e287a09fee5df5e0684711258627
Author: Davies Liu 
Date:   2014-09-11T00:51:28Z

add profile for python



