[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...

2015-09-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/8569


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...

2015-09-21 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/8569#issuecomment-142190287
  
LGTM, will merge into master, thanks!





[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...

2015-09-21 Thread holdenk
Github user holdenk commented on the pull request:

https://github.com/apache/spark/pull/8569#issuecomment-142189285
  
@davies this should now work in the other places which are using cogroup 
under the hood.





[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...

2015-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8569#issuecomment-139938712
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...

2015-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8569#issuecomment-139938713
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42389/





[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...

2015-09-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8569#issuecomment-139938669
  
[Test build #42389 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42389/console) for PR 8569 at commit [`fe3ea4f`](https://github.com/apache/spark/commit/fe3ea4fca2b90ec8de2bfccbe02730e782c79447).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...

2015-09-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8569#issuecomment-139935515
  
[Test build #42389 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42389/consoleFull) for PR 8569 at commit [`fe3ea4f`](https://github.com/apache/spark/commit/fe3ea4fca2b90ec8de2bfccbe02730e782c79447).





[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...

2015-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8569#issuecomment-139934680
  
Merged build started.





[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...

2015-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8569#issuecomment-139934674
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...

2015-09-10 Thread holdenk
Github user holdenk commented on the pull request:

https://github.com/apache/spark/pull/8569#issuecomment-139401041
  
@davies That sounds like a good plan, I'll expand the JIRA & this PR over 
the weekend and ping you when it's done :)





[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...

2015-09-10 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/8569#issuecomment-139387340
  
@holdenk Almost all the APIs in PairRDDFunctions take an optional 
Partitioner; should we add this for all of them in Python?





[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...

2015-09-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8569#issuecomment-136968303
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41922/





[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...

2015-09-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8569#issuecomment-136968300
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...

2015-09-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8569#issuecomment-136968260
  
[Test build #41922 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41922/console) for PR 8569 at commit [`8d272b3`](https://github.com/apache/spark/commit/8d272b3bf84a72c66c1529d2679d465038435f83).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...

2015-09-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8569#issuecomment-136962771
  
[Test build #41922 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41922/consoleFull) for PR 8569 at commit [`8d272b3`](https://github.com/apache/spark/commit/8d272b3bf84a72c66c1529d2679d465038435f83).





[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...

2015-09-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8569#issuecomment-136961657
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...

2015-09-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8569#issuecomment-136961674
  
Merged build started.





[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...

2015-09-02 Thread holdenk
GitHub user holdenk opened a pull request:

https://github.com/apache/spark/pull/8569

[SPARK-9821][PYSPARK] pyspark-reduceByKey-should-take-a-custom-partitioner

In Scala, I can supply a custom partitioner to reduceByKey (and to other 
aggregation/repartitioning methods such as aggregateByKey and combineByKey), 
but as far as I can tell from the PySpark API, there's no way to do the same 
in Python.
Here's an example of my code in Scala:
weblogs.map(s => (getFileType(s), 1)).reduceByKey(new FileTypePartitioner(), _+_)
But I can't figure out how to do the same in Python. The closest I can get 
is to repartition with partitionBy before calling reduceByKey, like so:
weblogs.map(lambda s: (getFileType(s), 1)).partitionBy(3, hash_filetype).reduceByKey(lambda v1, v2: v1+v2).collect()
But that defeats the purpose, because I'm shuffling twice instead of once, 
so my performance is worse instead of better.
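
To make the request concrete, here is a minimal pure-Python sketch (no Spark 
required) of what a reduceByKey that accepts a partition function does: each 
record is routed exactly once to the partition chosen for its key and merged 
there. The helpers get_file_type, hash_filetype and reduce_by_key, as well as 
the sample log lines, are hypothetical stand-ins for illustration; they are 
not the code in this pull request.

```python
def get_file_type(line):
    """Hypothetical helper: return the file extension of the requested path."""
    path = line.split()[-1]
    return path.rsplit(".", 1)[-1].lower() if "." in path else "other"


def hash_filetype(file_type):
    """Hypothetical partition function: HTML -> 0, images -> 1, the rest -> 2."""
    if file_type in ("html", "htm"):
        return 0
    if file_type in ("jpg", "png", "gif"):
        return 1
    return 2


def reduce_by_key(records, func, num_partitions, partition_func):
    """Model a single-shuffle reduceByKey.

    Every (key, value) pair is sent once to partition
    partition_func(key) % num_partitions and merged with func there.
    """
    partitions = [dict() for _ in range(num_partitions)]
    for key, value in records:
        bucket = partitions[partition_func(key) % num_partitions]
        bucket[key] = func(bucket[key], value) if key in bucket else value
    return partitions


# Made-up sample data standing in for the weblogs RDD.
weblogs = [
    "GET /index.html",
    "GET /logo.png",
    "GET /about.html",
    "GET /app.js",
    "GET /photo.jpg",
    "GET /logo.png",
]

pairs = [(get_file_type(s), 1) for s in weblogs]
partitions = reduce_by_key(pairs, lambda v1, v2: v1 + v2, 3, hash_filetype)

for index, contents in enumerate(partitions):
    print(index, contents)
```

In PySpark the equivalent call would presumably pass the partition count and 
the partition function to reduceByKey directly, along the lines of 
reduceByKey(func, 3, hash_filetype); treat that exact form as an assumption 
and check it against the merged patch.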

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/holdenk/spark SPARK-9821-pyspark-reduceByKey-should-take-a-custom-partitioner

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/8569.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #8569


commit 8d272b3bf84a72c66c1529d2679d465038435f83
Author: Holden Karau 
Date:   2015-09-02T07:27:48Z

Add partitioner function



