[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/8569 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/8569#issuecomment-142190287 LGTM, will merge into master, thanks!
[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...
Github user holdenk commented on the pull request: https://github.com/apache/spark/pull/8569#issuecomment-142189285 @davies this should now work in the other places which are using cogroup under the hood.
[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8569#issuecomment-139938712 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8569#issuecomment-139938713 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42389/ Test PASSed.
[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8569#issuecomment-139938669 [Test build #42389 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42389/console) for PR 8569 at commit [`fe3ea4f`](https://github.com/apache/spark/commit/fe3ea4fca2b90ec8de2bfccbe02730e782c79447). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8569#issuecomment-139935515 [Test build #42389 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42389/consoleFull) for PR 8569 at commit [`fe3ea4f`](https://github.com/apache/spark/commit/fe3ea4fca2b90ec8de2bfccbe02730e782c79447).
[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8569#issuecomment-139934680 Merged build started.
[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8569#issuecomment-139934674 Merged build triggered.
[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...
Github user holdenk commented on the pull request: https://github.com/apache/spark/pull/8569#issuecomment-139401041 @davies That sounds like a good plan, I'll expand the JIRA & this PR over the weekend and ping you when it's done :)
[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/8569#issuecomment-139387340 @holdenk Almost all the APIs in PairRDDFunctions take an optional Partitioner, should we add this for all of them in Python?
[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8569#issuecomment-136968303 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41922/ Test FAILed.
[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8569#issuecomment-136968300 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8569#issuecomment-136968260 [Test build #41922 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41922/console) for PR 8569 at commit [`8d272b3`](https://github.com/apache/spark/commit/8d272b3bf84a72c66c1529d2679d465038435f83). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8569#issuecomment-136962771 [Test build #41922 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41922/consoleFull) for PR 8569 at commit [`8d272b3`](https://github.com/apache/spark/commit/8d272b3bf84a72c66c1529d2679d465038435f83).
[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8569#issuecomment-136961657 Merged build triggered.
[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8569#issuecomment-136961674 Merged build started.
[GitHub] spark pull request: [SPARK-9821][PYSPARK] pyspark-reduceByKey-shou...
GitHub user holdenk opened a pull request:

    https://github.com/apache/spark/pull/8569

    [SPARK-9821][PYSPARK] pyspark-reduceByKey-should-take-a-custom-partitioner

    In Scala, I can supply a custom partitioner to reduceByKey (and other aggregation/repartitioning methods like aggregateByKey and combineByKey), but as far as I can tell from the PySpark API, there's no way to do the same in Python.

    Here's an example of my code in Scala:

        weblogs.map(s => (getFileType(s), 1)).reduceByKey(new FileTypePartitioner(), _ + _)

    But I can't figure out how to do the same in Python. The closest I can get is to repartition with partitionBy before calling reduceByKey, like so:

        weblogs.map(lambda s: (getFileType(s), 1)).partitionBy(3, hash_filetype).reduceByKey(lambda v1, v2: v1 + v2).collect()

    But that defeats the purpose, because I'm shuffling twice instead of once, so my performance is worse instead of better.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/holdenk/spark SPARK-9821-pyspark-reduceByKey-should-take-a-custom-partitioner

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8569.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #8569

commit 8d272b3bf84a72c66c1529d2679d465038435f83
Author: Holden Karau
Date: 2015-09-02T07:27:48Z

    Add partitioner function
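The request in the description comes down to letting the reduce step choose the target partition itself, so that one shuffle is enough. The block below is a minimal pure-Python sketch of that idea, not Spark code and not the patch in this PR. The bodies of get_file_type and hash_filetype, the sample log lines, and the reduce_by_key helper are all hypothetical stand-ins chosen for illustration.

```python
def get_file_type(line):
    """Hypothetical stand-in for getFileType: the extension of the requested path."""
    path = line.split()[0]
    return path.rsplit(".", 1)[-1].lower() if "." in path else "other"


def hash_filetype(key):
    """Hypothetical partition function: images -> 0, pages -> 1, everything else -> 2."""
    if key in ("jpg", "png", "gif"):
        return 0
    if key in ("html", "htm"):
        return 1
    return 2


def reduce_by_key(partitions, func, num_partitions, partition_func):
    """Model of a reduce-by-key that moves data once, routed by partition_func."""
    # Map side: combine values per key inside each input partition (no data movement).
    combined = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        combined.append(local)

    # Single "shuffle": send each locally combined pair to the partition chosen
    # by partition_func and merge it with whatever already arrived there.
    out = [dict() for _ in range(num_partitions)]
    for local in combined:
        for key, value in local.items():
            target = out[partition_func(key) % num_partitions]
            target[key] = func(target[key], value) if key in target else value
    return out


# Two input partitions of made-up web log lines.
weblogs = [
    ["/index.html 200", "/logo.png 200"],
    ["/about.html 200", "/logo.png 304", "/data.csv 200"],
]
pairs = [[(get_file_type(s), 1) for s in part] for part in weblogs]
result = reduce_by_key(pairs, lambda v1, v2: v1 + v2, 3, hash_filetype)
print(result)
```

In this model the partition function is applied only once, to the already combined pairs, which is the behaviour the description asks for. The partitionBy-then-reduceByKey workaround applies one partitioning step and then a second one inside the reduce.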