[GitHub] spark pull request: SPARK-4644 blockjoin
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/6883#issuecomment-148917448 sure --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-4644 blockjoin
Github user koertkuipers closed the pull request at: https://github.com/apache/spark/pull/6883
[GitHub] spark pull request: SPARK-4644 blockjoin
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/6883#issuecomment-148828836 Given that we're unlikely to add this in core right now and given that it's available in a Spark Package, would you mind closing this PR for now in order to clean up the review backlog? Thanks!
[GitHub] spark pull request: SPARK-4644 blockjoin
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/6883#issuecomment-133778896 i put this in a spark package together with skewjoin in case anyone wants to use it. see here: http://spark-packages.org/package/tresata/spark-skewjoin https://github.com/tresata/spark-skewjoin
[GitHub] spark pull request: SPARK-4644 blockjoin
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/6883#issuecomment-114354847 @rxin I remember you said you would like such an improvement to be added to Spark SQL rather than Spark Core. What are your thoughts on this one?
[GitHub] spark pull request: SPARK-4644 blockjoin
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6883#issuecomment-113670872 Merged build finished. Test FAILed.
[GitHub] spark pull request: SPARK-4644 blockjoin
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6883#issuecomment-113670853 [Test build #35333 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35333/console) for PR 6883 at commit [`a5dd71c`](https://github.com/apache/spark/commit/a5dd71c4636aa7d4c1a3acb0755736c526d5b0df). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: SPARK-4644 blockjoin
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6883#issuecomment-113660265 [Test build #35333 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35333/consoleFull) for PR 6883 at commit [`a5dd71c`](https://github.com/apache/spark/commit/a5dd71c4636aa7d4c1a3acb0755736c526d5b0df).
[GitHub] spark pull request: SPARK-4644 blockjoin
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6883#issuecomment-113660121 Merged build started.
[GitHub] spark pull request: SPARK-4644 blockjoin
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6883#issuecomment-113660077 Merged build triggered.
[GitHub] spark pull request: SPARK-4644 blockjoin
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6883#issuecomment-113630470 [Test build #35315 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35315/console) for PR 6883 at commit [`adef52e`](https://github.com/apache/spark/commit/adef52ed4c335980e73c61036abb2a2806965de3). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: SPARK-4644 blockjoin
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6883#issuecomment-113630477 Merged build finished. Test FAILed.
[GitHub] spark pull request: SPARK-4644 blockjoin
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6883#issuecomment-113630110 [Test build #35315 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35315/consoleFull) for PR 6883 at commit [`adef52e`](https://github.com/apache/spark/commit/adef52ed4c335980e73c61036abb2a2806965de3).
[GitHub] spark pull request: SPARK-4644 blockjoin
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6883#issuecomment-113629557 Merged build triggered.
[GitHub] spark pull request: SPARK-4644 blockjoin
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6883#issuecomment-113629566 Merged build started.
[GitHub] spark pull request: SPARK-4644 blockjoin
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6883#issuecomment-113284866 Merged build finished. Test FAILed.
[GitHub] spark pull request: SPARK-4644 blockjoin
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6883#issuecomment-113284862 [Test build #35168 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35168/console) for PR 6883 at commit [`6ac82cb`](https://github.com/apache/spark/commit/6ac82cb644d1c226b3fb4ea01fd122ca7b623a35). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: SPARK-4644 blockjoin
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/6883#issuecomment-113284473 @zsxwing
[GitHub] spark pull request: SPARK-4644 blockjoin
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6883#issuecomment-113284346 Merged build started.
[GitHub] spark pull request: SPARK-4644 blockjoin
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6883#issuecomment-113284422 [Test build #35168 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35168/consoleFull) for PR 6883 at commit [`6ac82cb`](https://github.com/apache/spark/commit/6ac82cb644d1c226b3fb4ea01fd122ca7b623a35).
[GitHub] spark pull request: SPARK-4644 blockjoin
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6883#issuecomment-113284330 Merged build triggered.
[GitHub] spark pull request: SPARK-4644 blockjoin
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/6883#discussion_r32776196

--- Diff: core/src/main/scala/org/apache/spark/api/java/JavaPairRDD.scala ---
@@ -515,6 +515,76 @@ class JavaPairRDD[K, V](val rdd: RDD[(K, V)])
   }

   /**
+   * Same as join, but uses a block join, otherwise known as a replicate fragment join.
+   * This is useful in cases where the data has extreme skew.
+   * The input params leftReplication and rightReplication control the replication of the left
+   * (this rdd) and right (other rdd) respectively.
+   */
+  def blockJoin[W](other: JavaPairRDD[K, W], leftReplication: Int, rightReplication: Int,
+    partitioner: Partitioner): JavaPairRDD[K, (V, W)] = {
+    fromRDD(rdd.blockJoin(other, leftReplication, rightReplication, partitioner))
+  }
+
+  /**
+   * Same as join, but uses a block join, otherwise known as a replicate fragment join.
+   * This is useful in cases where the data has extreme skew.
+   * The input params leftReplication and rightReplication control the replication of the left
+   * (this rdd) and right (other rdd) respectively.
+   */
+  def blockJoin[W](other: JavaPairRDD[K, W], leftReplication: Int, rightReplication: Int)
+  : JavaPairRDD[K, (V, W)] = {
+    fromRDD(rdd.blockJoin(other, leftReplication, rightReplication))
+  }
+
+  /**
+   * Same as leftOuterJoin, but uses a block join, otherwise known as a replicate fragment join.
+   * This is useful in cases where the data has extreme skew.
+   * The input param rightReplication controls the replication of the right (other rdd).
+   */
+  def blockLeftOuterJoin[W](other: JavaPairRDD[K, W], rightReplication: Int,
+    partitioner: Partitioner): JavaPairRDD[K, (V, Optional[W])] = {
+    fromRDD(rdd.blockLeftOuterJoin(other, rightReplication, partitioner).mapValues{ case (v, w) =>
--- End diff --

need space after `.mapValues`
[GitHub] spark pull request: SPARK-4644 blockjoin
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/6883#discussion_r32776189

--- Diff: core/src/main/scala/org/apache/spark/api/java/JavaPairRDD.scala ---
@@ -515,6 +515,76 @@ class JavaPairRDD[K, V](val rdd: RDD[(K, V)])
   }

   /**
+   * Same as join, but uses a block join, otherwise known as a replicate fragment join.
+   * This is useful in cases where the data has extreme skew.
+   * The input params leftReplication and rightReplication control the replication of the left
+   * (this rdd) and right (other rdd) respectively.
+   */
+  def blockJoin[W](other: JavaPairRDD[K, W], leftReplication: Int, rightReplication: Int,
+    partitioner: Partitioner): JavaPairRDD[K, (V, W)] = {
+    fromRDD(rdd.blockJoin(other, leftReplication, rightReplication, partitioner))
+  }
+
+  /**
+   * Same as join, but uses a block join, otherwise known as a replicate fragment join.
+   * This is useful in cases where the data has extreme skew.
+   * The input params leftReplication and rightReplication control the replication of the left
+   * (this rdd) and right (other rdd) respectively.
+   */
+  def blockJoin[W](other: JavaPairRDD[K, W], leftReplication: Int, rightReplication: Int)
+  : JavaPairRDD[K, (V, W)] = {
+    fromRDD(rdd.blockJoin(other, leftReplication, rightReplication))
+  }
+
+  /**
+   * Same as leftOuterJoin, but uses a block join, otherwise known as a replicate fragment join.
+   * This is useful in cases where the data has extreme skew.
+   * The input param rightReplication controls the replication of the right (other rdd).
+   */
+  def blockLeftOuterJoin[W](other: JavaPairRDD[K, W], rightReplication: Int,
+    partitioner: Partitioner): JavaPairRDD[K, (V, Optional[W])] = {
+    fromRDD(rdd.blockLeftOuterJoin(other, rightReplication, partitioner).mapValues{ case (v, w) =>
+      (v, JavaUtils.optionToOptional(w))
+    })
--- End diff --

need space after `.mapValues`
[GitHub] spark pull request: SPARK-4644 blockjoin
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/6883#discussion_r32776123

--- Diff: core/src/main/scala/org/apache/spark/api/java/JavaPairRDD.scala ---
@@ -515,6 +515,76 @@ class JavaPairRDD[K, V](val rdd: RDD[(K, V)])
   }

   /**
+   * Same as join, but uses a block join, otherwise known as a replicate fragment join.
+   * This is useful in cases where the data has extreme skew.
+   * The input params leftReplication and rightReplication control the replication of the left
+   * (this rdd) and right (other rdd) respectively.
+   */
+  def blockJoin[W](other: JavaPairRDD[K, W], leftReplication: Int, rightReplication: Int,
+    partitioner: Partitioner): JavaPairRDD[K, (V, W)] = {
--- End diff --

style:
```
def blockJoin[W](
    other: JavaPairRDD[K, W],
    leftReplication: Int,
    ...): JavaPairRDD[...] = {
  ...
}
```
[GitHub] spark pull request: SPARK-4644 blockjoin
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/6883#issuecomment-113283813 add to whitelist
[GitHub] spark pull request: SPARK-4644 blockjoin
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/6883

SPARK-4644 blockjoin

Although the discussion (and design doc) under SPARK-4644 seem focussed on other aspects of skew (OOM mostly) than this pullreq (which focusses on avoiding a single reducer taking a long time), i decided to put this pullreq under SPARK-4644 anyhow, to avoid the proliferation of JIRA tickets. If this is not the right place let me know and i will move it.

Inspired by block join in scalding. From scalding docs:

This is useful in cases where the data has extreme skew. A symptom of this is that we may see a job stuck for a very long time on a small number of reducers.

A block join is a way to get around this: we add a random integer field and a replica field to every tuple in the left and right pipes. We then join on the original keys and on these new dummy fields. These dummy fields make it less likely that the skewed keys will be hashed to the same reducer.

The final data size is right * rightReplication + left * leftReplication but because of the fragmentation, we are guaranteed the same number of hits as the original join.
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tresata/spark feat-blockjoin

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/6883.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #6883

commit 77d8fee6ad7ba5f83eb0c82b7f1625e2206a5446
Author: Koert Kuipers
Date: 2015-06-17T20:35:18Z

    add blockJoin, blockLeftOuterJoin and blockRightOuterJoin to spark core

commit d1fd3e020812c72c44a6461d9c94065e2784cdbb
Author: Koert Kuipers
Date: 2015-06-17T23:48:43Z

    correct scaladocs for block join functions

commit 2114df748f62b53155d7db5524e163504cead228
Author: Koert Kuipers
Date: 2015-06-18T03:36:21Z

    add block joins to java api
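The replicate fragment idea quoted from the scalding docs in the pull request description can be sketched on plain Scala collections, with no Spark dependency. This is only an illustration under stated assumptions, not the code from the pull request: the object name `BlockJoinSketch`, the `BlockKey` alias, the `rng` parameter and the exact way the two dummy fields are assigned are all hypothetical. Each left tuple draws one random slot in `[0, rightReplication)` and is copied `leftReplication` times; each right tuple does the mirror image. Any matching left/right pair then agrees on both dummy fields in exactly one replica, which is why the number of hits equals that of an ordinary join.

```scala
import scala.util.Random

object BlockJoinSketch {
  // Composite join key: the original key plus the two dummy fields (hypothetical layout).
  type BlockKey[K] = (K, Int, Int)

  def blockJoin[K, V, W](
      left: Seq[(K, V)],
      right: Seq[(K, W)],
      leftReplication: Int,
      rightReplication: Int,
      rng: Random = new Random(42)): Seq[(K, (V, W))] = {
    require(leftReplication >= 1 && rightReplication >= 1, "replication counts must be >= 1")

    // Left side: one random slot in [0, rightReplication), copied once per left replica index.
    val newLeft: Seq[(BlockKey[K], V)] = left.flatMap { case (k, v) =>
      val slot = rng.nextInt(rightReplication)
      (0 until leftReplication).map(rep => ((k, slot, rep), v))
    }

    // Right side: mirrored. One random slot in [0, leftReplication), copied once per right replica index.
    val newRight: Seq[(BlockKey[K], W)] = right.flatMap { case (k, w) =>
      val slot = rng.nextInt(leftReplication)
      (0 until rightReplication).map(rep => ((k, rep, slot), w))
    }

    // Ordinary hash join on the composite key, then drop the dummy fields.
    val rightByKey: Map[BlockKey[K], Seq[W]] =
      newRight.groupBy(_._1).map { case (bk, rows) => bk -> rows.map(_._2) }

    for {
      (bk, v) <- newLeft
      w <- rightByKey.getOrElse(bk, Seq.empty)
    } yield (bk._1, (v, w))
  }
}
```

In this sketch the intermediate data grows to left * leftReplication + right * rightReplication tuples, matching the size stated in the description, while a skewed key is spread over up to leftReplication * rightReplication composite keys instead of one.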
[GitHub] spark pull request: SPARK-4644 blockjoin
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6883#issuecomment-113178842 Can one of the admins verify this patch?