[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user rxin closed the pull request at: https://github.com/apache/spark/pull/1469 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1469#issuecomment-49379863 Merging in master.
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1469#issuecomment-49376423

QA results for PR 1469:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16791/consoleFull
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1469#issuecomment-49373309 QA tests have started for PR 1469. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16791/consoleFull
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1469#issuecomment-49373136 Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1469#issuecomment-49353170 QA tests have started for PR 1469. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16787/consoleFull
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/1469

[SPARK-2534] Avoid pulling in the entire RDD in various operators (branch-1.0 backport)

This backports #1450 into branch-1.0.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rxin/spark closure-1.0

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1469.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1469

commit b474a92d0cf051c6dc67ddfcc7423427ccd69020
Author: Reynold Xin
Date: 2014-07-17T19:25:56Z

    [SPARK-2534] Avoid pulling in the entire RDD in various operators
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1450
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1450#issuecomment-49350307 Merged in master.
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1450#issuecomment-49276870

QA results for PR 1450:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16770/consoleFull
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1450#issuecomment-49269144 I created a JIRA to deal with this and did some initial exploration, but I think I'll need to wait for Prashant to actually do it: https://issues.apache.org/jira/browse/SPARK-2549
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1450#issuecomment-49266828 QA tests have started for PR 1450. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16770/consoleFull
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1450#discussion_r15043414

--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -214,7 +214,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
 throw new SparkException("reduceByKeyLocally() does not support array keys")
 }
-def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] = {
+val reducePartition = (iter: Iterator[(K, V)]) => {
--- End diff --

This is fixed.
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1450#issuecomment-49261722 Eh the binary checker is really failing me. Is there a way to disable binary checker for inner functions? @pwendell
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1450#issuecomment-49261240

QA results for PR 1450:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16765/consoleFull
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1450#issuecomment-49256830 QA tests have started for PR 1450. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16765/consoleFull
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1450#discussion_r15040336

--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -214,7 +214,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
 throw new SparkException("reduceByKeyLocally() does not support array keys")
 }
-def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] = {
+val reducePartition = (iter: Iterator[(K, V)]) => {
--- End diff --

That makes sense.
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1450#discussion_r15040327

--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -214,7 +214,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
 throw new SparkException("reduceByKeyLocally() does not support array keys")
 }
-def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] = {
+val reducePartition = (iter: Iterator[(K, V)]) => {
--- End diff --

And when I said non-obvious, I meant just from looking at the function name and input arguments. Here it is actually straightforward to infer from the remaining lines, but in other situations it is less so.
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1450#discussion_r15040306

--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -214,7 +214,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
 throw new SparkException("reduceByKeyLocally() does not support array keys")
 }
-def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] = {
+val reducePartition = (iter: Iterator[(K, V)]) => {
--- End diff --

I have to push back on the loss of the return type here, since I don't think it's obvious. I know it's kind of a pain to add the whole type specification, though... what would you think about putting a `: Iterator[JHashMap[K, V]]` after the final bracket?
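The suggestion above, a type ascription placed after the closing brace of a function literal, can be sketched roughly as follows. This is only an illustration and not the code from the PR: the object name `ReturnTypeSketch`, the `merge` function, and the concrete `String`/`Int` types are assumptions chosen to keep the example self-contained.

```scala
import java.util.{HashMap => JHashMap}

object ReturnTypeSketch {
  // Hypothetical merge function standing in for a user-supplied reducer.
  val merge: (Int, Int) => Int = _ + _

  // The ascription after the final brace applies to the body block, so the
  // result type is stated explicitly even though the value is a function
  // literal rather than a def with a declared return type.
  val reducePartition = (iter: Iterator[(String, Int)]) => {
    val map = new JHashMap[String, Int]
    iter.foreach { case (k, v) =>
      // Merge with the existing value for this key, if there is one.
      val merged = if (map.containsKey(k)) merge(map.get(k), v) else v
      map.put(k, merged)
    }
    Iterator(map)
  } : Iterator[JHashMap[String, Int]]
}
```

An alternative with the same effect is to annotate the val itself with the full function type, at the cost of repeating the argument type.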
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1450#issuecomment-49256304 Jenkins, retest this please. Flume streaming suite failed. I don't think it is relevant.
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1450#issuecomment-49256041

QA results for PR 1450:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16762/consoleFull
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1450#issuecomment-49251845 QA tests have started for PR 1450. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16762/consoleFull
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1450#issuecomment-49251632 Pushed a new version.
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1450#discussion_r15038311

--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -361,11 +361,11 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
 // groupByKey shouldn't use map side combine because map side combine does not
 // reduce the amount of data shuffled and requires all map side data be inserted
 // into a hash table, leading to more objects in the old gen.
-def createCombiner(v: V) = ArrayBuffer(v)
--- End diff --

We should change all of them actually. I will update the PR.
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1450#discussion_r15038170

--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -361,11 +361,11 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
 // groupByKey shouldn't use map side combine because map side combine does not
 // reduce the amount of data shuffled and requires all map side data be inserted
 // into a hash table, leading to more objects in the old gen.
-def createCombiner(v: V) = ArrayBuffer(v)
--- End diff --

There appear to be ~6 other functions of this type (defs that may be passed into closures), could these also be problematic?
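For readers following the question above, here is a minimal sketch of why a function value that goes through a method on the enclosing object can be more expensive to ship than a self-contained function literal. It does not use Spark at all: `Holder`, `wrap`, `viaMethod`, `viaLiteral` and `serializedSize` are hypothetical names, and plain Java serialization stands in for task serialization. The assumption, which is compiler-dependent, is that a `def` nested inside a method was compiled into a method on the enclosing class, so passing it into a closure behaved like `viaMethod` below; newer Scala compilers may handle nested defs differently.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object ClosureCaptureSketch {
  // Hypothetical stand-in for an object that owns a lot of state,
  // playing the role that PairRDDFunctions (and its RDD) plays in the PR.
  class Holder(val payload: Array[Byte]) extends Serializable {
    // An instance method, comparable to a helper that lives on the class.
    def wrap(v: Int): List[Int] = List(v)

    // Calls an instance method, so the function value must capture `this`.
    def viaMethod: Int => List[Int] = (v: Int) => wrap(v)

    // Self-contained literal: nothing in the body refers to `this`.
    def viaLiteral: Int => List[Int] = (v: Int) => List(v)
  }

  // Size in bytes of an object under plain Java serialization.
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(obj) finally out.close()
    bytes.size()
  }

  def main(args: Array[String]): Unit = {
    val holder = new Holder(new Array[Byte](1 << 20)) // 1 MiB of state
    println(s"via method:  ${serializedSize(holder.viaMethod)} bytes")
    println(s"via literal: ${serializedSize(holder.viaLiteral)} bytes")
  }
}
```

Serializing the first function drags the whole `Holder` along with it, payload included, while the second serializes to a small object. That is the general shape of the problem the PR title describes, shown here under the stated assumptions rather than with the actual Spark classes.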
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1450#issuecomment-49242499 Jenkins, why are you so slow
[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/1450

[SPARK-2534] Avoid pulling in the entire RDD in groupByKey.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rxin/spark agg-closure

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1450.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1450

commit 73b2783fef785941fc966ad32f2fd987b12447ae
Author: Reynold Xin
Date: 2014-07-16T23:34:34Z

    [SPARK-2534] Avoid pulling in the entire RDD in groupByKey.