[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-21 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59991158
  
Ok I'm merging this thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/2839


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59688582
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21905/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59688580
  
**[Tests timed 
out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21905/consoleFull)**
 for PR 2839 at commit 
[`d6fdb2a`](https://github.com/apache/spark/commit/d6fdb2a40d8fbdcfadf3b27bc82e0bbdbdc808fe)
 after a configured wait of `120m`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59682641
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21905/consoleFull)
 for   PR 2839 at commit 
[`d6fdb2a`](https://github.com/apache/spark/commit/d6fdb2a40d8fbdcfadf3b27bc82e0bbdbdc808fe).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-20 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59813859
  
Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59814936
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21925/consoleFull)
 for   PR 2839 at commit 
[`d6fdb2a`](https://github.com/apache/spark/commit/d6fdb2a40d8fbdcfadf3b27bc82e0bbdbdc808fe).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-20 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59822350
  
LGTM provided it passes tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59822749
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21925/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59822742
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21925/consoleFull)
 for   PR 2839 at commit 
[`d6fdb2a`](https://github.com/apache/spark/commit/d6fdb2a40d8fbdcfadf3b27bc82e0bbdbdc808fe).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-20 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59838812
  
Jenkins, retest this please.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-20 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59866818
  
Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59867118
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21949/consoleFull)
 for   PR 2839 at commit 
[`d6fdb2a`](https://github.com/apache/spark/commit/d6fdb2a40d8fbdcfadf3b27bc82e0bbdbdc808fe).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59871491
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21949/consoleFull)
 for   PR 2839 at commit 
[`d6fdb2a`](https://github.com/apache/spark/commit/d6fdb2a40d8fbdcfadf3b27bc82e0bbdbdc808fe).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59871497
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21949/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-17 Thread aarondav
GitHub user aarondav opened a pull request:

https://github.com/apache/spark/pull/2839

[SPARK-3994] Use standard Aggregator code path for countByKey and 
countByValue

See [JIRA](https://issues.apache.org/jira/browse/SPARK-3994) for more 
information. Also adds
a note which warns against using these methods.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aarondav/spark countByKey

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2839.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2839


commit e1f06d3c550e5419bb3b20745da7cbfdedbb2841
Author: Aaron Davidson aa...@databricks.com
Date:   2014-10-17T22:11:25Z

[SPARK-3994] Use standard Aggregator code path for countByKey and 
countByValue

See [JIRA](https://issues.apache.org/jira/browse/SPARK-3994) for more 
information. Also adds
a note which warns against using these methods.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59583927
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21867/consoleFull)
 for   PR 2839 at commit 
[`e1f06d3`](https://github.com/apache/spark/commit/e1f06d3c550e5419bb3b20745da7cbfdedbb2841).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-17 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/2839#discussion_r19046865
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -911,32 +911,15 @@ abstract class RDD[T: ClassTag](
   }
 
   /**
-   * Return the count of each unique value in this RDD as a map of (value, 
count) pairs. The final
-   * combine step happens locally on the master, equivalent to running a 
single reduce task.
+   * Return the count of each unique value in this RDD as a local map of 
(value, count) pairs.
+   *
+   * Note that this method should only be used if the resulting map is 
expected to be small, as
+   * the whole thing is loaded into the driver's memory.
+   * To handle very large results, consider using rdd.map(x = (x, 
1)).reduceByKey(_ + _), which
+   * returns an RDD[T, Long] instead of a map.
*/
   def countByValue()(implicit ord: Ordering[T] = null): Map[T, Long] = {
-if (elementClassTag.runtimeClass.isArray) {
-  throw new SparkException(countByValue() does not support arrays)
--- End diff --

Note that we still don't support arrays, but this is caught by combineByKey:
`org.apache.spark.SparkException: Cannot use map-side combining with array 
keys.`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-17 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59584036
  
@pwendell If you get a chance, PTAL


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59591794
  
**[Tests timed 
out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21867/consoleFull)**
 for PR 2839 at commit 
[`e1f06d3`](https://github.com/apache/spark/commit/e1f06d3c550e5419bb3b20745da7cbfdedbb2841)
 after a configured wait of `120m`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59591796
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21867/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-17 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59591925
  
Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59592259
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21869/consoleFull)
 for   PR 2839 at commit 
[`e1f06d3`](https://github.com/apache/spark/commit/e1f06d3c550e5419bb3b20745da7cbfdedbb2841).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59594858
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21869/consoleFull)
 for   PR 2839 at commit 
[`e1f06d3`](https://github.com/apache/spark/commit/e1f06d3c550e5419bb3b20745da7cbfdedbb2841).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59594861
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21869/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-17 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/2839#discussion_r19051173
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -238,8 +238,15 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
   @deprecated(Use reduceByKeyLocally, 1.0.0)
   def reduceByKeyToDriver(func: (V, V) = V): Map[K, V] = 
reduceByKeyLocally(func)
 
-  /** Count the number of elements for each key, and return the result to 
the master as a Map. */
-  def countByKey(): Map[K, Long] = self.map(_._1).countByValue()
+  /** 
+   * Count the number of elements for each key, collecting the results to 
a local Map.
+   *
+   * Note that this method should only be used if the resulting map is 
expected to be small, as
+   * the whole thing is loaded into the driver's memory.
+   * To handle very large results, consider using rdd.mapValues(_ = 
1).reduceByKey(_ + _), which
--- End diff --

1L


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-17 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/2839#discussion_r19051176
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -911,32 +911,15 @@ abstract class RDD[T: ClassTag](
   }
 
   /**
-   * Return the count of each unique value in this RDD as a map of (value, 
count) pairs. The final
-   * combine step happens locally on the master, equivalent to running a 
single reduce task.
+   * Return the count of each unique value in this RDD as a local map of 
(value, count) pairs.
+   *
+   * Note that this method should only be used if the resulting map is 
expected to be small, as
+   * the whole thing is loaded into the driver's memory.
+   * To handle very large results, consider using rdd.map(x = (x, 
1)).reduceByKey(_ + _), which
--- End diff --

1L


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-17 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/2839#discussion_r19051180
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -238,8 +238,15 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
   @deprecated(Use reduceByKeyLocally, 1.0.0)
   def reduceByKeyToDriver(func: (V, V) = V): Map[K, V] = 
reduceByKeyLocally(func)
 
-  /** Count the number of elements for each key, and return the result to 
the master as a Map. */
-  def countByKey(): Map[K, Long] = self.map(_._1).countByValue()
+  /** 
+   * Count the number of elements for each key, collecting the results to 
a local Map.
+   *
+   * Note that this method should only be used if the resulting map is 
expected to be small, as
+   * the whole thing is loaded into the driver's memory.
+   * To handle very large results, consider using rdd.mapValues(_ = 
1).reduceByKey(_ + _), which
+   * returns an RDD[T, Long] instead of a map.
+   */
+  def countByKey(): Map[K, Long] = self.mapValues(_ = 1L).reduceByKey(_ + 
_).collect.toMap
--- End diff --

think you need `collect()`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3994] Use standard Aggregator code path...

2014-10-17 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/2839#issuecomment-59595776
  
LGTM. I think @mateiz wrote the original code so maybe he can take a look.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org