[GitHub] spark pull request: Reservoir sampling implementation.

2014-07-18 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1478


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Reservoir sampling implementation.

2014-07-18 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1478#issuecomment-49471598
  
Merging in master.




[GitHub] spark pull request: Reservoir sampling implementation.

2014-07-18 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1478#issuecomment-49412218
  
QA results for PR 1478:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16820/consoleFull




[GitHub] spark pull request: Reservoir sampling implementation.

2014-07-18 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1478#issuecomment-49404657
  
QA tests have started for PR 1478. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16820/consoleFull




[GitHub] spark pull request: Reservoir sampling implementation.

2014-07-18 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1478#issuecomment-49404318
  
Jenkins, retest this please.




[GitHub] spark pull request: Reservoir sampling implementation.

2014-07-18 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1478#issuecomment-49404149
  
QA results for PR 1478:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16819/consoleFull




[GitHub] spark pull request: Reservoir sampling implementation.

2014-07-18 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1478#issuecomment-49403909
  
QA tests have started for PR 1478. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16819/consoleFull




[GitHub] spark pull request: Reservoir sampling implementation.

2014-07-18 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1478#discussion_r15099390
  
--- Diff: core/src/main/scala/org/apache/spark/util/random/SamplingUtils.scala ---
@@ -17,9 +17,49 @@
 
 package org.apache.spark.util.random
 
+import scala.reflect.ClassTag
+
 private[spark] object SamplingUtils {
 
   /**
+   * Reservoir sampling implementation that also returns the input size.
+   *
+   * @param input input size
+   * @param k reservoir size
+   * @return (samples, input size)
+   */
+  def reservoirSampleAndCount[T: ClassTag](input: Iterator[T], k: Int): (Array[T], Int) = {
+    val reservoir = new Array[T](k)
+    // Put the first k elements in the reservoir.
+    var i = 0
+    while (i < k && input.hasNext) {
+      val item = input.next()
+      reservoir(i) = item
+      i += 1
+    }
+
+    // If we have consumed all the elements, return them. Otherwise do the replacement.
+    if (i < k) {
+      // If input size < k, trim the array to return only an array of input size.
+      val trimReservoir = new Array[T](i)
+      System.arraycopy(reservoir, 0, trimReservoir, 0, i)
+      (trimReservoir, i)
+    } else {
+      // If input size > k, continue the sampling process.
+      val rand = new XORShiftRandom
--- End diff --

Please use a deterministic seed passed as an argument.
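
For illustration, a minimal sketch of what a seeded variant could look like. This is not the code that was merged: the `seed` parameter, the `ReservoirSketch` object name, and the use of `java.util.Random` in place of Spark's private `XORShiftRandom` are all assumptions made so the example is self-contained.

```scala
import java.util.Random
import scala.reflect.ClassTag

// Hypothetical standalone object; the PR places the method in SamplingUtils.
object ReservoirSketch {

  /**
   * Reservoir sampling that also returns the number of input elements.
   *
   * @param input iterator over the input elements
   * @param k     reservoir size
   * @param seed  random seed (assumed parameter, per the review comment)
   * @return (samples, input size)
   */
  def reservoirSampleAndCount[T: ClassTag](
      input: Iterator[T], k: Int, seed: Long): (Array[T], Int) = {
    val reservoir = new Array[T](k)
    // Put the first k elements in the reservoir.
    var i = 0
    while (i < k && input.hasNext) {
      reservoir(i) = input.next()
      i += 1
    }

    if (i < k) {
      // Input had fewer than k elements: return only the filled prefix.
      (reservoir.take(i), i)
    } else {
      // java.util.Random stands in for XORShiftRandom here (assumption).
      val rand = new Random(seed)
      while (input.hasNext) {
        val item = input.next()
        // i elements have been seen so far; draw a slot uniformly from [0, i]
        // and keep the new item only if the slot falls inside the reservoir.
        val replacementIndex = rand.nextInt(i + 1)
        if (replacementIndex < k) {
          reservoir(replacementIndex) = item
        }
        i += 1
      }
      (reservoir, i)
    }
  }
}
```

With a fixed seed, two calls over the same input return the same sample, which is what makes tests on top of this reproducible. The `Int` counter follows the signature in the diff, so this sketch does not handle inputs with `Int.MaxValue` or more elements.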




[GitHub] spark pull request: Reservoir sampling implementation.

2014-07-18 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1478#issuecomment-49401911
  
QA results for PR 1478:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16810/consoleFull




[GitHub] spark pull request: Reservoir sampling implementation.

2014-07-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1478#issuecomment-49400124
  
QA results for PR 1478:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16808/consoleFull




[GitHub] spark pull request: Reservoir sampling implementation.

2014-07-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1478#issuecomment-49396936
  
QA tests have started for PR 1478. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16810/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Reservoir sampling implementation.

2014-07-17 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1478#discussion_r15096878
  
--- Diff: core/src/main/scala/org/apache/spark/util/random/SamplingUtils.scala ---
@@ -17,9 +17,49 @@
 
 package org.apache.spark.util.random
 
+import scala.reflect.ClassTag
+
 private[spark] object SamplingUtils {
 
   /**
+   * Reservoir Sampling implementation.
+   *
+   * @param input input size
+   * @param k reservoir size
+   * @return (samples, input size)
+   */
+  def reservoirSample[T: ClassTag](input: Iterator[T], k: Int): (Array[T], Int) = {
--- End diff --

`reservoirSampleAndCount`?




[GitHub] spark pull request: Reservoir sampling implementation.

2014-07-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1478#issuecomment-49395696
  
QA tests have started for PR 1478. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16808/consoleFull




[GitHub] spark pull request: Reservoir sampling implementation.

2014-07-17 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/1478

Reservoir sampling implementation.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark reservoirSample

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1478.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1478


commit 69400105159f64f0672da896c313e2a22525d219
Author: Reynold Xin 
Date:   2014-07-18T05:00:13Z

Reservoir sampling implementation.



