[GitHub] spark pull request: Reservoir sampling implementation.
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1478 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: Reservoir sampling implementation.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1478#issuecomment-49471598 Merging in master.
[GitHub] spark pull request: Reservoir sampling implementation.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1478#issuecomment-49412218

QA results for PR 1478:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16820/consoleFull
[GitHub] spark pull request: Reservoir sampling implementation.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1478#issuecomment-49404657 QA tests have started for PR 1478. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16820/consoleFull
[GitHub] spark pull request: Reservoir sampling implementation.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1478#issuecomment-49404318 Jenkins, retest this please.
[GitHub] spark pull request: Reservoir sampling implementation.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1478#issuecomment-49404149

QA results for PR 1478:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16819/consoleFull
[GitHub] spark pull request: Reservoir sampling implementation.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1478#issuecomment-49403909 QA tests have started for PR 1478. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16819/consoleFull
[GitHub] spark pull request: Reservoir sampling implementation.
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1478#discussion_r15099390

--- Diff: core/src/main/scala/org/apache/spark/util/random/SamplingUtils.scala ---

    @@ -17,9 +17,49 @@
     package org.apache.spark.util.random

    +import scala.reflect.ClassTag
    +
     private[spark] object SamplingUtils {

       /**
    +   * Reservoir sampling implementation that also returns the input size.
    +   *
    +   * @param input input size
    +   * @param k reservoir size
    +   * @return (samples, input size)
    +   */
    +  def reservoirSampleAndCount[T: ClassTag](input: Iterator[T], k: Int): (Array[T], Int) = {
    +    val reservoir = new Array[T](k)
    +    // Put the first k elements in the reservoir.
    +    var i = 0
    +    while (i < k && input.hasNext) {
    +      val item = input.next()
    +      reservoir(i) = item
    +      i += 1
    +    }
    +
    +    // If we have consumed all the elements, return them. Otherwise do the replacement.
    +    if (i < k) {
    +      // If input size < k, trim the array to return only an array of input size.
    +      val trimReservoir = new Array[T](i)
    +      System.arraycopy(reservoir, 0, trimReservoir, 0, i)
    +      (trimReservoir, i)
    +    } else {
    +      // If input size > k, continue the sampling process.
    +      val rand = new XORShiftRandom

--- End diff --

Please use a deterministic seed passed as an argument.
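For readers following the review comment above, here is a minimal sketch of what a seeded reservoir sampler could look like. It is not the code that was merged: the quoted diff is cut off before the replacement loop, so the loop below is the textbook reservoir sampling step (Algorithm R), the `seed` parameter and the `ReservoirSketch` object name are assumptions made for illustration, and `java.util.Random` stands in for Spark's `XORShiftRandom` so the example runs without Spark on the classpath.

```scala
import java.util.Random
import scala.reflect.ClassTag

// Hypothetical standalone object, not Spark's SamplingUtils.
object ReservoirSketch {

  /**
   * Samples up to k items uniformly from `input` and counts the input.
   *
   * @param input iterator over the items to sample from
   * @param k     reservoir size
   * @param seed  seed for the random number generator (assumed parameter)
   * @return (samples, input size)
   */
  def reservoirSampleAndCount[T: ClassTag](
      input: Iterator[T],
      k: Int,
      seed: Long): (Array[T], Int) = {
    val reservoir = new Array[T](k)

    // Put the first k elements in the reservoir.
    var i = 0
    while (i < k && input.hasNext) {
      reservoir(i) = input.next()
      i += 1
    }

    if (i < k) {
      // Fewer than k elements in total: return only the filled prefix.
      (reservoir.take(i), i)
    } else {
      // Seeding the generator makes the sample reproducible for a given input.
      val rand = new Random(seed)
      while (input.hasNext) {
        val item = input.next()
        i += 1 // i is now the number of items seen, including this one
        // Keep the new item with probability k / i by drawing a slot in [0, i).
        val slot = rand.nextInt(i)
        if (slot < k) {
          reservoir(slot) = item
        }
      }
      (reservoir, i)
    }
  }
}
```

With a seed in the signature, two calls over the same input and the same seed return the same sample, which is what makes a unit test for the sampler repeatable.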
[GitHub] spark pull request: Reservoir sampling implementation.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1478#issuecomment-49401911

QA results for PR 1478:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16810/consoleFull
[GitHub] spark pull request: Reservoir sampling implementation.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1478#issuecomment-49400124

QA results for PR 1478:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16808/consoleFull
[GitHub] spark pull request: Reservoir sampling implementation.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1478#issuecomment-49396936 QA tests have started for PR 1478. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16810/consoleFull
[GitHub] spark pull request: Reservoir sampling implementation.
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1478#discussion_r15096878

--- Diff: core/src/main/scala/org/apache/spark/util/random/SamplingUtils.scala ---

    @@ -17,9 +17,49 @@
     package org.apache.spark.util.random

    +import scala.reflect.ClassTag
    +
     private[spark] object SamplingUtils {

       /**
    +   * Reservoir Sampling implementation.
    +   *
    +   * @param input input size
    +   * @param k reservoir size
    +   * @return (samples, input size)
    +   */
    +  def reservoirSample[T: ClassTag](input: Iterator[T], k: Int): (Array[T], Int) = {

--- End diff --

`reservoirSampleAndCount`?
[GitHub] spark pull request: Reservoir sampling implementation.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1478#issuecomment-49395696 QA tests have started for PR 1478. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16808/consoleFull
[GitHub] spark pull request: Reservoir sampling implementation.
GitHub user rxin opened a pull request:

    https://github.com/apache/spark/pull/1478

    Reservoir sampling implementation.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rxin/spark reservoirSample

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1478.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1478

commit 69400105159f64f0672da896c313e2a22525d219
Author: Reynold Xin
Date: 2014-07-18T05:00:13Z

    Reservoir sampling implementation.