[jira] [Commented] (SPARK-6370) RDD sampling with replacement intermittently yields incorrect number of samples

Marko Bonaci (JIRA) Wed, 18 Mar 2015 23:40:20 -0700

    [ https://issues.apache.org/jira/browse/SPARK-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368599#comment-14368599 ]


Marko Bonaci commented on SPARK-6370:
-------------------------------------

Before I send a PR, would something like this be OK? (Feel free to improve it.)

{code:java}
  /**
   * Return a sampled subset of this RDD.
   * 
   * @param withReplacement whether elements are returned back into the pool
   *  upon being sampled
   * @param fraction the __expected__ fraction of this RDD's size to be sampled
   *  without replacement: probability that each element is chosen;
   *  with replacement: expected number of times each element is chosen
   * @param seed seed for the random number generator
   */
  def sample(withReplacement: Boolean,
      fraction: Double,
      seed: Long = Utils.random.nextLong): RDD[T] = {
{code}
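
To make the two readings of `fraction` concrete, here is a small stand-alone sketch that does not use Spark at all. The names `SampleSemantics`, `poisson`, `sampleWithoutReplacement` and `sampleWithReplacement` are hypothetical, and the per-element Poisson draw is an assumption about with-replacement sampling (suggested by the PoissonSampler label on this issue), not a copy of Spark's code. It only illustrates why the result size is an expectation rather than a guarantee.

```scala
import scala.util.Random

object SampleSemantics {
  // Knuth's method for drawing from Poisson(mean); adequate for small means.
  def poisson(mean: Double, rng: Random): Int = {
    val limit = math.exp(-mean)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }

  // Without replacement: keep each element independently with probability `fraction`.
  def sampleWithoutReplacement[T](items: Seq[T], fraction: Double, rng: Random): Seq[T] =
    items.filter(_ => rng.nextDouble() < fraction)

  // With replacement (assumed model): emit each element Poisson(fraction) times.
  def sampleWithReplacement[T](items: Seq[T], fraction: Double, rng: Random): Seq[T] =
    items.flatMap(x => Seq.fill(poisson(fraction, rng))(x))

  def main(args: Array[String]): Unit = {
    val rng = new Random(42L)
    val items = 1 to 29 // same size as uniqueIds in the report
    val withRepl = Seq.fill(6)(sampleWithReplacement(items, 0.5, rng).size)
    val withoutRepl = Seq.fill(6)(sampleWithoutReplacement(items, 0.5, rng).size)
    // Sizes differ from run to run; only their average approaches 29 * 0.5.
    println(s"with replacement:    ${withRepl.mkString(", ")}")
    println(s"without replacement: ${withoutRepl.mkString(", ")}")
  }
}
```

In both modes the sizes printed by `main` fluctuate around 14.5, which is the behaviour the proposed wording ("the expected fraction") is meant to describe.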

> RDD sampling with replacement intermittently yields incorrect number of 
> samples
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-6370
>                 URL: https://issues.apache.org/jira/browse/SPARK-6370
>             Project: Spark
>          Issue Type: Documentation
>          Components: Spark Core
>    Affects Versions: 1.3.0, 1.2.1
>         Environment: Ubuntu 14.04 64-bit, spark-1.3.0-bin-hadoop2.4
>            Reporter: Marko Bonaci
>            Priority: Minor
>              Labels: PoissonSampler, sample, sampler
>
> Here's the repl output:
> {{code:java}}
> scala> uniqueIds.collect
> res10: Array[String] = Array(4, 8, 21, 80, 20, 98, 42, 15, 48, 36, 90, 46, 
> 55, 16, 31, 71, 9, 50, 28, 61, 68, 85, 12, 94, 38, 77, 2, 11, 10)
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[22] at sample 
> at <console>:27
> scala> swr.count
> res17: Long = 16
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[23] at sample 
> at <console>:27
> scala> swr.count
> res18: Long = 8
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[24] at sample 
> at <console>:27
> scala> swr.count
> res19: Long = 18
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[25] at sample 
> at <console>:27
> scala> swr.count
> res20: Long = 15
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[26] at sample 
> at <console>:27
> scala> swr.count
> res21: Long = 11
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[27] at sample 
> at <console>:27
> scala> swr.count
> res22: Long = 10
> {{code}}
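
A rough check of the counts quoted above, under the assumption that with-replacement sampling draws each of the n = 29 elements independently a Poisson(fraction) number of times (this model is an assumption here, not something stated in the report). The total sample size N is then itself Poisson distributed:

```latex
N \sim \mathrm{Poisson}(n \cdot \mathit{fraction}), \qquad
\mathbb{E}[N] = 29 \times 0.5 = 14.5, \qquad
\sigma_N = \sqrt{14.5} \approx 3.81
```

The six observed counts (16, 8, 18, 15, 11, 10) average 13.0, and all of them lie within two standard deviations of 14.5, that is, between roughly 6.9 and 22.1. Under this model the variation is expected rather than a sampler bug.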



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
