[ https://issues.apache.org/jira/browse/SPARK-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368599#comment-14368599 ]
Marko Bonaci commented on SPARK-6370:
-------------------------------------

Before sending a PR, would something like this be OK? (Feel free to improve it.)

{code:java}
  /**
   * Return a sampled subset of this RDD.
   *
   * @param withReplacement whether elements are returned back into the pool upon being sampled
   * @param fraction the __expected__ fraction of this RDD's size to be sampled
   *  without replacement: probability that each element is chosen;
   *  with replacement: expected number of times each element is chosen
   * @param seed seed for the random number generator
   */
  def sample(withReplacement: Boolean,
      fraction: Double,
      seed: Long = Utils.random.nextLong): RDD[T] = {
{code}

> RDD sampling with replacement intermittently yields incorrect number of
> samples
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-6370
>                 URL: https://issues.apache.org/jira/browse/SPARK-6370
>             Project: Spark
>          Issue Type: Documentation
>          Components: Spark Core
>    Affects Versions: 1.3.0, 1.2.1
>         Environment: Ubuntu 14.04 64-bit, spark-1.3.0-bin-hadoop2.4
>            Reporter: Marko Bonaci
>            Priority: Minor
>              Labels: PoissonSampler, sample, sampler
>
> Here's the repl output:
> {{code:java}}
> scala> uniqueIds.collect
> res10: Array[String] = Array(4, 8, 21, 80, 20, 98, 42, 15, 48, 36, 90, 46,
> 55, 16, 31, 71, 9, 50, 28, 61, 68, 85, 12, 94, 38, 77, 2, 11, 10)
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[22] at sample
> at <console>:27
> scala> swr.count
> res17: Long = 16
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[23] at sample
> at <console>:27
> scala> swr.count
> res18: Long = 8
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[24] at sample
> at <console>:27
> scala> swr.count
> res19: Long = 18
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[25] at sample
> at <console>:27
> scala> swr.count
> res20: Long = 15
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[26] at sample
> at <console>:27
> scala> swr.count
> res21: Long = 11
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[27] at sample
> at <console>:27
> scala> swr.count
> res22: Long = 10
> {{code}}
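
For anyone wondering why the counts in the report vary, here is a rough, Spark-free sketch of the two behaviours the proposed Scaladoc describes. It is only an illustration, not Spark's actual sampler code: the object name `SampleSizeSketch` and its helper methods are made up for this example, and the Poisson draw uses Knuth's simple multiplication method rather than whatever Spark uses internally. The reported RDD has 29 elements, so with `fraction = 0.5` the expected sample size is 14.5 in both modes, but any single run can land well above or below that.

```scala
import scala.util.Random

// Hypothetical helper object for illustration only; not part of Spark.
object SampleSizeSketch {

  // Draw a Poisson(mean) variate with Knuth's multiplication method.
  // Adequate for small means such as the 0.5 used in the report.
  def poisson(mean: Double, rng: Random): Int = {
    val limit = math.exp(-mean)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }

  // With replacement: every element is emitted Poisson(fraction) times,
  // so `fraction` is the expected number of times each element is chosen.
  def sampleWithReplacement[T](data: Seq[T], fraction: Double, rng: Random): Seq[T] =
    data.flatMap(x => Seq.fill(poisson(fraction, rng))(x))

  // Without replacement: every element is kept independently with
  // probability `fraction`.
  def sampleWithoutReplacement[T](data: Seq[T], fraction: Double, rng: Random): Seq[T] =
    data.filter(_ => rng.nextDouble() < fraction)

  def main(args: Array[String]): Unit = {
    val ids = (1 to 29).map(_.toString) // 29 elements, like the RDD in the report
    val rng = new Random(42L)

    // Six runs of each mode; sizes scatter around 29 * 0.5 = 14.5.
    val withRepl = Seq.fill(6)(sampleWithReplacement(ids, 0.5, rng).size)
    val withoutRepl = Seq.fill(6)(sampleWithoutReplacement(ids, 0.5, rng).size)

    println(s"with replacement:    $withRepl")
    println(s"without replacement: $withoutRepl")
  }
}
```

In neither mode is the sample size fixed at `fraction * size`; it is only the expected value, which is why "the __expected__ fraction" in the proposed doc seems like the right emphasis.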