[ https://issues.apache.org/jira/browse/SPARK-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364355#comment-14364355 ]
Sean Owen commented on SPARK-6370:
----------------------------------

What's the bug? Each element is sampled with probability 0.5. I think the expected size is about 14.5 (29 elements × 0.5), but not every sample will be exactly that size.

> RDD sampling with replacement intermittently yields incorrect number of samples
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-6370
>                 URL: https://issues.apache.org/jira/browse/SPARK-6370
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.3.0, 1.2.1
>         Environment: Ubuntu 14.04 64-bit, spark-1.3.0-bin-hadoop2.4
>            Reporter: Marko Bonaci
>              Labels: PoissonSampler, sample, sampler
>
> Here's the repl output:
> {{code:java}}
> scala> uniqueIds.collect
> res10: Array[String] = Array(4, 8, 21, 80, 20, 98, 42, 15, 48, 36, 90, 46, 55, 16, 31, 71, 9, 50, 28, 61, 68, 85, 12, 94, 38, 77, 2, 11, 10)
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[22] at sample at <console>:27
> scala> swr.count
> res17: Long = 16
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[23] at sample at <console>:27
> scala> swr.count
> res18: Long = 8
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[24] at sample at <console>:27
> scala> swr.count
> res19: Long = 18
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[25] at sample at <console>:27
> scala> swr.count
> res20: Long = 15
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[26] at sample at <console>:27
> scala> swr.count
> res21: Long = 11
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[27] at sample at <console>:27
> scala> swr.count
> res22: Long = 10
> {{code}}
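
To illustrate why the counts in the report vary from run to run, here is a minimal sketch that does not use Spark at all. It assumes that sampling with replacement draws an independent Poisson(fraction) copy count for each element, as the PoissonSampler label on the issue suggests. Under that assumption the total size is itself Poisson with mean 29 × 0.5 = 14.5 and standard deviation of about 3.8, so sizes such as 8 or 18 are not unusual. The object name `PoissonSampleSizes` and its helpers are hypothetical, chosen only for this illustration.

```scala
import scala.util.Random

// Hypothetical stand-alone simulation; this is not Spark's sampler code.
object PoissonSampleSizes {

  // One Poisson(mean) draw using Knuth's multiplication method.
  // Suitable for small means such as 0.5.
  def poisson(mean: Double, rng: Random): Int = {
    val limit = math.exp(-mean)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }

  // Size of one sample: the sum of the per-element copy counts.
  // Assumption: each of the n elements is emitted Poisson(fraction) times.
  def sampleSize(n: Int, fraction: Double, rng: Random): Int =
    (1 to n).map(_ => poisson(fraction, rng)).sum

  def main(args: Array[String]): Unit = {
    val rng = new Random(42L)
    // Repeat the experiment many times for 29 elements and fraction 0.5.
    val sizes = Seq.fill(10000)(sampleSize(29, 0.5, rng))
    val mean = sizes.sum.toDouble / sizes.size
    println(f"mean size = $mean%.2f, min = ${sizes.min}, max = ${sizes.max}")
  }
}
```

Running `PoissonSampleSizes.main(Array.empty)` prints the average size together with the smallest and largest sizes seen. The average should land close to 14.5 while individual runs spread widely around it, which matches the comment that the fraction sets an expected size rather than an exact one.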