[ 
https://issues.apache.org/jira/browse/SPARK-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364355#comment-14364355
 ] 

Sean Owen commented on SPARK-6370:
----------------------------------

What's the bug? With replacement, each element is drawn 0.5 times on
average, so I think the expected size is 29 x 0.5 = 14.5, but not all samples
would be that size.
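
To illustrate why the counts vary, here is a minimal sketch that needs no Spark. It assumes that sample(true, fraction) emits each element a Poisson(fraction)-distributed number of times, which is what the PoissonSampler label on the issue suggests. The object name PoissonSampleSizes and its helpers are made up for this example and are not Spark APIs.

```scala
import scala.util.Random

// Hypothetical stand-alone simulation; not Spark code.
object PoissonSampleSizes {
  // Draw from Poisson(mean) with Knuth's multiplication method (fine for small means).
  def poisson(mean: Double, rng: Random): Int = {
    val limit = math.exp(-mean)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }

  // Size of one with-replacement sample, assuming each of the n elements
  // is emitted Poisson(fraction) times independently.
  def sampleSize(n: Int, fraction: Double, rng: Random): Int =
    (1 to n).map(_ => poisson(fraction, rng)).sum

  def main(args: Array[String]): Unit = {
    val rng = new Random(42L)
    // 29 elements and fraction 0.5, as in the report below.
    val sizes = Seq.fill(10000)(sampleSize(29, 0.5, rng))
    val mean = sizes.sum.toDouble / sizes.size
    println(f"mean size = $mean%.2f, min = ${sizes.min}, max = ${sizes.max}")
    println(s"first six sizes: ${sizes.take(6).mkString(", ")}")
  }
}
```

Under that assumption the total count is itself Poisson with mean 14.5 and standard deviation of about 3.8, so counts such as 8 or 18 would be unremarkable.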



> RDD sampling with replacement intermittently yields incorrect number of 
> samples
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-6370
>                 URL: https://issues.apache.org/jira/browse/SPARK-6370
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.3.0, 1.2.1
>         Environment: Ubuntu 14.04 64-bit, spark-1.3.0-bin-hadoop2.4
>            Reporter: Marko Bonaci
>              Labels: PoissonSampler, sample, sampler
>
> Here's the repl output:
> {{code:java}}
> scala> uniqueIds.collect
> res10: Array[String] = Array(4, 8, 21, 80, 20, 98, 42, 15, 48, 36, 90, 46, 
> 55, 16, 31, 71, 9, 50, 28, 61, 68, 85, 12, 94, 38, 77, 2, 11, 10)
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[22] at sample 
> at <console>:27
> scala> swr.count
> res17: Long = 16
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[23] at sample 
> at <console>:27
> scala> swr.count
> res18: Long = 8
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[24] at sample 
> at <console>:27
> scala> swr.count
> res19: Long = 18
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[25] at sample 
> at <console>:27
> scala> swr.count
> res20: Long = 15
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[26] at sample 
> at <console>:27
> scala> swr.count
> res21: Long = 11
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[27] at sample 
> at <console>:27
> scala> swr.count
> res22: Long = 10
> {{code}}


