[ https://issues.apache.org/jira/browse/SPARK-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364859#comment-14364859 ]
Sean Owen edited comment on SPARK-6370 at 3/17/15 10:09 AM:
------------------------------------------------------------

Ah. The docs don't explain this behavior indeed. {{fraction}} does also imply to me that it's the size of the sample as a fraction of the total. In fact it's the probability that each element is chosen when "without replacement", and an expected number of times each element is chosen when "with replacement".

EDIT: ... which TBC in both cases also means that the *expected* size of the sample is the given fraction of the input size.

I think it needs a doc update. Would you like to open a PR to elaborate the javadoc / scaladoc / Python doc of all of the sample methods? Wouldn't hurt to doc the {{RandomSampler}} subclasses too.


was (Author: srowen):
Ah. The docs don't explain this behavior indeed. {{fraction}} does also imply to me that it's the size of the sample as a fraction of the total. In fact it's the probability that each element is chosen when "without replacement", and an expected number of times each element is chosen when "with replacement".

I think it needs a doc update. Would you like to open a PR to elaborate the javadoc / scaladoc / Python doc of all of the sample methods? Wouldn't hurt to doc the {{RandomSampler}} subclasses too.

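To make the semantics described in the comment concrete, here is a minimal standalone sketch that needs no Spark at all. It is an illustration, not Spark's code: the object and method names ({{SampleSemantics}}, {{bernoulliSample}}, {{poissonSample}}, {{meanSize}}) are hypothetical, and the Poisson draw uses Knuth's textbook method as a stand-in for whatever the {{RandomSampler}} subclasses actually use. It shows that with 29 elements and {{fraction}} = 0.5 the sample size varies from run to run, while its average settles near 29 * 0.5 = 14.5 in both modes.

```scala
import scala.util.Random

// Hypothetical helper names; a sketch of the semantics only, not Spark's implementation.
object SampleSemantics {

  // "Without replacement": keep each element independently with probability `fraction`.
  def bernoulliSample[T](items: Seq[T], fraction: Double, rng: Random): Seq[T] =
    items.filter(_ => rng.nextDouble() < fraction)

  // Knuth's method for drawing a Poisson(mean) count; adequate for small means.
  def poisson(mean: Double, rng: Random): Int = {
    val limit = math.exp(-mean)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }

  // "With replacement": emit each element a Poisson(fraction)-distributed number
  // of times, so `fraction` is the expected number of copies per element.
  def poissonSample[T](items: Seq[T], fraction: Double, rng: Random): Seq[T] =
    items.flatMap(x => Seq.fill(poisson(fraction, rng))(x))

  // Average sample size over `trials` independent draws.
  def meanSize(trials: Int)(draw: => Seq[_]): Double =
    Seq.fill(trials)(draw.size).sum.toDouble / trials

  def main(args: Array[String]): Unit = {
    val rng = new Random(42L)
    val ids = (1 to 29).map(_.toString) // same element count as in the report
    val fraction = 0.5

    // A handful of individual draws: the sizes differ from run to run.
    val sizes = Seq.fill(6)(poissonSample(ids, fraction, rng).size)
    println(s"six with-replacement sample sizes: ${sizes.mkString(", ")}")

    // Averaged over many draws, both modes approach fraction * n.
    println(s"mean size, with replacement:    ${meanSize(10000)(poissonSample(ids, fraction, rng))}")
    println(s"mean size, without replacement: ${meanSize(10000)(bernoulliSample(ids, fraction, rng))}")
    println(s"fraction * n:                   ${fraction * ids.size}")
  }
}
```

Under this reading, counts such as 16, 8, 18, 15, 11 and 10 for a 29-element input are ordinary variation around 14.5 rather than a sign of a broken sampler.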
> RDD sampling with replacement intermittently yields incorrect number of samples
> -------------------------------------------------------------------------------
>
> Key: SPARK-6370
> URL: https://issues.apache.org/jira/browse/SPARK-6370
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.3.0, 1.2.1
> Environment: Ubuntu 14.04 64-bit, spark-1.3.0-bin-hadoop2.4
> Reporter: Marko Bonaci
> Labels: PoissonSampler, sample, sampler
>
> Here's the repl output:
> {{code:java}}
> scala> uniqueIds.collect
> res10: Array[String] = Array(4, 8, 21, 80, 20, 98, 42, 15, 48, 36, 90, 46, 55, 16, 31, 71, 9, 50, 28, 61, 68, 85, 12, 94, 38, 77, 2, 11, 10)
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[22] at sample at <console>:27
> scala> swr.count
> res17: Long = 16
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[23] at sample at <console>:27
> scala> swr.count
> res18: Long = 8
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[24] at sample at <console>:27
> scala> swr.count
> res19: Long = 18
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[25] at sample at <console>:27
> scala> swr.count
> res20: Long = 15
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[26] at sample at <console>:27
> scala> swr.count
> res21: Long = 11
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[27] at sample at <console>:27
> scala> swr.count
> res22: Long = 10
> {{code}}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org