[ 
https://issues.apache.org/jira/browse/SPARK-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364859#comment-14364859
 ] 

Sean Owen edited comment on SPARK-6370 at 3/17/15 10:09 AM:
------------------------------------------------------------

Ah. Indeed, the docs don't explain this behavior. To me, {{fraction}} also 
implies that it's the size of the sample as a fraction of the total. In fact 
it's the probability that each element is chosen when sampling "without 
replacement", and the expected number of times each element is chosen when 
sampling "with replacement". EDIT: ... which, to be clear, in both cases also 
means that the *expected* size of the sample is the given fraction of the 
input size, so the actual size varies from run to run.
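
To make the distinction concrete, here is a minimal sketch in plain Scala (no 
Spark dependency) of the two behaviors described above. It is not Spark's 
implementation: the object and method names are made up for illustration, and 
the Poisson draw uses Knuth's multiplication method as a simple stand-in for 
whatever the real samplers use. It assumes a 29-element input like the one in 
the report and {{fraction}} = 0.5.

```scala
import scala.util.Random

// Illustrative only: names here are hypothetical, not Spark APIs.
object SampleFractionSketch {

  // "Without replacement": keep each element independently with
  // probability `fraction`, so each element appears at most once.
  def bernoulliSample[T](items: Seq[T], fraction: Double, rng: Random): Seq[T] =
    items.filter(_ => rng.nextDouble() < fraction)

  // Draw from a Poisson distribution with the given mean using Knuth's
  // multiplication method (adequate for small means such as 0.5).
  def poissonDraw(mean: Double, rng: Random): Int = {
    val limit = math.exp(-mean)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }

  // "With replacement": emit each element a Poisson(fraction)-distributed
  // number of times, so `fraction` is the expected count per element.
  def poissonSample[T](items: Seq[T], fraction: Double, rng: Random): Seq[T] =
    items.flatMap(x => Seq.fill(poissonDraw(fraction, rng))(x))

  def main(args: Array[String]): Unit = {
    val rng = new Random(42L)
    val ids = (1 to 29).map(_.toString)
    val trials = 10000

    val withRepl = Seq.fill(trials)(poissonSample(ids, 0.5, rng).size)
    val withoutRepl = Seq.fill(trials)(bernoulliSample(ids, 0.5, rng).size)

    // Individual sample sizes differ; only the average approaches 29 * 0.5.
    println(s"with replacement, first sizes: ${withRepl.take(6).mkString(", ")}")
    println(s"with replacement, mean size: ${withRepl.sum.toDouble / trials}")
    println(s"without replacement, mean size: ${withoutRepl.sum.toDouble / trials}")
  }
}
```

Under these assumptions the expected sample size is 29 * 0.5 = 14.5 in both 
modes, and individual runs scatter around that value, which is consistent 
with the counts between 8 and 18 shown in the report.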

I think it needs a doc update. Would you like to open a PR to elaborate the 
javadoc / scaladoc / Python doc of all of the sample methods? It wouldn't hurt 
to document the {{RandomSampler}} subclasses too.



> RDD sampling with replacement intermittently yields incorrect number of 
> samples
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-6370
>                 URL: https://issues.apache.org/jira/browse/SPARK-6370
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.3.0, 1.2.1
>         Environment: Ubuntu 14.04 64-bit, spark-1.3.0-bin-hadoop2.4
>            Reporter: Marko Bonaci
>              Labels: PoissonSampler, sample, sampler
>
> Here's the repl output:
> {{code:java}}
> scala> uniqueIds.collect
> res10: Array[String] = Array(4, 8, 21, 80, 20, 98, 42, 15, 48, 36, 90, 46, 
> 55, 16, 31, 71, 9, 50, 28, 61, 68, 85, 12, 94, 38, 77, 2, 11, 10)
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[22] at sample 
> at <console>:27
> scala> swr.count
> res17: Long = 16
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[23] at sample 
> at <console>:27
> scala> swr.count
> res18: Long = 8
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[24] at sample 
> at <console>:27
> scala> swr.count
> res19: Long = 18
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[25] at sample 
> at <console>:27
> scala> swr.count
> res20: Long = 15
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[26] at sample 
> at <console>:27
> scala> swr.count
> res21: Long = 11
> scala> val swr = uniqueIds.sample(true, 0.5)
> swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[27] at sample 
> at <console>:27
> scala> swr.count
> res22: Long = 10
> {{code}}


