Hi, here is the description of `sampleByKeyExact`:
"This method differs from [[sampleByKey]] in that we make additional passes over the RDD to create a sample size that's exactly equal to the sum of math.ceil(numItems * samplingRate) over all key values with a 99.99% confidence. When sampling without replacement, we need one additional pass over the RDD to guarantee sample size; when sampling with replacement, we need two additional passes." As you see, `sampleByKeyExact` needs additional passes over the RDD to make sure returning correctly sample size. If you don't need that, you can try `sampleByKey` which is also doing stratified sampling without strict requirement of the correctness of the sample size. Martin Le wrote > Hi all, > > I perform sampling on a DStream by taking samples from RDDs in the > DStream. > I have used two sampling mechanisms: simple random sampling and stratified > sampling. > > Simple random sampling: inputStream.transform(x => x.sample(false, > fraction)). > > Stratified sampling: inputStream.transform(x => x.sampleByKeyExact(false, > fractions)) > > where fractions = Map(“key1”-> fraction, “key2”-> fraction, …, “keyn”-> > fraction). > > I have a question is that why stratified sampling scales poorly with > different sampling fractions in this context? meanwhile simple random > sampling scales well with different sampling fractions (I ran experiments > on 4 nodes cluster )? > > Thank you, > > Martin ----- Liang-Chi Hsieh | @viirya Spark Technology Center http://www.spark.tc/ -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/stratified-sampling-scales-poorly-tp20278p20337.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe e-mail: dev-unsubscr...@spark.apache.org