Hi,

Quoting the description of `sampleByKeyExact`:

"This method differs from [[sampleByKey]] in that we make additional passes over the RDD to
create a sample size that's exactly equal to the sum of math.ceil(numItems * samplingRate)
over all key values with a 99.99% confidence. When sampling without replacement, we need one
additional pass over the RDD to guarantee sample size; when sampling with replacement, we need
two additional passes."

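As you can see, `sampleByKeyExact` needs additional passes over the RDD to
guarantee that it returns exactly the requested sample size. Those extra
passes are most likely why it scales worse than `sample` in your experiments.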
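
To make the difference concrete, here is a small plain-Scala sketch (no Spark
involved). It is only a simplified model of the idea, not Spark's actual
implementation: the object name, the toy data and the 0.5 fractions are all
made up for illustration. It shows that the exact target size depends on the
per-key counts, while a single-pass per-row coin flip only hits that size on
average.

```scala
import scala.util.Random

object ExactVsApproximate {
  // Hypothetical toy data standing in for one RDD of (key, value) pairs.
  val data: Seq[(String, Int)] =
    Seq.fill(1000)("key1" -> 1) ++ Seq.fill(251)("key2" -> 2)

  // Hypothetical sampling fractions, one per key.
  val fractions: Map[String, Double] = Map("key1" -> 0.5, "key2" -> 0.5)

  // Target size from the quoted description: the sum over all keys of
  // math.ceil(numItems * samplingRate). Computing it needs the per-key counts.
  def exactTargetSize(rows: Seq[(String, Int)], f: Map[String, Double]): Long =
    rows.groupBy(_._1).map { case (k, vs) => math.ceil(vs.size * f(k)).toLong }.sum

  // Single-pass approximation: keep each row independently with the
  // probability assigned to its key. The resulting size is random.
  def bernoulliSample(rows: Seq[(String, Int)],
                      f: Map[String, Double],
                      seed: Long): Seq[(String, Int)] = {
    val rng = new Random(seed)
    rows.filter { case (k, _) => rng.nextDouble() < f(k) }
  }

  def main(args: Array[String]): Unit = {
    println(s"exact target size:    ${exactTargetSize(data, fractions)}")
    println(s"one-pass sample size: ${bernoulliSample(data, fractions, seed = 42L).size}")
  }
}
```

With this toy data the exact target is ceil(1000 * 0.5) + ceil(251 * 0.5) =
500 + 126 = 626, whereas the one-pass sample size varies with the seed.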

If you don't need that guarantee, you can try `sampleByKey`, which also does
stratified sampling but without the strict requirement on the exact sample
size.
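
For your streaming job the change would look roughly like the sketch below.
The object and method names are hypothetical, and I am assuming a
`DStream[(String, Int)]` since I don't know your real key and value types;
only `transform`, `sampleByKey` and `sampleByKeyExact` are the actual Spark
calls. It needs spark-core and spark-streaming on the classpath.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

object StratifiedSamplingSketch {
  // Approximate stratified sampling: no extra passes over each batch RDD,
  // but the per-key sample size is only met in expectation.
  def approximate(stream: DStream[(String, Int)],
                  fractions: Map[String, Double]): DStream[(String, Int)] =
    stream.transform { (rdd: RDD[(String, Int)]) =>
      rdd.sampleByKey(withReplacement = false, fractions = fractions)
    }

  // Exact stratified sampling, as in the original job: guarantees the sample
  // size at the cost of additional passes over each batch RDD.
  def exact(stream: DStream[(String, Int)],
            fractions: Map[String, Double]): DStream[(String, Int)] =
    stream.transform { (rdd: RDD[(String, Int)]) =>
      rdd.sampleByKeyExact(withReplacement = false, fractions = fractions)
    }
}
```

Running your experiment with both variants on the same input should show
whether the extra passes account for the difference you measured.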



Martin Le wrote
> Hi all,
> 
> I perform sampling on a DStream by taking samples from RDDs in the
> DStream.
> I have used two sampling mechanisms: simple random sampling and stratified
> sampling.
> 
> Simple random sampling: inputStream.transform(x => x.sample(false,
> fraction)).
> 
> Stratified sampling: inputStream.transform(x => x.sampleByKeyExact(false,
> fractions))
> 
> where fractions = Map("key1" -> fraction, "key2" -> fraction, …, "keyn" ->
> fraction).
> 
> My question is: why does stratified sampling scale poorly with different
> sampling fractions in this context, while simple random sampling scales
> well with different sampling fractions? (I ran the experiments on a
> 4-node cluster.)
> 
> Thank you,
> 
> Martin





-----
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
