Hi,
Here is the documented description of `sampleByKeyExact`:
"This method differs from [[sampleByKey]] in that we make additional passes over the RDD to
create a sample size that's exactly equal to the sum of math.ceil(numItems * samplingRate)
over all key values with a 99.99% confidence. When sampling without replacement, we need one
additional pass over the RDD to guarantee sample size; when sampling with replacement, we need
two additional passes."
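To make the target size in that description concrete, here is a minimal sketch in plain Scala (no Spark involved). The object name, the per-key counts, and the fractions are all made-up values for illustration. The point is only that the formula needs the number of items per key, which is not known until the data has been counted.

```scala
object ExactSampleSize {
  // Hypothetical per-key item counts and sampling fractions (illustrative values only).
  val counts: Map[String, Long] = Map("key1" -> 1001L, "key2" -> 250L)
  val fractions: Map[String, Double] = Map("key1" -> 0.5, "key2" -> 0.25)

  // Sum of math.ceil(numItems * samplingRate) over all keys.
  // Assumes every key in `counts` also has an entry in `fractions`.
  def targetSize(counts: Map[String, Long], fractions: Map[String, Double]): Long =
    counts.toSeq.map { case (key, n) => math.ceil(n * fractions(key)).toLong }.sum

  def main(args: Array[String]): Unit =
    // ceil(1001 * 0.5) + ceil(250 * 0.25) = 501 + 63 = 564
    println(targetSize(counts, fractions))
}
```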
As the description says, `sampleByKeyExact` makes additional passes over the RDD
to guarantee the exact sample size, and those extra passes are a likely reason
it scales worse than simple random sampling.
If you don't need an exact sample size, try `sampleByKey`, which also performs
stratified sampling but does not strictly guarantee the size of the sample.
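For example, the stratified-sampling line from your message could be switched over roughly like this. This is a minimal sketch, assuming a DStream of (String, String) pairs; the names `SamplingSketch` and `stratifiedSample` are hypothetical, while `transform` and `sampleByKey` are the existing Spark calls.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

object SamplingSketch {
  // Hypothetical helper: `inputStream` and `fractions` stand in for the
  // values used in the original question.
  def stratifiedSample(
      inputStream: DStream[(String, String)],
      fractions: Map[String, Double]): DStream[(String, String)] =
    inputStream.transform { (rdd: RDD[(String, String)]) =>
      // Stratified sampling without the additional passes of sampleByKeyExact;
      // the per-key sample size is only approximate.
      rdd.sampleByKey(withReplacement = false, fractions = fractions)
    }
}
```

With this variant each key ends up with roughly fraction * count items rather than an exact number, so it is only suitable if your experiments can tolerate that.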
Martin Le wrote
> Hi all,
>
> I perform sampling on a DStream by taking samples from RDDs in the
> DStream.
> I have used two sampling mechanisms: simple random sampling and stratified
> sampling.
>
> Simple random sampling: inputStream.transform(x => x.sample(false, fraction)).
>
> Stratified sampling: inputStream.transform(x => x.sampleByKeyExact(false, fractions))
>
> where fractions = Map("key1" -> fraction, "key2" -> fraction, …, "keyn" -> fraction).
>
> My question is: why does stratified sampling scale poorly with different
> sampling fractions in this context, while simple random sampling scales
> well with different sampling fractions? (I ran the experiments on a
> 4-node cluster.)
>
> Thank you,
>
> Martin
-
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/
--
View this message in context:
http://apache-spark-developers-list.1001551.n3.nabble.com/stratified-sampling-scales-poorly-tp20278p20337.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.