Hi all,

I perform sampling on a DStream by taking samples from RDDs in the DStream.
I have used two sampling mechanisms: simple random sampling and stratified
sampling.

Simple random sampling: inputStream.transform(x => x.sample(false,
fraction)).

Stratified sampling: inputStream.transform(x => x.sampleByKeyExact(false,
fractions))

where fractions = Map("key1" -> fraction, "key2" -> fraction, …, "keyn" ->
fraction).
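
For concreteness, here is a minimal sketch of the two pipelines side by side.
The socket source, the 1-second batch interval, the "key,value" line format,
the two key names and the fraction value are all placeholders chosen for
illustration, not the actual experiment setup. Note that sampleByKeyExact
expects every key in the RDD to have an entry in fractions, so the sketch
filters out unknown keys first.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SamplingSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder configuration: local master and a 1-second batch interval.
    val conf = new SparkConf().setAppName("SamplingSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Placeholder sampling fraction and per-key fractions.
    val fraction = 0.1
    val fractions: Map[String, Double] =
      Map("key1" -> fraction, "key2" -> fraction)

    // Hypothetical input: text lines of the form "key,value" read from a socket.
    val inputStream = ssc.socketTextStream("localhost", 9999)
      .flatMap { line =>
        line.split(",", 2) match {
          case Array(k, v) => List((k, v))
          case _           => Nil // skip malformed lines
        }
      }
      // Keep only keys that have an entry in `fractions`.
      .filter { case (k, _) => fractions.contains(k) }

    // Simple random sampling: each record is kept with probability `fraction`.
    val simple = inputStream.transform(rdd =>
      rdd.sample(withReplacement = false, fraction))

    // Stratified sampling with exact per-key sample sizes.
    val stratified = inputStream.transform(rdd =>
      rdd.sampleByKeyExact(withReplacement = false, fractions))

    // Print the per-batch sample sizes of both variants.
    simple.count().print()
    stratified.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```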

My question is: why does stratified sampling scale poorly with different
sampling fractions in this context, while simple random sampling scales
well with different sampling fractions? (I ran the experiments on a 4-node
cluster.)

Thank you,

Martin
