Hi all, I perform sampling on a DStream by taking samples from RDDs in the DStream. I have used two sampling mechanisms: simple random sampling and stratified sampling.
Simple random sampling: inputStream.transform(x => x.sample(false, fraction))
Stratified sampling: inputStream.transform(x => x.sampleByKeyExact(false, fractions)), where fractions = Map("key1" -> fraction, "key2" -> fraction, …, "keyn" -> fraction).

My question: why does stratified sampling scale poorly as the sampling fraction varies in this context, while simple random sampling scales well? (I ran the experiments on a 4-node cluster.)

Thank you,
Martin
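For reference, here is a minimal sketch of the two sampling calls described above. Everything except the two transform calls is an assumption made for illustration, not the actual setup: the socket source on localhost:9999, the "key,value" line format, the 1-second batch interval, the local[2] master, the two key names, and the fraction of 0.1.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object SamplingSketch {
  def main(args: Array[String]): Unit = {
    // Assumed configuration: local master and 1-second batches (placeholders).
    val conf = new SparkConf().setAppName("SamplingSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Assumed sampling fraction, with the same fraction for every stratum.
    val fraction = 0.1
    val fractions: Map[String, Double] =
      Map("key1" -> fraction, "key2" -> fraction)

    // Assumed input: text lines of the form "key,value" read from a socket.
    // Lines without a comma would fail the pattern match; a real job
    // should validate its input.
    val inputStream: DStream[(String, String)] =
      ssc.socketTextStream("localhost", 9999)
        .map { line =>
          val Array(k, v) = line.split(",", 2)
          (k, v)
        }
        // Keep only keys that have an entry in `fractions`, so the
        // stratified sampler can look up a fraction for every key it sees.
        .filter { case (k, _) => fractions.contains(k) }

    // Simple random sampling without replacement on each RDD of the stream.
    val simpleSample =
      inputStream.transform(rdd => rdd.sample(withReplacement = false, fraction))

    // Stratified sampling without replacement on each RDD of the stream.
    val stratifiedSample =
      inputStream.transform(rdd =>
        rdd.sampleByKeyExact(withReplacement = false, fractions))

    // Print the per-batch sample sizes so the two approaches can be compared.
    simpleSample.count().print()
    stratifiedSample.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With this sketch, the only thing changed between runs would be the value of fraction, so any difference in per-batch processing time could be attributed to the sampling step.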