I have a large dataset composed of scores for several thousand segments, along with the timestamps at which those scores occurred. I'd like to apply a technique like reservoir sampling[1]: for every segment, process its records in timestamp order, maintain a sample, and at intervals compute quantiles over that sample. Ideally I'd write a PySpark UDF to do the sampling/quantile-estimation procedure.
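To make that concrete, here's roughly the per-segment logic I have in mind, as plain Python (just a sketch of classic reservoir sampling, Algorithm R, plus nearest-rank quantiles; the function names and quantile grid are placeholders, and none of this is Spark-specific yet):

```python
import random

def reservoir_sample(records, k, seed=None):
    """Algorithm R: keep a uniform random sample of size k from a
    stream, in one pass and O(k) memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if i < k:
            reservoir.append(rec)          # fill the reservoir first
        else:
            j = rng.randint(0, i)          # inclusive; rec kept w.p. k/(i+1)
            if j < k:
                reservoir[j] = rec
    return reservoir

def quantiles(sample, qs=(0.25, 0.5, 0.75)):
    """Nearest-rank empirical quantiles of the sample."""
    xs = sorted(sample)
    return [xs[min(int(q * len(xs)), len(xs) - 1)] for q in qs]

# Per segment: order records by timestamp, stream scores through the
# reservoir, then summarize the sample.
records = sorted([(3, 0.9), (1, 0.2), (2, 0.5), (4, 0.7)])  # (ts, score)
scores = [score for _, score in records]
sample = reservoir_sample(scores, k=3, seed=42)
print(quantiles(sample))
```

The ordered iteration over each segment's records is the part I don't know how to express in Spark.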
It seems like something I should be doing via rdd.map, but it's not clear how to ensure a function processes records in timestamp order within a partition. Any pointers? Thanks, Patrick [1] https://en.wikipedia.org/wiki/Reservoir_sampling