I have a large dataset composed of scores for several thousand segments, along with the timestamps at which those scores occurred. I'd like to apply a technique like reservoir sampling[1]: for every segment, process its records in timestamp order, maintain a sample, and at intervals compute quantiles over that sample. Ideally I'd write a PySpark UDF to do the sampling/quantile-estimation procedure.
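To make that concrete, here's roughly the per-segment logic I have in mind, as plain Python (just a sketch of classic reservoir sampling, Algorithm R, plus nearest-rank quantiles; the function names and quantile grid are placeholders, and none of this is Spark-specific yet):

```python
import random

def reservoir_sample(records, k, seed=None):
    """Algorithm R: keep a uniform random sample of size k from a
    stream, in one pass and O(k) memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if i < k:
            reservoir.append(rec)          # fill the reservoir first
        else:
            j = rng.randint(0, i)          # inclusive; rec kept w.p. k/(i+1)
            if j < k:
                reservoir[j] = rec
    return reservoir

def quantiles(sample, qs=(0.25, 0.5, 0.75)):
    """Nearest-rank empirical quantiles of the sample."""
    xs = sorted(sample)
    return [xs[min(int(q * len(xs)), len(xs) - 1)] for q in qs]

# Per segment: order records by timestamp, stream scores through the
# reservoir, then summarize the sample.
records = sorted([(3, 0.9), (1, 0.2), (2, 0.5), (4, 0.7)])  # (ts, score)
scores = [score for _, score in records]
sample = reservoir_sample(scores, k=3, seed=42)
print(quantiles(sample))
```

The ordered iteration over each segment's records is the part I don't know how to express in Spark.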
It seems like something I should be doing via rdd.map, but it's not clear how to ensure a function processes records in timestamp order within a partition. Any pointers? Thanks, Patrick [1] https://en.wikipedia.org/wiki/Reservoir_sampling