stratified sampling scales poorly

2016-12-19 Thread Martin Le
Hi all, I perform sampling on a DStream by taking samples from RDDs in the DStream. I have used two sampling mechanisms: simple random sampling and stratified sampling. Simple random sampling: inputStream.transform(x => x.sample(false, fraction)). Stratified sampling: inputStream.transform(x =>
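For readers following the thread, the two mechanisms mentioned above can be sketched roughly as below. This is only an illustrative sketch, not the poster's code: the stratified snippet in the message is cut off, so the use of `sampleByKey` here is an assumption, and the names `DStreamSampling`, `simpleSampleRdd`, `stratifiedSampleRdd`, `simpleSample`, and `stratifiedSample` are hypothetical. It assumes Spark Streaming is on the classpath and that the stream is keyed by stratum.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream
import scala.reflect.ClassTag

// Hypothetical helper object; names are illustrative, not from the thread.
object DStreamSampling {

  // Simple random sampling on one RDD: each record is kept
  // independently with probability `fraction` (no replacement).
  def simpleSampleRdd[T](rdd: RDD[T], fraction: Double): RDD[T] =
    rdd.sample(withReplacement = false, fraction)

  // Stratified sampling on one keyed RDD: records with key k are kept
  // with probability fractions(k). Every key present in the RDD must
  // have an entry in `fractions`.
  // Assumption: sampleByKey stands in for the truncated original code.
  def stratifiedSampleRdd[K: ClassTag, V: ClassTag](
      rdd: RDD[(K, V)],
      fractions: Map[K, Double]): RDD[(K, V)] =
    rdd.sampleByKey(withReplacement = false, fractions)

  // Apply the RDD-level samplers to every batch of a DStream.
  def simpleSample[T: ClassTag](stream: DStream[T], fraction: Double): DStream[T] =
    stream.transform(rdd => simpleSampleRdd(rdd, fraction))

  def stratifiedSample[K: ClassTag, V: ClassTag](
      stream: DStream[(K, V)],
      fractions: Map[K, Double]): DStream[(K, V)] =
    stream.transform(rdd => stratifiedSampleRdd(rdd, fractions))
}
```

Both samplers are approximate: the per-batch sample size varies around the requested fraction rather than matching it exactly.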

Re: sampling operation for DStream

2016-08-01 Thread Martin Le
How do I do that? If I put the queue inside the .transform operation, it doesn't work. On Mon, Aug 1, 2016 at 6:43 PM, Cody Koeninger <c...@koeninger.org> wrote: > Can you keep a queue per executor in memory? > > On Mon, Aug 1, 2016 at 11:24 AM, Martin Le <martin.leq...@gmail.com>

Re: sampling operation for DStream

2016-08-01 Thread Martin Le
y were > evenly balanced. > > But once you've read the messages, nothing's stopping you from > filtering most of them out before doing further processing. The > dstream .transform method will let you do any filtering / sampling you > could have done on an rdd. > > On

sampling operation for DStream

2016-07-29 Thread Martin Le
Hi all, I have to handle a high-rate data stream. To reduce the heavy load, I want to use sampling techniques for each stream window, that is, to process a subset of the data instead of the whole window. I saw that Spark supports sampling operations for RDDs, but for DStream, Spark supports