Put the queue in a static variable that is first referenced on the
workers (inside an RDD closure).  That way it will be created on each
of the workers, not on the driver.

The easiest way to do that is with a lazy val in a companion object.
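
Something like this (a minimal sketch; the queue type, socket source,
and names are placeholders, not prescriptive):

    import java.util.concurrent.ConcurrentLinkedQueue

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // The lazy val is initialized on first access.  Because it is only
    // referenced inside the RDD closure below, each executor JVM builds
    // its own queue; the driver never materializes one.
    object ExecutorReservoir {
      lazy val queue = new ConcurrentLinkedQueue[String]()
    }

    object Example {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("reservoir-example")
        val ssc = new StreamingContext(conf, Seconds(10))
        val stream = ssc.socketTextStream("localhost", 9999) // placeholder source

        stream.transform { rdd =>
          rdd.mapPartitions { iter =>
            iter.map { item =>
              ExecutorReservoir.queue.offer(item) // executor-local side effect
              item
            }
          }
        }.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }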

On Mon, Aug 1, 2016 at 3:22 PM, Martin Le <martin.leq...@gmail.com> wrote:
> How to do that?  If I put the queue inside the .transform operation, it
> doesn't work.
>
> On Mon, Aug 1, 2016 at 6:43 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Can you keep a queue per executor in memory?
>>
>> On Mon, Aug 1, 2016 at 11:24 AM, Martin Le <martin.leq...@gmail.com>
>> wrote:
>> > Hi Cody and all,
>> >
>> > Thank you for your answer. I implemented simple random sampling (SRS)
>> > for DStream using the transform method, and it works fine.
>> > However, I have a problem when I implement reservoir sampling (RS). In
>> > RS, I need to maintain a reservoir (a queue) to store selected data
>> > items (RDDs). If I define a large stream window, the queue grows as
>> > well, and the driver runs out of memory.  I explain my problem in
>> > detail here:
>> >
>> > https://docs.google.com/document/d/1YBV_eHH6U_dVF1giiajuG4ayoVN5R8Qw3IthoKvW5ok
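>> >
>> > (For reference, the bounded per-partition version I am aiming for is
>> > classic Algorithm R, which keeps memory at O(k) no matter how long
>> > the window is; names like reservoirSample below are illustrative
>> > only:
>> >
>> >     import scala.reflect.ClassTag
>> >     import scala.util.Random
>> >
>> >     // Algorithm R over one partition's iterator, reservoir size k.
>> >     def reservoirSample[T: ClassTag](iter: Iterator[T], k: Int): Iterator[T] = {
>> >       val reservoir = new Array[T](k)
>> >       var i = 0
>> >       while (iter.hasNext) {
>> >         val item = iter.next()
>> >         if (i < k) {
>> >           reservoir(i) = item               // fill phase
>> >         } else {
>> >           val j = Random.nextInt(i + 1)     // uniform in [0, i]
>> >           if (j < k) reservoir(j) = item    // keep with probability k/(i+1)
>> >         }
>> >         i += 1
>> >       }
>> >       reservoir.take(math.min(i, k)).iterator
>> >     }
>> >
>> > e.g. inside transform: rdd.mapPartitions(it => reservoirSample(it, 1000)).
>> > The problem above is where to keep the reservoir between batches.)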
>> >
>> > Could you please give me some suggestions or advice to fix this problem?
>> >
>> > Thanks
>> >
>> > On Fri, Jul 29, 2016 at 6:28 PM, Cody Koeninger <c...@koeninger.org>
>> > wrote:
>> >>
>> >> With most streaming systems you're still going to incur the cost of
>> >> reading each message... I suppose you could rotate among reading just
>> >> the latest messages from a single partition of a Kafka topic if they
>> >> were evenly balanced.
>> >>
>> >> But once you've read the messages, nothing's stopping you from
>> >> filtering most of them out before doing further processing.  The
>> >> DStream .transform method will let you do any filtering / sampling
>> >> you could have done on an RDD.
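>> >>
>> >> For instance (a sketch; stream is assumed to be a DStream, and the
>> >> 10% fraction is arbitrary):
>> >>
>> >>     // Uniform per-batch sample, computed on the executors:
>> >>     val sampled = stream.transform { rdd =>
>> >>       rdd.sample(withReplacement = false, fraction = 0.1)
>> >>     }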
>> >>
>> >> On Fri, Jul 29, 2016 at 9:57 AM, Martin Le <martin.leq...@gmail.com>
>> >> wrote:
>> >> > Hi all,
>> >> >
>> >> > I have to handle a high-rate data stream. To reduce the heavy load,
>> >> > I want to use sampling techniques on each stream window, i.e. to
>> >> > process a subset of the data instead of the whole window. I saw that
>> >> > Spark supports sampling operations for RDDs; does it support a
>> >> > sampling operation for DStreams as well? If not, could you please
>> >> > suggest how to implement it?
>> >> >
>> >> > Thanks,
>> >> > Martin
>> >
>> >
>
>
