Hi Sean,

Thanks for your advice; a normal 'val' will suffice. But won't it be
serialized and transferred with every task, for every batch and every
partition? That's why broadcast exists, right?
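
To make the question concrete, here's what I mean (just a sketch;
loadSpamIps, ssc, clickStream and click.ip are placeholders of mine):

// With a plain val, the closure captures the whole set, so (as I
// understand it) it gets serialized into every task that Spark ships:
val spamIps: Set[String] = loadSpamIps()
val clean1 = clickStream.filter(click => !spamIps.contains(click.ip))

// With a broadcast variable, the set is shipped to each executor once
// and cached there; tasks only carry the small Broadcast handle:
val spamBc = ssc.sparkContext.broadcast(loadSpamIps())
val clean2 = clickStream.filter(click => !spamBc.value.contains(click.ip))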

For now I'm going to use a 'val', but I'm still looking for a
broadcast-based solution.
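
The closest I've come up with is to keep a mutable reference to the
broadcast on the driver and re-broadcast whenever the list goes stale,
relying on the fact that transform's body runs on the driver once per
batch. Untested sketch, with loadSpamIps, the HDFS path and the keying
of clickStream all made up by me:

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

object SpamIps {
  // Placeholder loader: read the current spam IP list from some store.
  def loadSpamIps(sc: SparkContext): Set[String] =
    sc.textFile("hdfs:///data/spam-ips.txt").collect().toSet

  private var bc: Broadcast[Set[String]] = _
  private var loadedAt = 0L
  private val refreshMs = 60 * 1000L  // re-broadcast at most once a minute

  // Runs on the driver; re-creates the broadcast when it is stale.
  def current(sc: SparkContext): Broadcast[Set[String]] = synchronized {
    val now = System.currentTimeMillis()
    if (bc == null || now - loadedAt > refreshMs) {
      if (bc != null) bc.unpersist()  // drop the stale copies on executors
      bc = sc.broadcast(loadSpamIps(sc))
      loadedAt = now
    }
    bc
  }
}

// transform's function is evaluated on the driver for every batch, so it
// always picks up the latest broadcast (assuming clickStream is keyed by
// source IP):
val filtered = clickStream.transform { rdd =>
  val spam = SpamIps.current(rdd.sparkContext)
  rdd.filter { case (ip, _) => !spam.value.contains(ip) }
}

I'm not sure how safe it is to unpersist the old broadcast while an
in-flight batch might still be using it, so treat that part as a guess.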


On Sun, Jan 18, 2015 at 5:36 PM, Sean Owen <so...@cloudera.com> wrote:

> I think this problem is not Spark-specific, since you are simply
> side-loading some data into memory. Therefore you do not need an answer
> that uses Spark.
>
> Simply load the data and then poll for an update each time it is
> accessed, or at some reasonable interval? This is just something you
> write in Java/Scala.
> On Jan 17, 2015 2:06 PM, "Ji ZHANG" <zhangj...@gmail.com> wrote:
>
>> Hi,
>>
>> I want to join a DStream with some other dataset, e.g. join a click
>> stream with a spam IP list. I can think of two possible solutions: one
>> is to use a broadcast variable, and the other is to use the transform
>> operation as described in the manual.
>>
>> But the problem is that the spam IP list will be updated outside of
>> the Spark Streaming program, so how can the program be notified to
>> reload the list?
>>
>> For broadcast variables, the problem is that they are immutable.
>>
>> For the transform operation, is it costly to reload the RDD on every
>> batch? If it is, and I use RDD.persist(), does that mean I need to
>> launch a thread that regularly unpersists it so that it picks up the
>> updates?
>>
>> Any ideas will be appreciated. Thanks.
>>
>> --
>> Jerry
>>


-- 
Jerry
