I don't think this has anything to do with transferring anything from
the driver, or per task. I'm talking about a singleton object that
loads whatever you want from wherever you want and holds it in memory
once per JVM. That is, I do not think you have to use broadcast, or
even any Spark mechanism.
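
A minimal sketch of what such a singleton could look like in Scala (the object name, refresh interval, and file-based source here are assumptions for illustration; the real source could be HDFS, a database, or anything reachable from every executor):

```scala
import java.nio.file.{Files, Paths}
import scala.jdk.CollectionConverters._

// Loaded once per JVM on first access, and refreshed when the in-memory
// copy is older than refreshMillis. No broadcast variable is involved;
// each executor JVM maintains its own copy.
object SpamIpList {
  private val refreshMillis = 60 * 1000L          // reload at most once a minute
  @volatile private var ips: Set[String] = Set.empty
  @volatile private var loadedAt = 0L

  // Hypothetical file-based source for this sketch.
  var source: String = "/tmp/spam-ips.txt"

  def contains(ip: String): Boolean = {
    val now = System.currentTimeMillis()
    if (now - loadedAt > refreshMillis) synchronized {
      if (now - loadedAt > refreshMillis) {       // double-checked to avoid duplicate loads
        ips = Files.readAllLines(Paths.get(source)).asScala.toSet
        loadedAt = now
      }
    }
    ips.contains(ip)
  }
}
```

You would then call it directly from closures, e.g. `stream.filter(rec => !SpamIpList.contains(rec.ip))`; the object is initialized lazily on each executor, so nothing is shipped from the driver per batch.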

On Mon, Jan 19, 2015 at 2:35 AM, Ji ZHANG <zhangj...@gmail.com> wrote:
> Hi Sean,
>
> Thanks for your advice; a normal 'val' will suffice. But will it be
> serialized and transferred with every batch and every partition? That's why
> broadcast exists, right?
>
> For now I'm going to use 'val', but I'm still looking for a broadcast-based
> solution.
>
>
> On Sun, Jan 18, 2015 at 5:36 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> I think this problem is not Spark-specific, since you are simply
>> side-loading some data into memory. Therefore you do not need an answer
>> that uses Spark.
>>
>> Simply load the data and then poll for an update each time it is accessed,
>> or at some reasonable interval? This is just something you write in Java/Scala.
>>
>> On Jan 17, 2015 2:06 PM, "Ji ZHANG" <zhangj...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I want to join a DStream with some other dataset, e.g. join a click
>>> stream with a spam IP list. I can think of two possible solutions: one
>>> is to use a broadcast variable, and the other is to use the transform
>>> operation as described in the manual.
>>>
>>> But the problem is that the spam IP list will be updated outside of the
>>> Spark Streaming program, so how can the program be notified to reload
>>> the list?
>>>
>>> As for broadcast variables, they are immutable.
>>>
>>> As for the transform operation, is it costly to reload the RDD on every
>>> batch? If it is, and I use RDD.persist(), does it mean I need to
>>> launch a thread to regularly unpersist it so that it picks up the
>>> updates?
>>>
>>> Any ideas will be appreciated. Thanks.
>>>
>>> --
>>> Jerry
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>
>
>
> --
> Jerry
