I am pasting some of the exchanges I had on this topic via the mailing list
directly, so they may help someone else too. (I don't know why those responses
don't show up here.)

-----------------------------------

Thanks Imran. It does help clarify. I believe I had it right all along then,
but was confused by the documentation's warnings about never changing
broadcast variables.

I've tried it in a local-mode process so far, and it does seem to work as
intended. When (and if!) we start running on a real cluster, I hope this
holds up.

Thanks
NB


On Tue, May 19, 2015 at 6:25 AM, Imran Rashid <iras...@cloudera.com> wrote:

> hmm, I guess it depends on the way you look at it.  In a way, I'm saying
> that spark does *not* have any built-in "auto-re-broadcast" if you try to
> mutate a broadcast variable.  Instead, you should create something new and
> just broadcast it separately.  Then just have all the code operating on
> your RDDs look at the new broadcast variable.
>
> But I guess there is another way to look at it -- you are creating new
> broadcast variables each time, but they all point to the same underlying
> mutable data structure.  So in a way, you are "rebroadcasting" the same
> underlying data structure.
>
> Let me expand my example from earlier a little bit more:
>
>
> def oneIteration(myRDD: RDD[...], myBroadcastVar: Broadcast[...]): Unit =
> {
>  ...
> }
>
> // this is a val, because the data structure itself is mutable
> val myMutableDataStructure = ...
> // this is a var, because you will create new broadcasts
> var myBroadcast = sc.broadcast(myMutableDataStructure)
> (0 to 20).foreach { iteration =>
>   oneIteration(myRDD, myBroadcast)
>   // update your mutable data structure in place
>   myMutableDataStructure.update(...)
>   // ... but that doesn't affect the broadcast variables living out on the
> cluster, so we need to
>   // create a new one
>
>   // this line is not required -- the broadcast var will automatically get
> unpersisted when a gc
>   // cleans up the old broadcast on the driver, but I'm including this
> here for completeness,
>   // in case you want to more proactively clean up old blocks if you are
> low on space
>   myBroadcast.unpersist()
>
>   // now we create a new broadcast which has the updated data in our
> mutable data structure
>   myBroadcast = sc.broadcast(myMutableDataStructure)
> }
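>
> For reference, here is a minimal concrete version of that loop (a sketch,
> runnable in the spark-shell where sc already exists; the lookup map is just
> a placeholder for whatever mutable structure you actually use):
>
> import scala.collection.mutable
> import org.apache.spark.broadcast.Broadcast
> import org.apache.spark.rdd.RDD
>
> val myRDD = sc.parallelize(Seq("a", "b", "c"))
>
> def oneIteration(rdd: RDD[String],
>                  lookup: Broadcast[mutable.Map[String, Int]]): Unit = {
>   // tasks read the snapshot captured when this broadcast was created
>   println(rdd.map(k => lookup.value.getOrElse(k, 0)).sum())
> }
>
> // val: the structure itself is mutable and updated in place on the driver
> val mutableLookup = mutable.Map("a" -> 1, "b" -> 2, "c" -> 3)
> // var: a fresh broadcast is created after every update
> var myBroadcast = sc.broadcast(mutableLookup)
>
> (0 to 20).foreach { iteration =>
>   oneIteration(myRDD, myBroadcast)
>   mutableLookup("a") = iteration            // in-place update, driver side only
>   myBroadcast.unpersist()                   // optional proactive cleanup
>   myBroadcast = sc.broadcast(mutableLookup) // re-broadcast the current state
> }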
>
>
> hope this clarifies things!
>
> Imran
>
> On Tue, May 19, 2015 at 3:06 AM, N B <nb.nos...@gmail.com> wrote:
>
>> Hi Imran,
>>
>> If I understood you correctly, you are suggesting simply calling
>> broadcast again from the driver program. This is exactly what I am hoping
>> will work, as I have the Broadcast data wrapped up and am indeed
>> (re)broadcasting the wrapper again when the underlying data changes.
>> However, the documentation seems to suggest that one cannot re-broadcast.
>> Is my understanding accurate?
>>
>> Thanks
>> NB
>>
>>
>> On Mon, May 18, 2015 at 6:24 PM, Imran Rashid <iras...@cloudera.com>
>> wrote:
>>
>>> Rather than "updating" the broadcast variable, can't you simply create a
>>> new one?  When the old one can be gc'ed in your program, it will also
>>> get
>>> gc'ed from spark's cache (and all executors).
>>>
>>> I think this will make your code *slightly* more complicated, as you
>>> need to add another layer of indirection for which broadcast variable
>>> to use, but it's not too bad.  E.g., from
>>>
>>> var myBroadcast = sc.broadcast(...)
>>> (0 to 20).foreach{ iteration =>
>>>   //  ... some rdd operations that involve myBroadcast ...
>>>   myBroadcast.update(...) // wrong! don't update a broadcast variable
>>> }
>>>
>>> instead do something like:
>>>
>>> def oneIteration(myRDD: RDD[...], myBroadcastVar: Broadcast[...]): Unit
>>> = {
>>>  ...
>>> }
>>>
>>> var myBroadcast = sc.broadcast(...)
>>> (0 to 20).foreach { iteration =>
>>>   oneIteration(myRDD, myBroadcast)
>>>   myBroadcast = sc.broadcast(...) // reassign: create a NEW broadcast
>>> here, with whatever you need to update it
>>> }
>>>
>>> On Sat, May 16, 2015 at 2:01 AM, N B <nb.nos...@gmail.com> wrote:
>>>
>>>> Thanks Ayan. Can we rebroadcast after updating in the driver?
>>>>
>>>> Thanks
>>>> NB.
>>>>
>>>>
>>>> On Fri, May 15, 2015 at 6:40 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> A broadcast variable is shipped to the executors the first time it is
>>>>> accessed in a transformation. It will NOT be updated subsequently, even
>>>>> if the value has changed. However, a new value will be shipped to any
>>>>> new executor that comes into play after the value has changed. For this
>>>>> reason, changing the value of a broadcast variable is not a good idea,
>>>>> as it can create inconsistency within the cluster. From the
>>>>> documentation:
>>>>>
>>>>>  In addition, the object v should not be modified after it is
>>>>> broadcast in order to ensure that all nodes get the same value of the
>>>>> broadcast variable
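>>>>>
>>>>> A small illustrative sketch of this (names are placeholders; in local
>>>>> mode everything shares one JVM, so in-place edits can appear to work
>>>>> there even though they are not reliably propagated on a cluster):
>>>>>
>>>>> val rdd = sc.parallelize(Seq("a"))
>>>>> val lookup = scala.collection.mutable.Map("a" -> 1)
>>>>> val bc = sc.broadcast(lookup)
>>>>> rdd.map(k => bc.value.getOrElse(k, 0)).collect() // executors cache {a -> 1}
>>>>> lookup("a") = 99                                 // driver-side, in-place edit
>>>>> rdd.map(k => bc.value.getOrElse(k, 0)).collect() // an executor that already
>>>>>                                                  // fetched bc can still see {a -> 1}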
>>>>>
>>>>>
>>>>> On Sat, May 16, 2015 at 10:39 AM, N B <nb.nos...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Ilya. Does one have to call broadcast again once the
>>>>>> underlying data is updated, in order to make the changes visible on
>>>>>> all nodes?
>>>>>>
>>>>>> Thanks
>>>>>> NB
>>>>>>
>>>>>>
>>>>>> On Fri, May 15, 2015 at 5:29 PM, Ilya Ganelin <ilgan...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> The broadcast variable is like a pointer. If the underlying data
>>>>>>> changes then the changes will be visible throughout the cluster.
>>>>>>> On Fri, May 15, 2015 at 5:18 PM NB <nb.nos...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> Once a broadcast variable is created using
>>>>>>>> sparkContext.broadcast(), can it
>>>>>>>> ever be updated again? The use case is for something like the
>>>>>>>> underlying
>>>>>>>> lookup data changing over time.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> NB



