Hmm, I guess it depends on how you look at it.  In a way, I'm saying
that Spark does *not* have any built-in "auto-re-broadcast" if you try to
mutate a broadcast variable.  Instead, you should create something new and
just broadcast it separately.  Then have all the code operating on your
RDDs look at the new broadcast variable.

But I guess there is another way to look at it -- you are creating new
broadcast variables each time, but they all point to the same underlying
mutable data structure.  So in a way, you are "rebroadcasting" the same
underlying data structure.

Let me expand my example from earlier a little bit more:


def oneIteration(myRDD: RDD[...], myBroadcastVar: Broadcast[...]): Unit = {
  ...
}

// this is a val, because the data structure itself is mutable
val myMutableDataStructure = ...
// this is a var, because you will create new broadcasts
var myBroadcast = sc.broadcast(myMutableDataStructure)
(0 to 20).foreach { iteration =>
  oneIteration(myRDD, myBroadcast)
  // update your mutable data structure in place
  myMutableDataStructure.update(...)
  // ... but that doesn't affect the broadcast variables living out on the
  // cluster, so we need to create a new one

  // this line is not required -- the broadcast var will automatically get
  // unpersisted when a gc cleans up the old broadcast on the driver, but I'm
  // including it here for completeness, in case you want to more proactively
  // clean up old blocks if you are low on space
  myBroadcast.unpersist()

  // now we create a new broadcast which has the updated data in our
  // mutable data structure
  myBroadcast = sc.broadcast(myMutableDataStructure)
}
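
By the way, if you want to tidy up that "layer of indirection", one option is to wrap the broadcast-and-refresh dance in a small driver-side helper class.  This is just a sketch of the idea, not anything in Spark's API -- the class and method names (`RefreshableBroadcast`, `current`, `refresh`) are made up for illustration:

```scala
import scala.reflect.ClassTag

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Hypothetical helper -- NOT part of Spark's API. It lives on the driver
// only; inside your RDD closures you still capture and read the plain
// Broadcast handle, exactly as before.
class RefreshableBroadcast[T: ClassTag](sc: SparkContext, initial: T) {
  private var bc: Broadcast[T] = sc.broadcast(initial)

  // The current broadcast handle to pass into your RDD operations.
  def current: Broadcast[T] = bc

  // Drop the old blocks and broadcast a fresh snapshot of the data.
  def refresh(updated: T): Unit = {
    bc.unpersist() // optional, as above -- just proactive cleanup
    bc = sc.broadcast(updated)
  }
}
```

With something like that, the loop body becomes roughly: call `oneIteration(myRDD, rb.current)`, mutate your data structure in place, then call `rb.refresh(myMutableDataStructure)`.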


hope this clarifies things!

Imran

On Tue, May 19, 2015 at 3:06 AM, N B <nb.nos...@gmail.com> wrote:

> Hi Imran,
>
> If I understood you correctly, you are suggesting to simply call broadcast
> again from the driver program. This is exactly what I am hoping will work
> as I have the Broadcast data wrapped up and I am indeed (re)broadcasting
> the wrapper over again when the underlying data changes. However,
> documentation seems to suggest that one cannot re-broadcast. Is my
> understanding accurate?
>
> Thanks
> NB
>
>
> On Mon, May 18, 2015 at 6:24 PM, Imran Rashid <iras...@cloudera.com>
> wrote:
>
>> Rather than "updating" the broadcast variable, can't you simply create a
>> new one?  When the old one can be gc'ed in your program, it will also get
>> gc'ed from spark's cache (and all executors).
>>
>> I think this will make your code *slightly* more complicated, as you need
>> to add in another layer of indirection for which broadcast variable to use,
>> but not too bad.  Eg., from
>>
>> var myBroadcast = sc.broadcast( ...)
>> (0 to 20).foreach{ iteration =>
>>   //  ... some rdd operations that involve myBroadcast ...
>>   myBroadcast.update(...) // wrong! don't update a broadcast variable
>> }
>>
>> instead do something like:
>>
>> def oneIteration(myRDD: RDD[...], myBroadcastVar: Broadcast[...]): Unit =
>> {
>>  ...
>> }
>>
>> var myBroadcast = sc.broadcast(...)
>> (0 to 20).foreach { iteration =>
>>   oneIteration(myRDD, myBroadcast)
>>   myBroadcast = sc.broadcast(...) // re-assign (no `var` here, or you'd just
>> shadow the outer one) to a NEW broadcast, with whatever you need to update it
>> }
>>
>> On Sat, May 16, 2015 at 2:01 AM, N B <nb.nos...@gmail.com> wrote:
>>
>>> Thanks Ayan. Can we rebroadcast after updating in the driver?
>>>
>>> Thanks
>>> NB.
>>>
>>>
>>> On Fri, May 15, 2015 at 6:40 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>>> Hi
>>>>
>>>> Broadcast variables are shipped the first time they are accessed in a
>>>> transformation, to the executors used by that transformation. They will NOT
>>>> be updated subsequently, even if the value has changed. However, the new
>>>> value will be shipped to any new executor that comes into play after the
>>>> value has changed. So changing the value of a broadcast variable is not a
>>>> good idea, as it can create inconsistency within the cluster. From the
>>>> documentation:
>>>>
>>>>  In addition, the object v should not be modified after it is
>>>> broadcast in order to ensure that all nodes get the same value of the
>>>> broadcast variable
>>>>
>>>>
>>>> On Sat, May 16, 2015 at 10:39 AM, N B <nb.nos...@gmail.com> wrote:
>>>>
>>>>> Thanks Ilya. Does one have to call broadcast again once the underlying
>>>>> data is updated in order to get the changes visible on all nodes?
>>>>>
>>>>> Thanks
>>>>> NB
>>>>>
>>>>>
>>>>> On Fri, May 15, 2015 at 5:29 PM, Ilya Ganelin <ilgan...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> The broadcast variable is like a pointer. If the underlying data
>>>>>> changes then the changes will be visible throughout the cluster.
>>>>>> On Fri, May 15, 2015 at 5:18 PM NB <nb.nos...@gmail.com> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> Once a broadcast variable is created using sparkContext.broadcast(),
>>>>>>> can it
>>>>>>> ever be updated again? The use case is for something like the
>>>>>>> underlying
>>>>>>> lookup data changing over time.
>>>>>>>
>>>>>>> Thanks
>>>>>>> NB
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> View this message in context:
>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Broadcast-variables-can-be-rebroadcast-tp22908.html
>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>> Nabble.com.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha
>>>>
>>>
>>>
>>
>
