I am pasting some of the exchanges I had on this topic on the mailing list directly here, so it may help someone else too. (I don't know why those responses don't show up here.)
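The short answer, for anyone skimming: you cannot mutate a broadcast variable in place, but you can create a brand-new broadcast from the driver whenever the underlying data changes, and have your code reach the data through a var. A minimal spark-shell-style sketch of that, assuming a live SparkContext sc -- loadLookupData() here is just a stand-in for however you build your lookup data:

import org.apache.spark.broadcast.Broadcast

var lookupBC: Broadcast[Map[String, String]] = sc.broadcast(loadLookupData())

def refreshLookup(): Unit = {
  lookupBC.unpersist()                      // optional: proactively free the old blocks
  lookupBC = sc.broadcast(loadLookupData()) // ship a NEW broadcast with the new data
}

The full thread, with Imran's more complete example, follows.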
-----------------------------------

Thanks Imran. It does help clarify. I believe I had it right all along, then, but was confused by the documentation talking about never changing broadcast variables. I've tried it in a local-mode process so far and it does seem to work as intended. When (and if!) we start running on a real cluster, I hope this holds up.

Thanks
NB

On Tue, May 19, 2015 at 6:25 AM, Imran Rashid <iras...@cloudera.com> wrote:

> hmm, I guess it depends on the way you look at it. In a way, I'm saying
> that Spark does *not* have any built-in "auto-re-broadcast" if you try to
> mutate a broadcast variable. Instead, you should create something new and
> just broadcast it separately. Then just have all the code that operates
> on your RDDs look at the new broadcast variable.
>
> But I guess there is another way to look at it -- you are creating new
> broadcast variables each time, but they all point to the same underlying
> mutable data structure. So in a way, you are "rebroadcasting" the same
> underlying data structure.
>
> Let me expand my example from earlier a little bit more:
>
> def oneIteration(myRDD: RDD[...], myBroadcastVar: Broadcast[...]): Unit = {
>   ...
> }
>
> // this is a val, because the data structure itself is mutable
> val myMutableDataStructure = ...
> // this is a var, because you will create new broadcasts
> var myBroadcast = sc.broadcast(myMutableDataStructure)
> (0 to 20).foreach { iteration =>
>   oneIteration(myRDD, myBroadcast)
>
>   // update your mutable data structure in place ...
>   myMutableDataStructure.update(...)
>   // ... but that doesn't affect the broadcast variables living out on
>   // the cluster, so we need to create a new one
>
>   // this line is not required -- the broadcast var will automatically
>   // get unpersisted when a gc cleans up the old broadcast on the driver,
>   // but I'm including it here for completeness, in case you want to more
>   // proactively clean up old blocks if you are low on space
>   myBroadcast.unpersist()
>
>   // now we create a new broadcast which has the updated data from our
>   // mutable data structure
>   myBroadcast = sc.broadcast(myMutableDataStructure)
> }
>
> hope this clarifies things!
>
> Imran
>
> On Tue, May 19, 2015 at 3:06 AM, N B <nb.nos...@gmail.com> wrote:
>
>> Hi Imran,
>>
>> If I understood you correctly, you are suggesting to simply call
>> broadcast again from the driver program. This is exactly what I am
>> hoping will work, as I have the broadcast data wrapped up and I am
>> indeed (re)broadcasting the wrapper over again when the underlying data
>> changes. However, the documentation seems to suggest that one cannot
>> re-broadcast. Is my understanding accurate?
>>
>> Thanks
>> NB
>>
>> On Mon, May 18, 2015 at 6:24 PM, Imran Rashid <iras...@cloudera.com> wrote:
>>
>>> Rather than "updating" the broadcast variable, can't you simply create
>>> a new one? When the old one can be gc'ed in your program, it will also
>>> get gc'ed from Spark's cache (and all executors).
>>>
>>> I think this will make your code *slightly* more complicated, as you
>>> need to add another layer of indirection for which broadcast variable
>>> to use, but not too bad. E.g., instead of
>>>
>>> var myBroadcast = sc.broadcast(...)
>>> (0 to 20).foreach { iteration =>
>>>   // ... some rdd operations that involve myBroadcast ...
>>>   myBroadcast.update(...) // wrong! don't update a broadcast variable
>>> }
>>>
>>> instead do something like:
>>>
>>> def oneIteration(myRDD: RDD[...], myBroadcastVar: Broadcast[...]): Unit = {
>>>   ...
>>> }
>>>
>>> var myBroadcast = sc.broadcast(...)
>>> (0 to 20).foreach { iteration =>
>>>   oneIteration(myRDD, myBroadcast)
>>>   // create a NEW broadcast here, with whatever you need to update it
>>>   // (note: assign to the existing var; don't declare a new one)
>>>   myBroadcast = sc.broadcast(...)
>>> }
>>>
>>> On Sat, May 16, 2015 at 2:01 AM, N B <nb.nos...@gmail.com> wrote:
>>>
>>>> Thanks Ayan. Can we rebroadcast after updating in the driver?
>>>>
>>>> Thanks
>>>> NB.
>>>>
>>>> On Fri, May 15, 2015 at 6:40 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> Broadcast variables are shipped to the executors used by a
>>>>> transformation the first time they are accessed in that
>>>>> transformation. They will NOT be updated subsequently, even if the
>>>>> value has changed. However, the new value will be shipped to any new
>>>>> executor that comes into play after the value has changed. For this
>>>>> reason, changing the value of a broadcast variable is not a good
>>>>> idea, as it can create inconsistency within the cluster. From the
>>>>> documentation:
>>>>>
>>>>> In addition, the object v should not be modified after it is
>>>>> broadcast in order to ensure that all nodes get the same value of
>>>>> the broadcast variable.
>>>>>
>>>>> On Sat, May 16, 2015 at 10:39 AM, N B <nb.nos...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Ilya. Does one have to call broadcast again once the
>>>>>> underlying data is updated in order to get the changes visible on
>>>>>> all nodes?
>>>>>>
>>>>>> Thanks
>>>>>> NB
>>>>>>
>>>>>> On Fri, May 15, 2015 at 5:29 PM, Ilya Ganelin <ilgan...@gmail.com> wrote:
>>>>>>
>>>>>>> The broadcast variable is like a pointer. If the underlying data
>>>>>>> changes then the changes will be visible throughout the cluster.
>>>>>>>
>>>>>>> On Fri, May 15, 2015 at 5:18 PM NB <nb.nos...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> Once a broadcast variable is created using
>>>>>>>> sparkContext.broadcast(), can it ever be updated again? The use
>>>>>>>> case is for something like the underlying lookup data changing
>>>>>>>> over time.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> NB
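-----------------------------------

For completeness, here is roughly what the whole loop looks like end to end -- a self-contained, local-mode sketch of the pattern Imran describes above, which I have only tried in local mode so far. The app name, the lookup map, and the update logic are all made up for illustration, and I snapshot the mutable map with .toMap before each broadcast (slightly different from Imran's sketch, but it sidesteps the documentation's warning about modifying the value after it has been broadcast):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast
import scala.collection.mutable

object RebroadcastExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rebroadcast-example").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val myRDD = sc.parallelize(Seq("a", "b", "c"))

    // a val, because the data structure itself is mutable
    val lookup = mutable.Map("a" -> 1, "b" -> 2, "c" -> 3)
    // a var, because each refresh assigns a NEW broadcast to it
    var lookupBC: Broadcast[Map[String, Int]] = sc.broadcast(lookup.toMap)

    (0 to 20).foreach { iteration =>
      val bc = lookupBC // capture the current broadcast for this job's closure
      val total = myRDD.map(k => bc.value.getOrElse(k, 0)).sum()
      println(s"iteration $iteration: total = $total")

      // update the mutable structure in place on the driver ...
      lookup("a") = iteration
      // ... drop the stale broadcast blocks (optional, as Imran notes) ...
      lookupBC.unpersist()
      // ... and re-broadcast a fresh snapshot of the updated data
      lookupBC = sc.broadcast(lookup.toMap)
    }

    sc.stop()
  }
}

Capturing the current broadcast into a local val (bc) before using it in the closure makes it explicit which broadcast each job sees -- that is the "layer of indirection" Imran mentions.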