Yeah, after destroy, accessing the broadcast variable results in an error. Accessing it after it's unpersisted (on an executor) causes it to be rebroadcast.
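Concretely, the difference looks roughly like this. A minimal sketch, assuming an active SparkContext `sc`; the sample data is made up, and the exact error after destroy varies by Spark version:

==============
// A minimal sketch of unpersist vs. destroy, assuming an active SparkContext `sc`.
val bc = sc.broadcast(Map(1 -> "a", 2 -> "b"))
sc.parallelize(Seq(1, 2)).map(id => bc.value(id)).collect()  // ships the broadcast

bc.unpersist()  // removes executor copies; the driver still holds the value
sc.parallelize(Seq(1, 2)).map(id => bc.value(id)).collect()  // fine: rebroadcast on demand

bc.destroy()    // removes all state, on the executors and the driver
// bc.value     // would now throw org.apache.spark.SparkException
==============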
On Tue, Aug 30, 2016 at 5:12 PM, Jerry Lam <chiling...@gmail.com> wrote:
> Hi Sean,
>
> Thank you for explaining the difference between unpersist and destroy.
> Does that mean unpersist keeps the broadcast variable on the driver, whereas
> destroy deletes everything about the broadcast variable as if it had never
> existed?
>
> Best Regards,
>
> Jerry
>
>
> On Tue, Aug 30, 2016 at 11:58 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>> Yes, although there's a difference between unpersist and destroy, you'll
>> hit the same type of question either way. You do indeed have to reason
>> about when the broadcast variable is no longer needed in the face of lazy
>> evaluation, and that's hard.
>>
>> Sometimes it's obvious, and you can take advantage of that to proactively
>> free resources. You may have to consider restructuring the computation to
>> allow more resources to be freed, if this matters for scaling.
>>
>> Keep in mind that things that are computed and cached may be lost and
>> recomputed even after their parent RDDs were definitely already computed
>> and don't seem to be needed. This is why unpersist is often the better
>> call: it allows the variable to be rebroadcast if that happens. Destroy
>> permanently closes the broadcast.
>>
>> On Tue, Aug 30, 2016 at 4:43 PM, Jerry Lam <chiling...@gmail.com> wrote:
>> > Hi Sean,
>> >
>> > Thank you for the response. The only problem is that actively managing
>> > broadcast variables requires returning them to the caller whenever the
>> > function that creates them does not contain an action. That is, the
>> > scope that uses the broadcast variables often cannot destroy them. For
>> > example:
>> >
>> > ==============
>> > def performTransformation(rdd: RDD[Int]) = {
>> >   val sharedMap = sc.broadcast(map)
>> >   rdd.map { id =>
>> >     val localMap = sharedMap.value
>> >     (id, localMap(id))
>> >   }
>> > }
>> >
>> > def main = {
>> >   ....
>> >   performTransformation(rdd).toDF("id", "i").write.parquet("dummy_example")
>> > }
>> > ==============
>> >
>> > In the example above, we cannot destroy sharedMap before write.parquet
>> > executes, because the RDD is evaluated lazily. We will get an exception
>> > if we call sharedMap.destroy() like this:
>> >
>> > ==============
>> > def performTransformation(rdd: RDD[Int]) = {
>> >   val sharedMap = sc.broadcast(map)
>> >   val result = rdd.map { id =>
>> >     val localMap = sharedMap.value
>> >     (id, localMap(id))
>> >   }
>> >   sharedMap.destroy()
>> >   result
>> > }
>> > ==============
>> >
>> > Am I missing something? Is there a better way to do this without
>> > returning the broadcast variables to the main function?
>> >
>> > Best Regards,
>> >
>> > Jerry
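One way to restructure along the lines Sean describes is to run the action inside the scope that creates the broadcast, so that by the time destroy() executes, the job has already finished. A minimal sketch under that assumption; the name transformAndWrite and the lookup/path parameters are made up for illustration:

==============
// Sketch only: the action runs inside this scope, so destroy() afterwards is safe.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

def transformAndWrite(spark: SparkSession, rdd: RDD[Int],
                      lookup: Map[Int, String], path: String): Unit = {
  import spark.implicits._
  val sharedMap = spark.sparkContext.broadcast(lookup)
  try {
    rdd.map(id => (id, sharedMap.value(id)))
      .toDF("id", "i")
      .write.parquet(path)  // the action executes here...
  } finally {
    sharedMap.destroy()     // ...so the broadcast is no longer needed
  }
}
==============

If anything downstream could still recompute the result, unpersist() is the safer call, for the reason Sean gives above.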
>> > On Mon, Aug 29, 2016 at 12:11 PM, Sean Owen <so...@cloudera.com> wrote:
>> >>
>> >> Yes, you want to actively unpersist() or destroy() broadcast variables
>> >> when they're no longer needed. They can eventually be removed when the
>> >> reference on the driver is garbage-collected, but you usually would not
>> >> want to rely on that.
>> >>
>> >> On Mon, Aug 29, 2016 at 4:30 PM, Jerry Lam <chiling...@gmail.com> wrote:
>> >> > Hello Spark developers,
>> >> >
>> >> > Can anyone shed some light on the life cycle of broadcast variables?
>> >> > Basically, what if I have a broadcast variable defined in a loop, with
>> >> > a different value for each iteration? For example:
>> >> >
>> >> > ==============
>> >> > for (i <- 1 to 10) {
>> >> >   val bc = sc.broadcast(i)
>> >> >   sc.parallelize(Seq(1, 2, 3))
>> >> >     .map { id => val i = bc.value; (id, i) }
>> >> >     .toDF("id", "i")
>> >> >     .write.parquet("/dummy_output")
>> >> > }
>> >> > ==============
>> >> >
>> >> > Do I need to actively manage the broadcast variable in this case? I
>> >> > know this example is not realistic, but please imagine the broadcast
>> >> > variable holding an array of 1M Longs.
>> >> >
>> >> > Regards,
>> >> >
>> >> > Jerry
>> >> >
>> >> >
>> >> > On Sun, Aug 21, 2016 at 1:07 PM, Jerry Lam <chiling...@gmail.com> wrote:
>> >> >>
>> >> >> Hello Spark developers,
>> >> >>
>> >> >> Can someone explain to me the lifecycle of a broadcast variable? When
>> >> >> will a broadcast variable be garbage-collected on the driver side and
>> >> >> on the executor side? Does a Spark application need to actively manage
>> >> >> its broadcast variables to ensure that it does not run into OOM?
>> >> >>
>> >> >> Best Regards,
>> >> >>
>> >> >> Jerry
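As for the loop example: write.parquet is an action, so each iteration's job has completed before the loop moves on, and the per-iteration broadcast can be freed right there rather than waiting for driver-side garbage collection. A minimal sketch, assuming an active SparkSession `spark`; the per-iteration output paths are made up so the writes don't collide:

==============
// Sketch: free each iteration's broadcast once its job has finished.
// Assumes an active SparkSession `spark`; the output paths are illustrative.
import spark.implicits._
val sc = spark.sparkContext

for (i <- 1 to 10) {
  val bc = sc.broadcast(i)  // imagine this holding an array of 1M Longs
  sc.parallelize(Seq(1, 2, 3))
    .map(id => (id, bc.value))
    .toDF("id", "i")
    .write.parquet(s"/dummy_output/i=$i")  // action: the job completes here
  bc.unpersist()  // or destroy(), since nothing recomputes this RDD later
}
==============

Without the unpersist, the ten broadcasts only go away once the driver-side references are garbage-collected, which is exactly the behavior Sean advises against relying on.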