Here's what I did:

    print 'BROADCASTING...'
    broadcastVar = sc.broadcast(mylist)
    print broadcastVar
    print broadcastVar.value
    print 'FINISHED BROADCASTING...'
The above works fine, but when I call myrdd.map(myfunc) I get *NameError: global name 'broadcastVar' is not defined*. The myfunc function is in a different module. How do I make it aware of broadcastVar?

On Wed, Apr 22, 2015 at 2:13 PM, Vadim Bichutskiy <vadim.bichuts...@gmail.com> wrote:

> Great. Will try to modify the code. Always room to optimize!
>
> On Wed, Apr 22, 2015 at 2:11 PM, Tathagata Das <t...@databricks.com> wrote:
>
>> Absolutely. The same code would work for local as well as distributed mode!
>>
>> On Wed, Apr 22, 2015 at 11:08 AM, Vadim Bichutskiy <vadim.bichuts...@gmail.com> wrote:
>>
>>> Can I use broadcast vars in local mode?
>>>
>>> On Wed, Apr 22, 2015 at 2:06 PM, Tathagata Das <t...@databricks.com> wrote:
>>>
>>>> Yep. Not efficient. Pretty bad, actually. That's why broadcast variables were introduced right at the very beginning of Spark.
>>>>
>>>> On Wed, Apr 22, 2015 at 10:58 AM, Vadim Bichutskiy <vadim.bichuts...@gmail.com> wrote:
>>>>
>>>>> Thanks TD. I was looking into broadcast variables.
>>>>>
>>>>> Right now I am running it locally...and I plan to move it to "production" on EC2.
>>>>>
>>>>> The way I fixed it is by doing myrdd.map(lambda x: (x, mylist)).map(myfunc), but I don't think it's efficient?
>>>>>
>>>>> mylist is filled only once at the start and never changes.
>>>>>
>>>>> Vadim
>>>>>
>>>>> On Wed, Apr 22, 2015 at 1:42 PM, Tathagata Das <t...@databricks.com> wrote:
>>>>>
>>>>>> Is mylist present on every executor? If not, then you have to pass it on, and broadcasts are the best way to do that. But note that once broadcast, it will be immutable at the executors; if you update the list at the driver, you will have to broadcast it again.
>>>>>>
>>>>>> TD
>>>>>>
>>>>>> On Wed, Apr 22, 2015 at 9:28 AM, Vadim Bichutskiy <vadim.bichuts...@gmail.com> wrote:
>>>>>>
>>>>>>> I am using Spark Streaming with Python. For each RDD, I call a map, i.e., myrdd.map(myfunc); myfunc is in a separate Python module. In yet another separate Python module I have a global list, i.e. mylist, that's populated with metadata. I can't get myfunc to see mylist...it's always empty. Alternatively, I guess I could pass mylist to map.
>>>>>>>
>>>>>>> Any suggestions?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Vadim
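[A sketch of the usual fix for the NameError at the top of this thread. Assumptions: `myfunc`, `mylist`, and `broadcastVar` are the names from the thread, and no live SparkContext is available here, so the driver-side calls are shown as comments. The idea is to stop relying on a module-level global (which only exists in the driver's module, not on the executors) and instead pass the Broadcast handle to myfunc as an explicit argument, binding it at the call site with a lambda:]

```python
# mymodule.py (sketch): myfunc takes the broadcast handle as an explicit
# parameter instead of reaching for a global that executors never see.
def myfunc(x, bv):
    # bv.value is the broadcast mylist, fetched/cached on whichever
    # executor runs this task
    return (x, len(bv.value))

# Driver side (sketch, not runnable without a SparkContext):
#   broadcastVar = sc.broadcast(mylist)
#   result = myrdd.map(lambda x: myfunc(x, broadcastVar)).collect()
# The lambda closes over broadcastVar, so the serialized closure carries the
# small Broadcast handle to the workers rather than the whole list per task.
```

[Note the argument is the Broadcast object itself, not broadcastVar.value: the value is fetched and cached per executor, whereas closing over broadcastVar.value (or the earlier myrdd.map(lambda x: (x, mylist)) workaround) ships the full list inside every task's closure.]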