Can I use broadcast vars in local mode?

On Wed, Apr 22, 2015 at 2:06 PM, Tathagata Das <t...@databricks.com> wrote:
> Yep. Not efficient. Pretty bad actually. That's why broadcast variables
> were introduced right at the very beginning of Spark.
>
> On Wed, Apr 22, 2015 at 10:58 AM, Vadim Bichutskiy <vadim.bichuts...@gmail.com> wrote:
>
>> Thanks TD. I was looking into broadcast variables.
>>
>> Right now I am running it locally...and I plan to move it to "production"
>> on EC2.
>>
>> The way I fixed it is by doing
>> myrdd.map(lambda x: (x, mylist)).map(myfunc)
>> but I don't think it's efficient?
>>
>> mylist is filled only once at the start and never changes.
>>
>> Vadim
>>
>> On Wed, Apr 22, 2015 at 1:42 PM, Tathagata Das <t...@databricks.com> wrote:
>>
>>> Is mylist present on every executor? If not, then you have to pass
>>> it on. And broadcasts are the best way to pass it on. But note that once
>>> broadcast it will be immutable at the executors, and if you update the
>>> list at the driver, you will have to broadcast it again.
>>>
>>> TD
>>>
>>> On Wed, Apr 22, 2015 at 9:28 AM, Vadim Bichutskiy <vadim.bichuts...@gmail.com> wrote:
>>>
>>>> I am using Spark Streaming with Python. For each RDD, I call a map,
>>>> i.e., myrdd.map(myfunc), where myfunc is in a separate Python module.
>>>> In yet another separate Python module I have a global list, i.e. mylist,
>>>> that's populated with metadata. I can't get myfunc to see mylist...it's
>>>> always empty. Alternatively, I guess I could pass mylist to map.
>>>>
>>>> Any suggestions?
>>>>
>>>> Thanks,
>>>> Vadim
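
For reference, a minimal sketch of the broadcast approach suggested in the thread, using a plain RDD rather than a DStream. The body of myfunc, the sample data, the helper names tag_with_broadcast and demo_local, and the "local[2]" master are all illustrative assumptions, not taken from the thread. As far as I know, sc.broadcast behaves the same way under a local master as on a cluster, so the same code should carry over to EC2 unchanged.

```python
# Sketch only: helper names, sample data and the "local[2]" master are
# assumptions made for illustration; they do not come from the thread.


def myfunc(record, mylist):
    """Pair a record with a flag saying whether it appears in mylist."""
    return (record, record in mylist)


def tag_with_broadcast(sc, records, mylist):
    """Broadcast mylist once, then read it inside map() via .value."""
    bc_list = sc.broadcast(mylist)  # read-only on the executors
    rdd = sc.parallelize(records)
    # Reading bc_list.value inside the lambda avoids attaching the whole
    # list to every record, unlike map(lambda x: (x, mylist)).map(myfunc).
    return rdd.map(lambda x: myfunc(x, bc_list.value)).collect()


def demo_local():
    """Run the sketch in local mode; needs pyspark installed."""
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "broadcast-demo")
    try:
        return tag_with_broadcast(sc, ["a", "b", "z"], ["a", "b"])
    finally:
        sc.stop()
```

Calling demo_local() runs the job against a local master. As TD notes above, the broadcast value is immutable on the executors, so if mylist changes on the driver it would have to be broadcast again and the new handle used in later map calls.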