Here's what I did:

print 'BROADCASTING...'
broadcastVar = sc.broadcast(mylist)  # ship mylist to the executors once
print broadcastVar                   # the Broadcast handle
print broadcastVar.value             # the underlying list
print 'FINISHED BROADCASTING...'

The above works fine, but when I call myrdd.map(myfunc) I get *NameError:
global name 'broadcastVar' is not defined*.

The myfunc function is in a different module. How do I make it aware of
broadcastVar?
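
Would passing the broadcast handle into myfunc explicitly work, instead of
relying on a global? A rough sketch of what I mean (the extra bc parameter
on myfunc is hypothetical, not my current signature):

# mymodule.py
def myfunc(record, bc):
    metadata = bc.value  # fetches the broadcast list on the executor
    # ... process record using metadata ...
    return record

# driver side
from mymodule import myfunc
broadcastVar = sc.broadcast(mylist)
result = myrdd.map(lambda x: myfunc(x, broadcastVar))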

On Wed, Apr 22, 2015 at 2:13 PM, Vadim Bichutskiy <
vadim.bichuts...@gmail.com> wrote:

> Great. Will try to modify the code. Always room to optimize!
>
> On Wed, Apr 22, 2015 at 2:11 PM, Tathagata Das <t...@databricks.com>
> wrote:
>
>> Absolutely. The same code would work for local as well as distributed
>> mode!
>>
>> On Wed, Apr 22, 2015 at 11:08 AM, Vadim Bichutskiy <
>> vadim.bichuts...@gmail.com> wrote:
>>
>>> Can I use broadcast vars in local mode?
>>>
>>> On Wed, Apr 22, 2015 at 2:06 PM, Tathagata Das <t...@databricks.com>
>>> wrote:
>>>
>>>> Yep. Not efficient. Pretty bad, actually. That's why broadcast variables
>>>> were introduced right at the very beginning of Spark.
>>>>
>>>> On Wed, Apr 22, 2015 at 10:58 AM, Vadim Bichutskiy <
>>>> vadim.bichuts...@gmail.com> wrote:
>>>>
>>>>> Thanks TD. I was looking into broadcast variables.
>>>>>
>>>>> Right now I am running it locally...and I plan to move it to
>>>>> "production" on EC2.
>>>>>
>>>>> The way I fixed it is by doing
>>>>> myrdd.map(lambda x: (x, mylist)).map(myfunc), but I don't think that's
>>>>> efficient.
>>>>>
>>>>> mylist is filled only once at the start and never changes.
>>>>>
>>>>> Vadim
>>>>>
>>>>> On Wed, Apr 22, 2015 at 1:42 PM, Tathagata Das <t...@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> Is mylist present on every executor? If not, then you have to pass it
>>>>>> on, and broadcasts are the best way to do that. But note that once
>>>>>> broadcast, it is immutable at the executors; if you update the list at
>>>>>> the driver, you will have to broadcast it again.
>>>>>>
>>>>>> TD
>>>>>>
>>>>>> On Wed, Apr 22, 2015 at 9:28 AM, Vadim Bichutskiy <
>>>>>> vadim.bichuts...@gmail.com> wrote:
>>>>>>
>>>>>>> I am using Spark Streaming with Python. For each RDD, I call a map,
>>>>>>> i.e., myrdd.map(myfunc), where myfunc is in a separate Python module.
>>>>>>> In yet another Python module I have a global list, mylist, that's
>>>>>>> populated with metadata. I can't get myfunc to see mylist...it's
>>>>>>> always empty. Alternatively, I guess I could pass mylist to map.
>>>>>>>
>>>>>>> Any suggestions?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Vadim
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
