Thanks Ilya. I am having trouble doing that. Can you give me an example?

On Thu, Apr 23, 2015 at 12:06 PM, Ganelin, Ilya <ilya.gane...@capitalone.com
> wrote:

>  You need to expose that variable the same way you'd expose any other
> variable in Python that you wanted to see across modules. As long as you
> share a spark context all will work as expected.
>
>
> http://stackoverflow.com/questions/142545/python-how-to-make-a-cross-module-variable
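A minimal sketch of that cross-module pattern, runnable without Spark (the module names `shared_state` and `worker` are illustrative, and the two "files" are simulated in one script; in real PySpark code `broadcast_var` would hold the result of `sc.broadcast(mylist)`):

```python
import sys
import types

# --- shared_state.py: a dedicated module that holds the cross-module variable ---
shared_state = types.ModuleType("shared_state")
shared_state.broadcast_var = None  # set once by the driver script
sys.modules["shared_state"] = shared_state

# --- worker.py: looks the variable up at call time, via the shared module ---
worker_src = """
import shared_state

def myfunc(x):
    # read the shared variable through the module, not a bare global
    return (x, shared_state.broadcast_var)
"""
worker = types.ModuleType("worker")
exec(worker_src, worker.__dict__)
sys.modules["worker"] = worker

# --- main.py: set the shared variable, then call into the worker module ---
shared_state.broadcast_var = ["meta1", "meta2"]
print(worker.myfunc(42))  # (42, ['meta1', 'meta2'])
```

The key point is that `myfunc` never references a bare global name; it always goes through `shared_state`, so whichever module populated the variable last is what it sees.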
>
>
>
> Sent with Good (www.good.com)
>
>
>
> -----Original Message-----
> *From: *Vadim Bichutskiy [vadim.bichuts...@gmail.com]
> *Sent: *Thursday, April 23, 2015 12:00 PM Eastern Standard Time
> *To: *Tathagata Das
> *Cc: *user@spark.apache.org
> *Subject: *Re: Map Question
>
> Here it is. How do I access a broadcastVar in a function that's in another
> module (process_stuff.py below):
>
> Thanks,
> Vadim
>
>  main.py
> -------
>
> from pyspark import SparkContext, SparkConf
> from pyspark.streaming import StreamingContext
> from pyspark.sql import SQLContext
> from process_stuff import myfunc
> from metadata import get_metadata
>
> conf = SparkConf().setAppName('My App').setMaster('local[4]')
> sc = SparkContext(conf=conf)
> ssc = StreamingContext(sc, 30)
> sqlContext = SQLContext(sc)
>
> distFile = ssc.textFileStream("s3n://...")
>
> distFile.foreachRDD(process)
>
> mylist = get_metadata()
>
> print 'BROADCASTING...'
> broadcastVar = sc.broadcast(mylist)
> print broadcastVar
> print broadcastVar.value
> print 'FINISHED BROADCASTING...'
>
> ## mylist and broadcastVar, broadcastVar.value print fine
>
> def getSqlContextInstance(sparkContext):
>     if 'sqlContextSingletonInstance' not in globals():
>         globals()['sqlContextSingletonInstance'] = SQLContext(sparkContext)
>     return globals()['sqlContextSingletonInstance']
>
> def process(rdd):
>     sqlContext = getSqlContextInstance(rdd.context)
>     if rdd.take(1):
>         jsondf = sqlContext.jsonRDD(rdd)
>         # jsondf.printSchema()
>         jsondf.registerTempTable('mytable')
>         stuff = sqlContext.sql("SELECT ...")
>         stuff_mapped = stuff.map(myfunc)  # I want myfunc to see mylist from above
>
> ...
>
> process_stuff.py
> ----------------------
>
> def myfunc(x):
>     metadata = broadcastVar.value  # NameError: broadcastVar not found -- HOW TO FIX?
>     ...
>
>
> metadata.py
> ----------------
>
> def get_metadata():
>     ...
>     return mylist
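One way to clear the NameError (a sketch, not the only option): have `myfunc` take the broadcast variable as an explicit argument, and bind it at the call site with a lambda. `FakeBroadcast` below is a stand-in so the pattern runs without Spark; a real pyspark `Broadcast` object also exposes `.value`.

```python
# process_stuff.py: myfunc receives the broadcast variable as an argument
def myfunc(x, bvar):
    metadata = bvar.value  # works for any object exposing .value
    return (x, metadata)

# Stand-in for sc.broadcast(...); pyspark's Broadcast also has .value
class FakeBroadcast(object):
    def __init__(self, value):
        self.value = value

broadcastVar = FakeBroadcast(['meta1', 'meta2'])

# main.py: bind broadcastVar at the call site; with Spark this would be
#   stuff_mapped = stuff.map(lambda x: myfunc(x, broadcastVar))
rows = ['row1', 'row2']
stuff_mapped = [myfunc(x, broadcastVar) for x in rows]
print(stuff_mapped)
# [('row1', ['meta1', 'meta2']), ('row2', ['meta1', 'meta2'])]
```

Because the lambda closes over the `Broadcast` object (not its value), Spark ships only the lightweight handle with the task, and each executor reads the broadcast copy of the list.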
>
> On Wed, Apr 22, 2015 at 6:47 PM, Tathagata Das <t...@databricks.com>
> wrote:
>
>> Can you give full code? especially the myfunc?
>>
>> On Wed, Apr 22, 2015 at 2:20 PM, Vadim Bichutskiy <
>> vadim.bichuts...@gmail.com> wrote:
>>
>>> Here's what I did:
>>>
>>>  print 'BROADCASTING...'
>>> broadcastVar = sc.broadcast(mylist)
>>> print broadcastVar
>>> print broadcastVar.value
>>> print 'FINISHED BROADCASTING...'
>>>
>>> The above works fine,
>>>
>>> but when I call myrdd.map(myfunc) I get *NameError: global name
>>> 'broadcastVar' is not defined*
>>>
>>>  The myfunc function is in a different module. How do I make it aware
>>> of broadcastVar?
>>>
>>> On Wed, Apr 22, 2015 at 2:13 PM, Vadim Bichutskiy <
>>> vadim.bichuts...@gmail.com> wrote:
>>>
>>>> Great. Will try to modify the code. Always room to optimize!
>>>>
>>>>  On Wed, Apr 22, 2015 at 2:11 PM, Tathagata Das <t...@databricks.com>
>>>> wrote:
>>>>
>>>>> Absolutely. The same code would work for local as well as distributed
>>>>> mode!
>>>>>
>>>>> On Wed, Apr 22, 2015 at 11:08 AM, Vadim Bichutskiy <
>>>>> vadim.bichuts...@gmail.com> wrote:
>>>>>
>>>>>> Can I use broadcast vars in local mode?
>>>>>>
>>>>>> On Wed, Apr 22, 2015 at 2:06 PM, Tathagata Das <t...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Yep. Not efficient. Pretty bad actually. That's why broadcast
>>>>>>> variable were introduced right at the very beginning of Spark.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Apr 22, 2015 at 10:58 AM, Vadim Bichutskiy <
>>>>>>> vadim.bichuts...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks TD. I was looking into broadcast variables.
>>>>>>>>
>>>>>>>> Right now I am running it locally...and I plan to move it to
>>>>>>>> "production" on EC2.
>>>>>>>>
>>>>>>>> The way I fixed it is by doing myrdd.map(lambda x: (x,
>>>>>>>> mylist)).map(myfunc) but I don't think it's efficient?
>>>>>>>>
>>>>>>>> mylist is filled only once at the start and never changes.
>>>>>>>>
>>>>>>>> Vadim
>>>>>>>>
>>>>>>>> On Wed, Apr 22, 2015 at 1:42 PM, Tathagata Das <t...@databricks.com
>>>>>>>> > wrote:
>>>>>>>>
>>>>>>>>>  Is mylist present on every executor? If not, then you have
>>>>>>>>> to pass it on, and broadcasts are the best way to do that. But note
>>>>>>>>> that once broadcast, it will be immutable on the executors; if you update
>>>>>>>>> the list at the driver, you will have to broadcast it again.
>>>>>>>>>
>>>>>>>>> TD
>>>>>>>>>
>>>>>>>>> On Wed, Apr 22, 2015 at 9:28 AM, Vadim Bichutskiy <
>>>>>>>>> vadim.bichuts...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I am using Spark Streaming with Python. For each RDD, I call a
>>>>>>>>>> map, i.e., myrdd.map(myfunc), myfunc is in a separate Python module. 
>>>>>>>>>> In yet
>>>>>>>>>> another separate Python module I have a global list, i.e. mylist,
>>>>>>>>>> that's populated with metadata. I can't get myfunc to see 
>>>>>>>>>> mylist...it's
>>>>>>>>>> always empty. Alternatively, I guess I could pass mylist to map.
>>>>>>>>>>
>>>>>>>>>> Any suggestions?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Vadim
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
