You need to expose that variable the same way you'd expose any other variable 
in Python that you want to be visible across modules. As long as both modules 
share the same SparkContext, it will work as expected.

http://stackoverflow.com/questions/142545/python-how-to-make-a-cross-module-variable
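
For example, a minimal sketch of that cross-module pattern (the holder module 
name shared.py is hypothetical):

shared.py
---------

# Tiny holder module that both main.py and process_stuff.py import.
broadcastVar = None

main.py
-------

import shared
shared.broadcastVar = sc.broadcast(mylist)  # assign once on the driver

process_stuff.py
----------------

import shared

def myfunc(x):
    metadata = shared.broadcastVar.value  # looked up via the shared module
    ...

One caveat: executors re-import your modules, so shared.broadcastVar can still 
be None there. The more robust route is to pass the handle into the function 
explicitly, e.g. stuff.map(lambda x: myfunc(x, broadcastVar)) with 
def myfunc(x, bv): metadata = bv.value.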





-----Original Message-----
From: Vadim Bichutskiy 
[vadim.bichuts...@gmail.com]
Sent: Thursday, April 23, 2015 12:00 PM Eastern Standard Time
To: Tathagata Das
Cc: user@spark.apache.org
Subject: Re: Map Question

Here it is. How do I access broadcastVar in a function that lives in another 
module (process_stuff.py below)?

Thanks,
Vadim

main.py
-------

from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.sql import SQLContext
from process_stuff import myfunc
from metadata import get_metadata

conf = SparkConf().setAppName('My App').setMaster('local[4]')
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 30)
sqlContext = SQLContext(sc)

distFile = ssc.textFileStream("s3n://...")

mylist = get_metadata()

print 'BROADCASTING...'
broadcastVar = sc.broadcast(mylist)
print broadcastVar
print broadcastVar.value
print 'FINISHED BROADCASTING...'

## mylist, broadcastVar, and broadcastVar.value all print fine

def getSqlContextInstance(sparkContext):
    # Lazily create a single SQLContext and reuse it across batches.
    if 'sqlContextSingletonInstance' not in globals():
        globals()['sqlContextSingletonInstance'] = SQLContext(sparkContext)
    return globals()['sqlContextSingletonInstance']

def process(rdd):
    sqlContext = getSqlContextInstance(rdd.context)
    if rdd.take(1):  # skip empty batches
        jsondf = sqlContext.jsonRDD(rdd)
        #jsondf.printSchema()
        jsondf.registerTempTable('mytable')
        stuff = sqlContext.sql("SELECT ...")
        stuff_mapped = stuff.map(myfunc)  # I want myfunc to see mylist from above -- how?
        ...

distFile.foreachRDD(process)  # registered after process is defined, so the name resolves

process_stuff.py
----------------------

def myfunc(x):
    metadata = broadcastVar.value  # NameError: broadcastVar not found -- HOW TO FIX?
    ...


metadata.py
----------------

def get_metadata():
    ...
    return mylist


On Wed, Apr 22, 2015 at 6:47 PM, Tathagata Das 
<t...@databricks.com> wrote:
Can you share the full code, especially myfunc?

On Wed, Apr 22, 2015 at 2:20 PM, Vadim Bichutskiy 
<vadim.bichuts...@gmail.com> wrote:
Here's what I did:

print 'BROADCASTING...'
broadcastVar = sc.broadcast(mylist)
print broadcastVar
print broadcastVar.value
print 'FINISHED BROADCASTING...'

The above works fine, but when I call myrdd.map(myfunc) I get NameError: 
global name 'broadcastVar' is not defined.

The myfunc function is in a different module. How do I make it aware of 
broadcastVar?

On Wed, Apr 22, 2015 at 2:13 PM, Vadim Bichutskiy 
<vadim.bichuts...@gmail.com> wrote:
Great. Will try to modify the code. Always room to optimize!

On Wed, Apr 22, 2015 at 2:11 PM, Tathagata Das 
<t...@databricks.com> wrote:
Absolutely. The same code works in local as well as distributed mode!

On Wed, Apr 22, 2015 at 11:08 AM, Vadim Bichutskiy 
<vadim.bichuts...@gmail.com> wrote:
Can I use broadcast vars in local mode?

On Wed, Apr 22, 2015 at 2:06 PM, Tathagata Das 
<t...@databricks.com> wrote:
Yep. Not efficient. Pretty bad, actually. That's why broadcast variables were 
introduced right at the very beginning of Spark.



On Wed, Apr 22, 2015 at 10:58 AM, Vadim Bichutskiy 
<vadim.bichuts...@gmail.com> wrote:
Thanks TD. I was looking into broadcast variables.

Right now I am running it locally...and I plan to move it to "production" on 
EC2.

The way I fixed it is by doing myrdd.map(lambda x: (x, mylist)).map(myfunc), 
but I don't think that's efficient.

mylist is filled only once at the start and never changes.

Vadim

On Wed, Apr 22, 2015 at 1:42 PM, Tathagata Das 
<t...@databricks.com> wrote:
Is mylist present on every executor? If not, you have to ship it there, and 
broadcasts are the best way to do that. But note that once broadcast, the 
value is immutable on the executors; if you update the list at the driver, 
you will have to broadcast it again.
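
A rough sketch of that update cycle (new_item is a placeholder; 
Broadcast.unpersist() is available in newer PySpark releases):

broadcastVar = sc.broadcast(mylist)   # executors see a snapshot of mylist

mylist.append(new_item)               # driver-side change; executors do NOT see it
broadcastVar.unpersist()              # optionally drop the stale copies from the executors
broadcastVar = sc.broadcast(mylist)   # broadcast again and use the new handle from here on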

TD

On Wed, Apr 22, 2015 at 9:28 AM, Vadim Bichutskiy 
<vadim.bichuts...@gmail.com> wrote:
I am using Spark Streaming with Python. For each RDD I call map, i.e., 
myrdd.map(myfunc); myfunc lives in a separate Python module. In yet another 
Python module I have a global list, mylist, that's populated with metadata. 
I can't get myfunc to see mylist; it's always empty. Alternatively, I guess 
I could pass mylist to map.
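
A sketch of that alternative (it works, but the whole list is serialized into 
every task):

myrdd.map(lambda x: myfunc(x, mylist))  # mylist rides along in each task's closure

with myfunc changed to take the extra argument:

def myfunc(x, mylist):
    ...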

Any suggestions?

Thanks,
Vadim








