I am currently using PySpark with a virtualenv.
Unfortunately I don't have access to the nodes' file system, so I cannot
manually copy the virtual env over there.

I have been using this technique:

I first add a tarball containing the venv:
    sc.addFile(virtual_env_tarball_file)
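
Concretely, the driver side looks roughly like this (paths and names are
illustrative, and the venv is assumed to live in ./venv and to be relocatable
on the workers):

    import subprocess
    from pyspark import SparkContext

    venv_name = "venv"
    virtual_env_tarball_file = venv_name + ".tar.gz"

    # Package the virtualenv directory into a tarball.
    subprocess.check_call(["tar", "czf", virtual_env_tarball_file, venv_name])

    # Ship the tarball to every node's SparkFiles directory.
    sc = SparkContext(appName="venv-example")
    sc.addFile(virtual_env_tarball_file)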

Then, in the code that runs on the node to do the computation, I activate the
venv like this:
    from pyspark import SparkFiles

    venv_location = SparkFiles.get(venv_name)
    activate_env = "%s/bin/activate_this.py" % venv_location
    execfile(activate_env, dict(__file__=activate_env))
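
In case it helps, that snippet lives inside the function I ship to the
executors, roughly like this (process_partition and the numbers are just
placeholders, and I've left out unpacking the tarball on the worker):

    from pyspark import SparkFiles

    venv_name = "venv"  # same directory name used when the tarball was built

    def process_partition(rows):
        # Runs on the executor; assumes the shipped venv is already unpacked
        # under the SparkFiles root.
        venv_location = SparkFiles.get(venv_name)
        activate_env = "%s/bin/activate_this.py" % venv_location
        execfile(activate_env, dict(__file__=activate_env))
        # Imports from here on resolve against the venv's site-packages.
        import numpy
        return [numpy.sqrt(x) for x in rows]

    result = sc.parallelize(range(100)).mapPartitions(process_partition).collect()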

Is there a better way to do this? 
One of the problems with this approach is that in
spark/python/pyspark/statcounter.py, numpy is imported
before the venv is activated, which can cause conflicts with the venv's
numpy.

Moreover, this requires the venv to be shipped around the cluster every
time.
Any suggestions?



