I am currently using pyspark with a virtualenv. Unfortunately, I don't have access to the nodes' file system, so I cannot manually copy the virtualenv over there.
I have been using this technique: I first ship a tarball of the venv with

    sc.addFile(virtual_env_tarball_file)

Then, in the code that runs on the nodes to do the computation, I activate the venv like this:

    venv_location = SparkFiles.get(venv_name)
    activate_env = "%s/bin/activate_this.py" % venv_location
    execfile(activate_env, dict(__file__=activate_env))

Is there a better way to do this? One problem with this approach is that in spark/python/pyspark/statcounter.py, numpy is imported before the venv is activated, which can cause conflicts with the venv's numpy. Moreover, this requires the venv to be shipped around the cluster every time. Any suggestions?
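For concreteness, here is a minimal end-to-end sketch of what I'm doing. The names (venv.tar.gz, the top-level "venv" directory it unpacks to) are illustrative, and it assumes Python 2, where execfile is a builtin; under Python 3 you would use exec(open(path).read(), ...) instead:

    from pyspark import SparkContext, SparkFiles

    sc = SparkContext(appName="venv-activation-demo")

    # Driver side: ship the venv tarball to every executor's working directory.
    sc.addFile("venv.tar.gz")

    def activate_venv():
        import os
        import tarfile
        tar_path = SparkFiles.get("venv.tar.gz")
        work_dir = os.path.dirname(tar_path)
        # Assumes the tarball unpacks to a top-level "venv" directory.
        venv_location = os.path.join(work_dir, "venv")
        # Unpack once per executor; later tasks reuse the extracted copy.
        if not os.path.isdir(venv_location):
            with tarfile.open(tar_path) as tar:
                tar.extractall(work_dir)
        activate_env = "%s/bin/activate_this.py" % venv_location
        execfile(activate_env, dict(__file__=activate_env))

    def compute(partition):
        activate_venv()
        import numpy  # should resolve to the venv's numpy -- unless pyspark
                      # (e.g. statcounter.py) already imported a system numpy
        for x in partition:
            yield numpy.sqrt(x)

    print sc.parallelize([1.0, 4.0, 9.0]).mapPartitions(compute).collect()

The extraction guard is not race-free if several tasks hit it at once; the sketch is only meant to show where the activation has to happen relative to the computation, and why an earlier numpy import on the worker defeats it.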