This is an interesting question. I don't have a solution for you, but you may be interested in taking a look at Anaconda Cluster <http://continuum.io/anaconda-cluster>.
It's made by the same people behind Conda (an alternative to pip focused on data science packages) and may offer a better way of doing this. I haven't used it myself, though.

On Thu, May 7, 2015 at 5:20 PM alemagnani <ale.magn...@gmail.com> wrote:

> I am currently using pyspark with a virtualenv.
> Unfortunately I don't have access to the nodes' file system, so I
> cannot manually copy the virtualenv over there.
>
> I have been using this technique:
>
> I first add a tarball with the venv:
>
> sc.addFile(virtual_env_tarball_file)
>
> Then in the code run on the node to do the computation, I activate the
> venv like this:
>
> venv_location = SparkFiles.get(venv_name)
> activate_env = "%s/bin/activate_this.py" % venv_location
> execfile(activate_env, dict(__file__=activate_env))
>
> Is there a better way to do this?
> One of the problems with this approach is that in
> spark/python/pyspark/statcounter.py numpy is imported
> before the venv is activated, and this can cause conflicts with the
> venv's numpy.
>
> Moreover, this requires the venv to be sent around the cluster all the
> time.
> Any suggestions?
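
For what it's worth, here is a self-contained sketch of the technique you describe (Python 2, since activate_this.py is run with execfile). The tarball path and its layout (bin/ at the top level) are assumptions on my part, and note that addFile ships the archive as-is, so each executor has to unpack it itself:

import os
import tarfile

from pyspark import SparkContext, SparkFiles

sc = SparkContext()
# Driver side: ship the packed virtualenv to every executor.
sc.addFile("/path/to/venv.tar.gz")  # hypothetical path

def process(partition):
    # Executor side: locate the shipped tarball and unpack it once
    # per worker (addFile does not unpack archives automatically).
    tarball = SparkFiles.get("venv.tar.gz")
    venv_location = os.path.join(SparkFiles.getRootDirectory(), "venv")
    if not os.path.isdir(venv_location):
        with tarfile.open(tarball) as tf:
            tf.extractall(venv_location)
    # Activate the venv inside this Python process.
    activate_env = "%s/bin/activate_this.py" % venv_location
    execfile(activate_env, dict(__file__=activate_env))
    import numpy  # resolved from the venv, unless already imported
    return [numpy.sum(x) for x in partition]

rdd = sc.parallelize([[1, 2], [3, 4]], 2)
print rdd.mapPartitions(process).collect()

As you note, this doesn't help with modules like numpy that pyspark itself imports before the venv is activated; activate_this.py only affects imports made afterwards in the same process.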