This is an interesting question. I don't have a solution for you, but you
may be interested in taking a look at Anaconda Cluster
<http://continuum.io/anaconda-cluster>.

It's made by the same people behind Conda (an alternative to pip focused on
data science packages) and may offer a better way of doing this. I haven't
used it myself, though.
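
For reference, here is a rough, untested sketch of the end-to-end workflow
you describe, assuming Python 2 (because of execfile), that the tarball
unpacks into a top-level venv/ directory, and with placeholder file names:

    import os
    import tarfile

    from pyspark import SparkContext, SparkFiles

    sc = SparkContext(appName="venv-example")

    # Driver side: ship the virtualenv tarball to every executor.
    sc.addFile("venv.tar.gz")  # placeholder for virtual_env_tarball_file

    def compute(iterator):
        # Worker side: unpack the venv once, then activate it before
        # importing anything that should come from the venv.
        tarball = SparkFiles.get("venv.tar.gz")
        work_dir = os.path.dirname(tarball)
        venv_dir = os.path.join(work_dir, "venv")
        if not os.path.isdir(venv_dir):
            with tarfile.open(tarball) as tf:
                tf.extractall(work_dir)
        activate_env = os.path.join(venv_dir, "bin", "activate_this.py")
        execfile(activate_env, dict(__file__=activate_env))
        import numpy as np  # imported only after activation
        for x in iterator:
            yield float(np.sqrt(x))

    print(sc.parallelize(range(16)).mapPartitions(compute).collect())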

On Thu, May 7, 2015 at 5:20 PM alemagnani <ale.magn...@gmail.com> wrote:

> I am currently using pyspark with a virtualenv.
> Unfortunately I don't have access to the nodes' file system, so I
> cannot manually copy the virtualenv over there.
>
> I have been using this technique:
>
> I first add a tarball containing the venv:
>     sc.addFile(virtual_env_tarball_file)
>
> Then, in the code that runs on the nodes to do the computation, I activate
> the venv like this:
>         from pyspark import SparkFiles
>         venv_location = SparkFiles.get(venv_name)
>         activate_env = "%s/bin/activate_this.py" % venv_location
>         execfile(activate_env, dict(__file__=activate_env))
>
> Is there a better way to do this?
> One of the problems with this approach is that
> spark/python/pyspark/statcounter.py imports numpy
> before the venv is activated, which can cause conflicts with the venv's
> numpy.
>
> Moreover, this requires the venv to be shipped around the cluster every
> time.
> Any suggestions?
>
