Virtualenv pyspark

2015-05-07 Thread alemagnani
I am currently using pyspark with a virtualenv.
Unfortunately I don't have access to the nodes' file system, and therefore I
cannot manually copy the virtualenv over there.

I have been using this technique:

I first add a tarball with the venv on the driver:

sc.addFile(virtual_env_tarball_file)

Then, in the code that runs on the node to do the computation, I activate the
venv like this:

from pyspark import SparkFiles

venv_location = SparkFiles.get(venv_name)
activate_env = "%s/bin/activate_this.py" % venv_location
execfile(activate_env, dict(__file__=activate_env))
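
Roughly, the end-to-end flow looks like this (a minimal sketch; names such as
venv_name and process_partition are placeholders, and it assumes the venv is
available unpacked under the path returned by SparkFiles.get):

from pyspark import SparkContext, SparkFiles

sc = SparkContext()
sc.addFile("/path/to/venv.tar.gz")  # shipped to every executor
venv_name = "venv"  # placeholder for however the unpacked venv is located

def process_partition(rows):
    # Activate the shipped virtualenv before importing anything from it.
    venv_location = SparkFiles.get(venv_name)
    activate_env = "%s/bin/activate_this.py" % venv_location
    execfile(activate_env, dict(__file__=activate_env))
    import numpy  # should now resolve against the venv
    return [float(numpy.sum(row)) for row in rows]

result = (sc.parallelize([[1, 2], [3, 4]], 2)
            .mapPartitions(process_partition)
            .collect())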

Is there a better way to do this? 
One of the problems with this approach is that numpy is imported in
spark/python/pyspark/statcounter.py before the venv is activated, and this can
cause conflicts with the venv's numpy.
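
For example, to check which numpy an executor has actually loaded, I run a
small diagnostic like this (not part of the real job):

def which_numpy(_):
    import numpy
    return [numpy.__file__]

print(sc.parallelize([0], 1).mapPartitions(which_numpy).collect())

Because Python caches modules in sys.modules, if statcounter.py has already
imported the system numpy, this reports the system path rather than the venv
path, even after activation.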

Moreover, this requires the venv to be shipped around the cluster all the
time.
Any suggestions?







Re: Virtualenv pyspark

2015-05-08 Thread Nicholas Chammas
This is an interesting question. I don't have a solution for you, but you
may be interested in taking a look at Anaconda Cluster
<http://continuum.io/anaconda-cluster>.

It's made by the same people behind Conda (an alternative to pip focused on
data science packages) and may offer a better way of doing this. I haven't
used it myself, though.
