[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15647331#comment-15647331 ]

Prasanna Santhanam commented on SPARK-13587:
--------------------------------------------

Thanks for this JIRA and the work related to it - I have been testing this 
patch a little with conda environments.

Previously, I have had reasonable success with zipping the contents of my conda 
environment on the gateway/driver node and submitting the zip file as an 
argument to {{--archives}} on the {{spark-submit}} command line. This approach 
works well because it uses the existing Spark infrastructure to distribute 
dependencies to the workers. You don't even need Anaconda installed on the 
workers, since the zip can package the entire Python installation within it. 
The downside is that conda zip files bloat quickly in a production Spark 
application.
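
For reference, here is roughly what that workflow looks like. All names below 
(environment, archive, application file) are hypothetical, and the 
{{#environment}} fragment assumes a YARN deployment that unpacks archives from 
the distributed cache:

{code}
# 1. Zip the conda environment on the gateway/driver node
#    (hypothetical environment path)
cd /opt/conda/envs/pyspark_env
zip -r ~/pyspark_env.zip .

# 2. Ship it to the workers via --archives and point the executors'
#    Python at the unpacked archive, which YARN links as
#    ./environment inside each container's working directory
export PYSPARK_PYTHON=./environment/bin/python
spark-submit \
  --master yarn \
  --archives ~/pyspark_env.zip#environment \
  my_app.py
{code}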

[~zjffdu] In your approach I find that the driver program still executes on the 
native Python installation and only the workers run within conda (virtualenv) 
environments. Would it not be possible to use the same conda environment 
throughout? I.e. set it up once on the gateway node and propagate it over the 
distributed cache, as mentioned in a related PR comment.

I can always force the driver Python to be conda using {{PYSPARK_PYTHON}} and 
{{PYSPARK_DRIVER_PYTHON}}, but that is not the same conda environment as the 
one created by your PythonWorkerFactory. Or is it not your intention to make 
it work this way?
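
Concretely, the workaround I have in mind is something like the sketch below 
(paths hypothetical); it puts the driver on a local conda environment but 
leaves the workers on whatever environment your factory creates:

{code}
# Hypothetical paths; forces the driver onto a local conda environment.
# The workers still use the environment that PythonWorkerFactory
# creates for them, so the two environments are not the same one.
export PYSPARK_DRIVER_PYTHON=/opt/conda/envs/pyspark_env/bin/python
export PYSPARK_PYTHON=/opt/conda/envs/pyspark_env/bin/python
spark-submit --master yarn my_app.py
{code}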

> Support virtualenv in PySpark
> -----------------------------
>
>                 Key: SPARK-13587
>                 URL: https://issues.apache.org/jira/browse/SPARK-13587
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>            Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment


