[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173228#comment-15173228 ]
Jeff Zhang edited comment on SPARK-13587 at 3/2/16 4:12 AM:
------------------------------------------------------------

This method creates the virtualenv before the Python worker starts, and the virtualenv is application-scoped: after the Spark application finishes, it is cleaned up. The virtualenvs also do not need to live at the same path on every node (in my POC, each one is created in the YARN container's working directory). This means users do not have to install packages manually on each node (sometimes you cannot even install packages on the cluster, for security reasons). That is the biggest benefit and the purpose of this feature: a user can create a virtualenv on demand without touching any node, even without administrator rights. The downside is the extra cost of installing the required packages before the Python worker starts, but for an application that runs for several hours this cost is negligible.

I have implemented a POC for this feature. Here is a simple command showing how to use virtualenv in pyspark (a hypothetical example script is sketched below):

{code}
bin/spark-submit --master yarn --deploy-mode client \
  --conf "spark.pyspark.virtualenv.enabled=true" \
  --conf "spark.pyspark.virtualenv.type=conda" \
  --conf "spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" \
  --conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda" \
  ~/work/virtualenv/spark.py
{code}

There are 4 properties that need to be set:
* spark.pyspark.virtualenv.enabled (enables virtualenv)
* spark.pyspark.virtualenv.type (native/conda are supported; the default is native)
* spark.pyspark.virtualenv.requirements (requirements file for the dependencies)
* spark.pyspark.virtualenv.path (path to the virtualenv/conda executable used to create the virtualenv)

Comments and feedback are welcome on how to improve this and whether it is valuable for users.
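For illustration, here is a minimal, hypothetical spark.py that such a job might run. The file name matches the command above, but the body is an assumption (any script whose executor-side code imports a package from the requirements file would do); a matching conda.txt could be as simple as a single line, e.g. numpy.

{code}
# spark.py - hypothetical driver script. numpy is imported inside the
# executors' Python workers, so it must be available on every node; with
# this feature it comes from the on-demand virtualenv instead of a
# manual per-node install.
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="virtualenv-poc-demo")

    def partition_mean(iterator):
        import numpy as np  # resolved from the per-application virtualenv
        values = list(iterator)
        if values:
            yield float(np.mean(values))

    means = sc.parallelize(range(100), 4).mapPartitions(partition_mean).collect()
    print(means)
    sc.stop()
{code}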
> Support virtualenv in PySpark
> -----------------------------
>
>                 Key: SPARK-13587
>                 URL: https://issues.apache.org/jira/browse/SPARK-13587
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>            Reporter: Jeff Zhang
>
> Currently it is not easy for users to add third-party Python packages in
> pyspark.
> * One way is to use --py-files (suitable for simple dependencies, but not
> for complicated ones, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time consuming,
> and it is not easy to switch between different environments)
> Python now has 2 different virtualenv implementations: one is the native
> virtualenv, the other is conda. This JIRA is about bringing these 2 tools
> to the distributed environment.
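The quoted description mentions the two virtualenv implementations (native virtualenv and conda). As a rough illustration of what bringing these tools to the distributed environment could look like on the worker side, here is a hypothetical Python sketch; it is not the actual POC code, and the function name and layout are assumptions.

{code}
# Hypothetical sketch (not the actual POC code): create an
# application-scoped environment in the container's working directory and
# install the requirements into it, before launching the Python worker.
import os
import subprocess

def setup_virtualenv(virtualenv_bin, requirements, env_type="native"):
    env_dir = os.path.join(os.getcwd(), "pyspark_virtualenv")
    if env_type == "conda":
        # conda can consume the requirements file directly at create time
        subprocess.check_call([virtualenv_bin, "create", "--prefix", env_dir,
                               "--file", requirements, "--yes"])
    else:
        # native virtualenv: create the env, then pip-install the requirements
        subprocess.check_call([virtualenv_bin, env_dir])
        subprocess.check_call([os.path.join(env_dir, "bin", "pip"),
                               "install", "-r", requirements])
    # the returned interpreter would then be used to start the Python worker
    return os.path.join(env_dir, "bin", "python")
{code}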