[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184210#comment-15184210 ]

Mike Sukmanowsky commented on SPARK-13587:
------------------------------------------

[~juliet] I get the concerns about Spark supporting a complex virtualenv 
process. My main objection to only supporting something like --pyspark-python 
is the difficulty we currently face on Amazon EMR; really, any Spark cluster 
where nodes are expected to be added after an application is submitted faces 
the same issue.

We have a bootstrap script which provisions our EMR nodes with the required 
Python dependencies. This approach works well enough for a cluster that runs 
very few applications, but with multiple tenants it quickly becomes unwieldy. 
Ideally, Spark applications could be submitted from a master node without the 
user ever having to worry about dependency management at the node 
bootstrapping level.

An interesting approach to this problem would be to provide some sort of 
--bootstrap option to spark-submit which points to an arbitrary executable. 
Spark would run that executable and check for a 0 exit code before continuing 
to launch the application itself. The script could execute any code, such as 
creating a virtualenv or conda env and installing requirements. If it returned 
a non-zero exit code, the Spark application would not launch.
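To make the idea concrete, here is a minimal sketch of what such a bootstrap 
executable might look like. Note that --bootstrap is only a proposal at this 
point, and the environment path, function name, and requirements file below 
are all hypothetical:

```python
#!/usr/bin/env python
# Hypothetical bootstrap script for the proposed --bootstrap flag: it
# creates a virtualenv (via the stdlib venv module) and installs the
# application's dependencies. Any failure surfaces as a non-zero exit
# code, which under the proposal would abort the application launch.
import os
import subprocess
import venv


def bootstrap(env_dir, requirements):
    # Create the environment only if it does not already exist, so the
    # script is safe to re-run on nodes that are already provisioned.
    if not os.path.isdir(env_dir):
        venv.create(env_dir, with_pip=True)
    # Install dependencies with the environment's own pip; check_call
    # raises (and the script exits non-zero) if pip fails.
    pip = os.path.join(env_dir, "bin", "pip")
    subprocess.check_call([pip, "install", "-r", requirements])


if __name__ == "__main__":
    # Placeholder paths; a real script would take these as arguments.
    bootstrap("/tmp/spark_app_env", "requirements.txt")
```

Under the proposal, spark-submit would simply run whatever executable 
--bootstrap points at on each node; any exception above exits non-zero and 
the launch stops there.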

This generalization keeps the Spark community out of having to support 
conda/virtualenv eccentricities. Thoughts?

> Support virtualenv in PySpark
> -----------------------------
>
>                 Key: SPARK-13587
>                 URL: https://issues.apache.org/jira/browse/SPARK-13587
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>            Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, 
> and not easy when switching between different environments)
> Python now has 2 different virtualenv implementations: the native 
> virtualenv and conda. This JIRA is about bringing these 2 tools to a 
> distributed environment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
