[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184210#comment-15184210 ]
Mike Sukmanowsky commented on SPARK-13587:
------------------------------------------

[~juliet] I understand the concerns about Spark supporting a complex virtualenv process. My main objection to supporting only something like --pyspark-python is the difficulty we currently face on Amazon EMR, and really on any Spark cluster where nodes are expected to be added after an application is submitted. We use a bootstrap script that provisions our EMR nodes with the required Python dependencies. This approach works reasonably well for a cluster that runs very few applications, but with multiple tenants it quickly becomes unwieldy. Ideally, Spark applications could be submitted from a master node without the user ever having to worry about dependency management at the node-bootstrapping level.

An interesting approach to this problem would be to provide some sort of --bootstrap option to spark-submit which points to any executable; Spark would run it and check for a 0 exit code before continuing to launch the application itself. This script could execute arbitrary code, such as creating a virtualenv or conda env and installing requirements. If a non-zero exit code were returned, the Spark application would not continue. This generalization gets the Spark community away from having to support conda/virtualenv eccentricities. Thoughts?

> Support virtualenv in PySpark
> -----------------------------
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not
> for complicated dependencies, especially those with transitive dependencies).
> * Another way is to install packages manually on each node (time-consuming,
> and not easy when switching between environments).
> Python currently has two different virtualenv implementations: one is the
> native virtualenv, the other is through conda. This JIRA is about bringing
> these two tools to the distributed environment.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
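To make the proposed contract concrete, here is a minimal sketch of what a bootstrap executable for the suggested --bootstrap option might look like. Note that the --bootstrap flag is only a proposal and does not exist in spark-submit; the env directory and requirements path below are hypothetical placeholders, not real conventions. The only contract assumed is the one described above: exit 0 to let Spark continue, non-zero to abort the launch.

```python
#!/usr/bin/env python
"""Hypothetical node bootstrap script for the proposed --bootstrap option.

Per the proposal, Spark would run this executable and launch the
application only if it exits with code 0.
"""
import os
import subprocess
import sys

# Placeholder paths; a real deployment would pick its own conventions.
ENV_DIR = "/tmp/spark_app_env"
REQUIREMENTS = "requirements.txt"


def run(cmd):
    """Run a command and return its exit code without raising."""
    return subprocess.call(cmd)


def main():
    # Create the virtualenv only if it does not already exist.
    if not os.path.isdir(ENV_DIR):
        if run([sys.executable, "-m", "virtualenv", ENV_DIR]) != 0:
            return 1  # non-zero exit: Spark would abort the launch
    # Install the application's dependencies into the environment.
    pip = os.path.join(ENV_DIR, "bin", "pip")
    if run([pip, "install", "-r", REQUIREMENTS]) != 0:
        return 1
    return 0  # exit 0: Spark would proceed with the application


if __name__ == "__main__":
    sys.exit(main())
```

The same shape would work for a conda env (swap the virtualenv/pip calls for conda ones), which is exactly why checking only the exit code keeps Spark out of the tool-specific details.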