[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15355172#comment-15355172 ]
Semet edited comment on SPARK-13587 at 6/29/16 12:52 PM:
---------------------------------------------------------

Yes, it looks cool! Here is what I have in mind, tell me if it is the wrong direction:

- Each job should execute in its own environment.
- I love wheels and wheelhouses. Provided that we build all the needed wheels on the same machine as the cluster, or we retrieved the right wheels from PyPI, pip can install all dependencies at lightning speed, without needing an internet connection (I have had to configure proxies for some corporate environments, or maintain an internal mirror, etc.).
- So we deploy the job with a command line such as:

{code}
bin/spark-submit \
  --master $(spark_master) \
  --deploy-mode client \
  --conf "spark.pyspark.virtualenv.enabled=true" \
  --conf "spark.pyspark.virtualenv.type=native" \
  --conf "spark.pyspark.virtualenv.wheelhouse=/path/to/wheelhouse.zip" \
  --conf "spark.pyspark.virtualenv.script=script_name" \
  --conf "spark.pyspark.virtualenv.args='--opt1 --opt2'"
{code}

So:

- {{wheelhouse.zip}} contains all the wheels to install into a fresh virtualenv, so no internet connection is needed. The script is also deployed and installed, provided it was created as a proper Python package (easy to do with pbr). A build sketch follows below.
- {{spark.pyspark.virtualenv.script}} is the entry point of the script. It should be declared in the {{scripts}} section of {{setup.py}}.
- {{spark.pyspark.virtualenv.args}} allows passing extra arguments to the script.

I don't have much experience with YARN or Mesos, what are the big differences?
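Just to illustrate the wheelhouse idea above, here is a minimal sketch of how such a {{wheelhouse.zip}} could be produced; the {{requirements.txt}} name and paths are illustrative assumptions, not part of the proposal:

{code}
# Build wheels for all (transitive) dependencies of the job.
pip wheel -r requirements.txt --wheel-dir wheelhouse/
# Also build a wheel for the job package itself (it has a setup.py).
pip wheel . --wheel-dir wheelhouse/
# Bundle everything into a single archive for spark-submit.
zip -r wheelhouse.zip wheelhouse/
{code}

On the executor side, a fresh virtualenv could then install from the unpacked archive with something like {{pip install --no-index --find-links=wheelhouse/ <package>}}, so no network access is required at execution time.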
> Support virtualenv in PySpark
> -----------------------------
>
>         Key: SPARK-13587
>         URL: https://issues.apache.org/jira/browse/SPARK-13587
>     Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>    Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in pyspark.
> * One way is to use --py-files (suitable for simple dependencies, but not for complicated dependencies, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time-wasting, and not easy to switch to a different environment)
> Python now has two different virtualenv implementations: one is native virtualenv, the other is through conda. This JIRA is about making these two tools work in a distributed environment.
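For comparison with the {{--py-files}} approach mentioned in the description, a rough sketch of what that route looks like today; the file names are illustrative, not from this ticket:

{code}
# Ship extra Python code alongside the driver script. Works for simple,
# pure-Python dependencies, but transitive or native dependencies are not
# resolved, which is the limitation described above.
bin/spark-submit --master yarn --deploy-mode client \
  --py-files deps.zip,helper_module.py \
  my_script.py
{code}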