[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740419#comment-15740419 ]
Nicholas Chammas commented on SPARK-13587:
------------------------------------------

Thanks to a lot of help from [~quasi...@gmail.com] and [his blog post on this problem|http://quasiben.github.io/blog/2016/4/15/conda-spark/], I was able to develop a solution that works for Spark on YARN:

{code}
set -e

# Both these directories exist on all of our YARN nodes.
# Otherwise, everything else is built and shipped out at submit time
# with our application.
export HADOOP_CONF_DIR="/etc/hadoop/conf"
export SPARK_HOME="/hadoop/spark/spark-2.0.2-bin-hadoop2.6"
export PATH="$SPARK_HOME/bin:$PATH"

python3 -m venv venv/
source venv/bin/activate
pip install -U pip
pip install -r requirements.pip
pip install -r requirements-dev.pip
deactivate

# This convoluted zip machinery is to ensure that the paths to the files inside the zip
# look the same to Python when it runs within YARN.
# If there is a simpler way to express this, I'd be interested to know!
pushd venv/
zip -rq ../venv.zip *
popd
pushd myproject/
zip -rq ../myproject.zip *
popd
pushd tests/
zip -rq ../tests.zip *
popd

export PYSPARK_PYTHON="venv/bin/python"

spark-submit \
  --conf "spark.yarn.appMasterEnv.PYSPARK_PYTHON=venv/bin/python" \
  --conf "spark.yarn.appMasterEnv.SPARK_HOME=$SPARK_HOME" \
  --master yarn \
  --deploy-mode client \
  --archives "venv.zip#venv,myproject.zip#myproject,tests.zip#tests" \
  run_tests.py -v
{code}

My solution is based on Ben's, except where Ben uses Conda I just use pip. I don't know whether this solution can be adapted to work with Spark on Mesos or Spark Standalone (and I haven't tried, since my environment is YARN), but if someone figures it out, please post your solution here!

As Ben explains in [his blog post|http://quasiben.github.io/blog/2016/4/15/conda-spark/], this lets you build and ship an isolated environment with your PySpark application out to the YARN cluster. The YARN nodes don't even need to have the correct version of Python (or Python at all!)
installed, because you are shipping out a complete Python environment via the {{--archives}} option. I hope this helps some people who are looking for a workaround they can use today, while a more robust solution is developed directly in Spark.

And I wonder... if this {{--archives}} technique can be extended or translated to Mesos and Standalone somehow, maybe that would be a good enough solution for the time being? People would be able to run their jobs in an isolated Python environment using their tool of choice (conda or pip), and Spark wouldn't need to add any virtualenv-specific machinery.

> Support virtualenv in PySpark
> -----------------------------
>
>                 Key: SPARK-13587
>                 URL: https://issues.apache.org/jira/browse/SPARK-13587
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>            Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in
> PySpark.
> * One way is to use --py-files (suitable for a simple dependency, but not
> suitable for complicated dependencies, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time-wasting, and
> not easy to switch to a different environment)
> Python now has 2 different virtualenv implementations. One is the native
> virtualenv; the other is through conda. This JIRA is trying to bring these 2
> tools to the distributed environment.

-- 
This message was sent by Atlassian JIRA
(v6.3.4#6332)
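[Editorial illustration, not part of the original message.] The "convoluted zip machinery" in the script above exists because YARN unpacks each archive under the alias given after the {{#}} (e.g. {{venv.zip#venv}}), so entries inside the zip must be relative paths ({{bin/python}}, not {{/home/user/venv/bin/python}}) for {{venv/bin/python}} to resolve in the container. A minimal sketch of the same idea using Python's standard {{zipfile}} module, with toy, hypothetical paths:

```python
# Sketch: why the script does `pushd venv/ && zip -rq ../venv.zip *`.
# Entries in the archive must be relative to the venv root, so that after
# YARN unpacks venv.zip under the alias "venv", venv/bin/python exists.
import os
import tempfile
import zipfile

tmp = tempfile.mkdtemp()

# Build a toy "venv" layout (stand-in for a real virtualenv).
os.makedirs(os.path.join(tmp, "venv", "bin"))
with open(os.path.join(tmp, "venv", "bin", "python"), "w") as f:
    f.write("#!/bin/sh\n")

archive = os.path.join(tmp, "venv.zip")
root = os.path.join(tmp, "venv")
with zipfile.ZipFile(archive, "w") as zf:
    for dirpath, _, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            # arcname relative to the venv root -- the zipfile equivalent
            # of zipping from inside the directory with `zip -rq ../venv.zip *`
            zf.write(full, arcname=os.path.relpath(full, root))

print(zipfile.ZipFile(archive).namelist())  # ['bin/python']
```

Zipping from the parent directory instead would produce entries like {{venv/bin/python}}, which after extraction under the alias would become {{venv/venv/bin/python}} and break the {{PYSPARK_PYTHON}} setting.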