[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740419#comment-15740419 ]

Nicholas Chammas commented on SPARK-13587:
------------------------------------------

Thanks to a lot of help from [~quasi...@gmail.com] and [his blog post on this 
problem|http://quasiben.github.io/blog/2016/4/15/conda-spark/], I was able to 
develop a solution that works for Spark on YARN:

{code}
set -e

# Both these directories exist on all of our YARN nodes.
# Otherwise, everything else is built and shipped out at submit-time
# with our application.
export HADOOP_CONF_DIR="/etc/hadoop/conf"
export SPARK_HOME="/hadoop/spark/spark-2.0.2-bin-hadoop2.6"

export PATH="$SPARK_HOME/bin:$PATH"

python3 -m venv venv/
source venv/bin/activate

pip install -U pip
pip install -r requirements.pip
pip install -r requirements-dev.pip

deactivate

# This convoluted zip machinery is to ensure that the paths to the files
# inside the zip look the same to Python when it runs within YARN.
# If there is a simpler way to express this, I'd be interested to know!
pushd venv/
zip -rq ../venv.zip *
popd
pushd myproject/
zip -rq ../myproject.zip *
popd
pushd tests/
zip -rq ../tests.zip *
popd

export PYSPARK_PYTHON="venv/bin/python"
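# The relative path works both locally (the venv/ dir created above) and on
# the cluster: with the "archive.zip#name" syntax used in --archives below,
# YARN unpacks each archive into the container's working directory under the
# alias after the '#', so venv/bin/python resolves inside every container.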

spark-submit \
  --conf "spark.yarn.appMasterEnv.PYSPARK_PYTHON=venv/bin/python" \
  --conf "spark.yarn.appMasterEnv.SPARK_HOME=$SPARK_HOME" \
  --master yarn \
  --deploy-mode client \
  --archives "venv.zip#venv,myproject.zip#myproject,tests.zip#tests" \
  run_tests.py -v
{code}
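
For reference, since the contents of {{run_tests.py}} aren't shown above, here is a rough sketch of what such a driver could look like in this setup (the test and app names are purely illustrative):

{code}
# Hypothetical run_tests.py -- a minimal smoke test, not the actual file.
import sys
import unittest

from pyspark.sql import SparkSession


class VenvSmokeTest(unittest.TestCase):
    def test_executors_run_python3(self):
        spark = SparkSession.builder.appName("venv-smoke-test").getOrCreate()
        try:
            # Each task reports the major version of the interpreter it runs
            # under, which should be the Python 3 shipped in venv.zip.
            versions = (
                spark.sparkContext
                     .parallelize(range(4), 4)
                     .map(lambda _: sys.version_info[0])
                     .collect()
            )
            self.assertTrue(all(v == 3 for v in versions))
        finally:
            spark.stop()


if __name__ == "__main__":
    unittest.main()
{code}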

My solution is based on Ben's, except that where Ben uses conda I just use pip. 
I don't know whether there is a way to adapt this solution to work with Spark 
on Mesos or Spark Standalone (I haven't tried, since my environment is YARN), 
but if someone figures it out, please post your solution here!

As Ben explains in [his blog 
post|http://quasiben.github.io/blog/2016/4/15/conda-spark/], this lets you 
build and ship an isolated environment with your PySpark application out to the 
YARN cluster. The YARN nodes don't even need to have the correct version of 
Python (or Python at all!) installed, because you are shipping out a complete 
Python environment via the {{--archives}} option.
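
If you want to double-check that the executors really are using the shipped environment rather than a system Python, a quick way (my suggestion, not from Ben's post) is to ask each task for its interpreter path; with the setup above it should end in {{venv/bin/python}}:

{code}
import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Collect the distinct interpreter paths seen by the tasks. Expect something
# like ['/.../container_.../venv/bin/python'] rather than '/usr/bin/python'.
interpreters = (
    spark.sparkContext
         .parallelize(range(8), 4)
         .map(lambda _: sys.executable)
         .distinct()
         .collect()
)
print(interpreters)

spark.stop()
{code}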

I hope this helps some people who are looking for a workaround they can use 
today while a more robust solution is developed directly into Spark.

And I wonder... if this {{--archives}} technique can be extended or translated 
to Mesos and Standalone somehow, maybe that would be a good enough solution for 
the time being? People would be able to run their jobs in an isolated Python 
environment using their tool of choice (conda or pip), and Spark wouldn't need 
to add any virtualenv-specific machinery.

> Support virtualenv in PySpark
> -----------------------------
>
>                 Key: SPARK-13587
>                 URL: https://issues.apache.org/jira/browse/SPARK-13587
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>            Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> pyspark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated dependencies, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, 
> and not easy to switch to a different environment)
> Python now has 2 different virtualenv implementations: one is the native 
> virtualenv, the other is through conda. This JIRA is about bringing these 2 
> tools to the distributed environment.


