I have created a JIRA for this feature; comments and feedback are welcome
on how to improve it and whether it's valuable for users.

https://issues.apache.org/jira/browse/SPARK-13587


Here's some background info and the current status of this work.


Currently, it's not easy for users to add third-party Python packages in
PySpark.

   - One way is to use --py-files (suitable for simple dependencies, but
   not for complicated dependencies, especially those with transitive
   dependencies); see the sketch after this list
   - Another way is to install packages manually on each node
   (time-consuming, and not easy to switch between different environments)
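
For reference, a typical --py-files invocation might look like the
following (the archive name, package name, and application script here are
hypothetical):

# package the dependencies by hand, then ship the archive to the executors
zip -r deps.zip mypackage/
bin/spark-submit --master yarn --deploy-mode client \
  --py-files deps.zip \
  app.py

Note that this only ships the code you zip up yourself; any transitive
dependencies have to be resolved and packaged manually, which is exactly
the pain point above.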

Python now has 2 different virtualenv implementations: one is the native
virtualenv, the other is through conda.
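
For example, the requirements file for each type might look like the
following (package names and versions are made up for illustration; the
native type uses pip's requirements format, while conda uses its own
package specs):

# requirements.txt (native virtualenv, pip format)
numpy==1.10.4
pandas==0.17.1

# conda.txt (conda package specs)
numpy=1.10.4
pandas=0.17.1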

I have implemented a POC for this feature. Here's one simple command
showing how to use virtualenv in PySpark:

bin/spark-submit --master yarn --deploy-mode client \
  --conf "spark.pyspark.virtualenv.enabled=true" \
  --conf "spark.pyspark.virtualenv.type=conda" \
  --conf "spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" \
  --conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda" \
  ~/work/virtualenv/spark.py

There are 4 properties that need to be set:

   - spark.pyspark.virtualenv.enabled (enable virtualenv)
   - spark.pyspark.virtualenv.type (native/conda are supported; default is
   native)
   - spark.pyspark.virtualenv.requirements (requirements file for the
   dependencies)
   - spark.pyspark.virtualenv.path (path to the executable for
   virtualenv/conda)
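
Given these properties, the native-virtualenv equivalent of the command
above would presumably look like this (the requirements file and the
virtualenv executable path here are hypothetical):

bin/spark-submit --master yarn --deploy-mode client \
  --conf "spark.pyspark.virtualenv.enabled=true" \
  --conf "spark.pyspark.virtualenv.type=native" \
  --conf "spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/requirements.txt" \
  --conf "spark.pyspark.virtualenv.path=/usr/local/bin/virtualenv" \
  ~/work/virtualenv/spark.py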






Best Regards

Jeff Zhang
