I have created a JIRA for this feature. Comments and feedback are welcome
on how to improve it and whether it is valuable for users.


Here is some background and the current status of this work.

Currently, it's not easy for users to add third-party Python packages in
PySpark:

   - One way is to use --py-files (suitable for simple dependencies, but
   not suitable for complicated dependencies, especially with transitive
   dependencies)
   - Another way is to install packages manually on each node (time consuming,
   and not easy to switch between environments)

Python now has two different virtualenv implementations: one is the native
virtualenv, the other is through conda.
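
To make the difference between the two concrete, here is a rough Python
sketch of the commands each flavor would typically need to build an
environment from a requirements file. The function name, the environment
directory and the file names are made up for illustration, and this is not
necessarily what the POC runs internally.

```python
def env_setup_commands(env_type, executable, env_dir, requirements):
    """Return the commands (as argument lists) that would create an isolated
    environment in env_dir and install the packages listed in requirements.

    Illustrative only: env_dir and requirements are hypothetical paths.
    """
    if env_type == "native":
        # Native virtualenv: create the environment, then install with its pip.
        return [
            [executable, env_dir],
            [env_dir + "/bin/pip", "install", "-r", requirements],
        ]
    if env_type == "conda":
        # conda can create the environment and install packages in one step.
        return [
            [executable, "create", "--prefix", env_dir,
             "--file", requirements, "--yes"],
        ]
    raise ValueError("unsupported virtualenv type: %r" % env_type)


for cmd in env_setup_commands("conda", "conda", "/tmp/pyspark_env",
                              "requirements.txt"):
    print(" ".join(cmd))
```

The native flavor needs two steps (create, then pip install), while conda
needs only one, which is one reason the type has to be configurable.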

I have implemented a POC for this feature. Here is a simple command showing
how to use virtualenv in pyspark:

bin/spark-submit --master yarn --deploy-mode client --conf
"spark.pyspark.virtualenv.enabled=true" --conf
"spark.pyspark.virtualenv.type=conda" --conf
--conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda"

There are 4 properties that need to be set:

   - spark.pyspark.virtualenv.enabled (enable virtualenv)
   - spark.pyspark.virtualenv.type (native/conda are supported, default is
   - spark.pyspark.virtualenv.requirements (requirement file for the
   - spark.pyspark.virtualenv.path (path to the executable for
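
As a minimal sketch of how the four properties fit together, the Python
snippet below assembles a spark-submit command line from a dict of settings.
The helper name build_submit_command, the requirements file path and the
application file my_app.py are placeholders I made up for illustration; only
the property names and the conda path come from the command above.

```python
import shlex


def build_submit_command(props, app, master="yarn", deploy_mode="client"):
    """Build a spark-submit argument list with one --conf per property."""
    cmd = ["bin/spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    for key, value in props.items():
        cmd += ["--conf", "%s=%s" % (key, value)]
    cmd.append(app)  # the application to run goes last
    return cmd


props = {
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.type": "conda",
    # Hypothetical path, replace with your own requirements file.
    "spark.pyspark.virtualenv.requirements": "/path/to/requirements.txt",
    "spark.pyspark.virtualenv.path": "/Users/jzhang/anaconda/bin/conda",
}

# Print a copy-pasteable command line; my_app.py is a placeholder.
print(" ".join(shlex.quote(arg) for arg in build_submit_command(props, "my_app.py")))
```

Keeping the settings in a dict makes it easy to switch between native and
conda by changing only the type and path entries.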

