I have created a JIRA for this feature; comments and feedback are welcome on how to improve it and whether it's valuable for users.
https://issues.apache.org/jira/browse/SPARK-13587

Here's some background info and the status of this work.

Currently, it's not easy for users to add third-party Python packages in PySpark.
- One way is to use --py-files (suitable for simple dependencies, but not for complicated ones, especially those with transitive dependencies)
- Another way is to install packages manually on each node (time-consuming, and not easy to switch between different environments)

Python now has 2 different virtualenv implementations: one is the native virtualenv, the other is through conda.

I have implemented a POC for this feature. Here's one simple command showing how to use virtualenv in PySpark:

bin/spark-submit --master yarn --deploy-mode client \
  --conf "spark.pyspark.virtualenv.enabled=true" \
  --conf "spark.pyspark.virtualenv.type=conda" \
  --conf "spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" \
  --conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda" \
  ~/work/virtualenv/spark.py

There are 4 properties that need to be set:
- spark.pyspark.virtualenv.enabled (enable virtualenv)
- spark.pyspark.virtualenv.type (native/conda are supported, default is native)
- spark.pyspark.virtualenv.requirements (requirements file for the dependencies)
- spark.pyspark.virtualenv.path (path to the executable for virtualenv/conda)

Best Regards
Jeff Zhang
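To make the setup concrete, here is a minimal sketch of what a driver script like spark.py could look like: it fires a few tasks that each report whether a package from the requirements file is importable on their executor, which is a simple way to confirm the virtualenv was provisioned cluster-wide. This is an illustration, not part of the POC itself; the app name, the partition count, and the assumption that numpy is listed in conda.txt are all made up for the example.

```python
def check_import(pkg_name):
    """Runs inside each task: report whether pkg_name is importable
    on the executor that executed this task."""
    import importlib.util
    return (pkg_name, importlib.util.find_spec(pkg_name) is not None)


if __name__ == "__main__":
    # Assumes pyspark is available to the driver (spark-submit sets this up).
    from pyspark import SparkContext

    sc = SparkContext(appName="virtualenv-check")  # hypothetical app name
    # numpy is assumed to appear in the requirements file (conda.txt);
    # with virtualenv enabled, every executor should be able to import it.
    results = (sc.parallelize(range(4), 4)
                 .map(lambda _: check_import("numpy"))
                 .collect())
    print(results)  # e.g. [("numpy", True), ...] if provisioning worked
    sc.stop()
```

If any tuple comes back with False, the requirements were not installed on that executor, which makes this a quick smoke test when trying out the feature.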