Hi PySparkers,

What is currently the best way to ship self-contained PySpark jobs with third-party dependencies? There are some open JIRA issues [1], [2], corresponding PRs [3], [4], and articles [5], [6], [7] on setting up the Python environment with conda and virtualenv respectively. I believe [7] is misleading, because it relies on unsupported Spark options such as spark.pyspark.virtualenv.enabled and spark.pyspark.virtualenv.requirements.
So I'm wondering what the community does in cases when it's necessary to:

- prevent Python package/module version conflicts between different jobs
- avoid updating all the nodes of the cluster when a job introduces new dependencies
- track which dependencies are introduced on a per-job basis

[1] https://issues.apache.org/jira/browse/SPARK-13587
[2] https://issues.apache.org/jira/browse/SPARK-16367
[3] https://github.com/apache/spark/pull/13599
[4] https://github.com/apache/spark/pull/14180
[5] https://www.anaconda.com/blog/developer-blog/conda-spark
[6] http://henning.kropponline.de/2016/09/17/running-pyspark-with-virtualenv
[7] https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html
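To frame the question: the workaround I've seen most often, roughly what the Anaconda article [5] describes, is to pack a per-job environment on the submitting machine and ship it to executors via --archives, without relying on the unsupported spark.pyspark.virtualenv.* options. A sketch, assuming conda and conda-pack are installed and a YARN cluster; all environment and file names below are illustrative:

```shell
# Build and pack a job-specific conda environment (illustrative packages).
conda create -y -n my_job_env python=3.7 pandas pyarrow
conda activate my_job_env
conda pack -f -o my_job_env.tar.gz

# Ship the archive with the job. YARN unpacks it into each container's
# working directory under the alias given after '#', so both the driver
# (in cluster mode) and the executors use the bundled interpreter.
PYSPARK_PYTHON=./environment/bin/python \
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
  --archives my_job_env.tar.gz#environment \
  my_job.py
```

This keeps dependencies isolated per job (the archive itself doubles as a record of what the job needs) and avoids installing anything on the cluster nodes, at the cost of shipping the environment with every submission.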