Hi PySparkers,

What is currently the best way to ship self-contained PySpark jobs with third-party dependencies? There are some open JIRA issues [1], [2], corresponding PRs [3], [4], and articles [5], [6] on setting up the Python environment with conda and virtualenv, respectively.
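For context, the conda-based approach from [5] boils down to packing the environment into an archive, shipping it with the job, and pointing the workers' Python at the unpacked interpreter. A minimal sketch, assuming a YARN deployment and an archive built beforehand with conda-pack (the archive name, alias, and app name below are placeholders, not anything from the linked tickets):

# Minimal sketch, assuming YARN and an archive produced beforehand, e.g.:
#   conda pack -n my_job_env -o environment.tar.gz
# "environment.tar.gz", the "#environment" alias and the app name are placeholders.
import os
from pyspark.sql import SparkSession

# Executors launch their Python workers from the interpreter unpacked out of
# the shipped archive (relative to the executor's working directory).
os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"

spark = (
    SparkSession.builder
    .appName("self_contained_job")
    # Distribute the packed env to every node; YARN unpacks it under the alias.
    .config("spark.yarn.dist.archives", "environment.tar.gz#environment")
    .getOrCreate()
)

The virtualenv article [6] follows the same pattern with a packed virtualenv instead of a conda env; the JIRAs and PRs above aim to make this kind of per-job environment a first-class spark-submit feature rather than a manual workaround.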
So I'm wondering what the community does when it's necessary to:
- prevent Python package/module version conflicts between different jobs
- avoid updating all the nodes of the cluster whenever a job gains new dependencies
- track which dependencies are introduced on a per-job basis

[1] https://issues.apache.org/jira/browse/SPARK-13587
[2] https://issues.apache.org/jira/browse/SPARK-16367
[3] https://github.com/apache/spark/pull/13599
[4] https://github.com/apache/spark/pull/14180
[5] https://www.anaconda.com/blog/developer-blog/conda-spark/
[6] http://henning.kropponline.de/2016/09/17/running-pyspark-with-virtualenv/