[ https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16737806#comment-16737806 ]
Hyukjin Kwon commented on SPARK-25433:
--------------------------------------

The blog is actually pretty cool.

> Add support for PEX in PySpark
> ------------------------------
>
>                 Key: SPARK-25433
>                 URL: https://issues.apache.org/jira/browse/SPARK-25433
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.2.2
>            Reporter: Fabian Höring
>            Priority: Minor
>
> The goal of this ticket is to ship and use custom code inside the Spark
> executors using [PEX|https://github.com/pantsbuild/pex].
>
> This currently works fine with
> [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]
> (the disadvantages are that you need a separate conda package repo and that
> you ship the Python interpreter every time).
>
> Basically the workflow is:
> * zip the local conda environment ([conda pack|https://github.com/conda/conda-pack]
>   also works)
> * ship it to each executor as an archive
> * point PYSPARK_PYTHON at the local conda environment
>
> I think it can work the same way with virtualenv. There is the SPARK-13587
> ticket to provide nice entry points to spark-submit and SparkContext, but
> zipping your local virtualenv and then just changing the PYSPARK_PYTHON env
> variable should already work.
>
> I have also seen this
> [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
> But recreating the virtualenv each time doesn't seem to be a very scalable
> solution: if you have hundreds of executors, each executor retrieves the
> packages and recreates the virtual environment every time. The proposal in
> SPARK-16367 has the same problem, from what I understood.
>
> Another problem with virtualenv is that your local environment is not easily
> shippable to another machine.
> In particular there is the relocatable option (see
> [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
> [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work]),
> which makes it very complicated for the user to ship the virtualenv and be
> sure it works.
>
> And here is where PEX comes in. It is a nice way to create a single
> executable zip file with all dependencies included. You have the pex command
> line tool to build your package, and once it is built you are sure it works.
> This is in my opinion the most elegant way to ship Python code (better than
> virtualenv and conda).
>
> The reason it doesn't work out of the box is that a PEX file has only a
> single entry point. So just shipping the pex files and setting PYSPARK_PYTHON
> to the pex files doesn't work. You can nevertheless tune the env variable
> [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
> at runtime to provide different entry points.
>
> PR: [https://github.com/apache/spark/pull/22422/files]

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
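The conda workflow described in the ticket (pack the environment, ship it as an archive, point PYSPARK_PYTHON at it) can be sketched as a spark-submit invocation. This is an illustrative sketch for YARN cluster mode, not the ticket's proposed interface; the environment name "myenv", the archive alias "environment", and "my_app.py" are hypothetical.

```shell
# Hedged sketch of the conda workflow: pack the env, ship it as an
# archive, and point PYSPARK_PYTHON at the unpacked interpreter.
# "myenv", "environment", and "my_app.py" are illustrative names.

# 1. Pack the local conda environment into a relocatable archive
#    (conda-pack; zipping the env directory also works).
conda pack -n myenv -o myenv.tar.gz

# 2. Ship the archive to every executor; the "#environment" suffix is
#    the directory name Spark unpacks it under in each container.
PYSPARK_PYTHON=./environment/bin/python \
spark-submit \
  --master yarn --deploy-mode cluster \
  --archives myenv.tar.gz#environment \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
  my_app.py
```

The archive is distributed once per container rather than resolved per executor, which is what makes this scale better than recreating a virtualenv on every executor.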
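The PEX variant could look like the following. Again a sketch under assumptions, not the interface proposed in the PR: package and file names are illustrative, and it assumes the PEX is built without a hard-coded entry point, so that executing it behaves like a Python interpreter with the dependencies on its path.

```shell
# Hedged sketch of the PEX approach: build one self-contained executable
# file and use it directly as the Python interpreter on the executors.
# Package names and file names are illustrative.

# 1. Build a PEX with the job's dependencies. Without -e/-m the
#    resulting file drops into an interpreter, which is what
#    PYSPARK_PYTHON needs.
pex numpy pandas -o myenv.pex

# 2. Ship the single file and use it as the interpreter. If the PEX was
#    built with an entry point, the PEX_MODULE env variable can override
#    it at runtime, as the ticket describes.
PYSPARK_PYTHON=./myenv.pex \
spark-submit \
  --master yarn --deploy-mode cluster \
  --files myenv.pex \
  --conf spark.executorEnv.PEX_ROOT=./.pex \
  my_app.py
```

Because the PEX is a single file that is known to work once built, there is no per-executor dependency resolution and no relocatability concern, which is the ticket's core argument for PEX over virtualenv and conda.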