This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.1 by this push:
     new 16739f3  [SPARK-33824][PYTHON][DOCS] Restructure and improve Python package management page
16739f3 is described below

commit 16739f3cece54adaae27757c90f0003f417757f0
Author: HyukjinKwon <gurwls...@apache.org>
AuthorDate: Fri Dec 18 10:03:07 2020 +0900

    [SPARK-33824][PYTHON][DOCS] Restructure and improve Python package management page

    ### What changes were proposed in this pull request?

    This PR proposes to restructure and refine the Python dependency management page.
    I recently wrote a blog post which will be published soon, and decided to contribute some of the contents back to the PySpark documentation.
    FWIW, it has been reviewed by some tech writers and engineers.

    I built the site to make the review easier: https://hyukjin-spark.readthedocs.io/en/stable/user_guide/python_packaging.html

    ### Why are the changes needed?

    For better documentation.

    ### Does this PR introduce _any_ user-facing change?

    It's a doc change, but only in unreleased branches for now.

    ### How was this patch tested?

    I manually built the docs as:

    ```bash
    cd python/docs
    make clean html
    open
    ```

    Closes #30822 from HyukjinKwon/SPARK-33824.

    Authored-by: HyukjinKwon <gurwls...@apache.org>
    Signed-off-by: HyukjinKwon <gurwls...@apache.org>
    (cherry picked from commit 6315118676c99ccef2566c50ab9873de8876e468)
    Signed-off-by: HyukjinKwon <gurwls...@apache.org>
---
 python/docs/source/user_guide/python_packaging.rst | 200 +++++++++++++--------
 1 file changed, 125 insertions(+), 75 deletions(-)

diff --git a/python/docs/source/user_guide/python_packaging.rst b/python/docs/source/user_guide/python_packaging.rst
index 0aff6dc..71d8e53 100644
--- a/python/docs/source/user_guide/python_packaging.rst
+++ b/python/docs/source/user_guide/python_packaging.rst
@@ -17,7 +17,7 @@
 =========================
-3rd Party Python Packages
+Python Package Management
 =========================
 
 When you want to run your PySpark application on a cluster such as YARN, Kubernetes, Mesos, etc., you need to make
@@ -51,10 +51,11 @@ Here is the script ``app.py`` from the previous example that will be executed on
         main(SparkSession.builder.getOrCreate())
 
-There are multiple ways to ship the dependencies to the cluster:
+There are multiple ways to manage Python dependencies in the cluster:
 
 - Using PySpark Native Features
--- Using Zipped Virtual Environment
+- Using Conda
+- Using Virtualenv
 - Using PEX
 
@@ -62,54 +63,51 @@ Using PySpark Native Features
 -----------------------------
 
 PySpark allows you to upload Python files (``.py``), zipped Python packages (``.zip``), and Egg files (``.egg``)
-to the executors by setting the configuration setting ``spark.submit.pyFiles`` or by directly
-calling :meth:`pyspark.SparkContext.addPyFile`.
+to the executors by:
 
-This is an easy way to ship additional custom Python code to the cluster. You can just add individual files or zip whole
-packages and upload them. Using :meth:`pyspark.SparkContext.addPyFile` allows to upload code
-even after having started your job.
+- Setting the ``spark.submit.pyFiles`` configuration
+- Setting the ``--py-files`` option in Spark scripts
+- Directly calling :meth:`pyspark.SparkContext.addPyFile` in applications
 
-Note that it doesn't allow to add packages built as `Wheels <https://www.python.org/dev/peps/pep-0427/>`_ and therefore doesn't
-allow to include dependencies with native code.
+This is a straightforward method to ship additional custom Python code to the cluster. You can just add individual files or zip whole
+packages and upload them. Using :meth:`pyspark.SparkContext.addPyFile` allows you to upload code even after having started your job.
+However, it does not allow adding packages built as `Wheels <https://www.python.org/dev/peps/pep-0427/>`_ and therefore
+does not allow including dependencies with native code.
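
For instance, shipping a zipped pure-Python package with ``spark-submit`` might look like the sketch below;
``my_package`` and ``deps.zip`` are hypothetical names, and ``app.py`` is the script from the example above:

.. code-block:: bash

    # Zip a pure-Python package directory so its top-level name stays importable.
    zip -r deps.zip my_package
    # Ship the zip to the executors alongside the application script.
    spark-submit --py-files deps.zip app.py
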

-Using Zipped Virtual Environment
---------------------------------
 
-The idea of zipped environments is to zip your whole `virtual environment <https://docs.python.org/3/tutorial/venv.html>`_,
-ship it to the cluster, unzip it remotely and target the Python interpreter from inside this zipped environment.
+Using Conda
+-----------
 
-Zip Virtual Environment
-~~~~~~~~~~~~~~~~~~~~~~~
+`Conda <https://docs.conda.io/en/latest/>`_ is one of the most widely-used Python package management systems. PySpark users can directly
+use a Conda environment to ship their third-party Python packages by leveraging
+`conda-pack <https://conda.github.io/conda-pack/spark.html>`_, a command line tool that creates
+relocatable Conda environments.
 
-You can zip the virtual environment on your own or use tools for doing this:
 
-* `conda-pack <https://conda.github.io/conda-pack/spark.html>`_ for conda environments
-* `venv-pack <https://jcristharif.com/venv-pack/spark.html>`_ for virtual environments
 
-Example with `conda-pack`:
+The example below creates a Conda environment to use on both the driver and executor and packs
+it into an archive file. This archive file captures the Conda environment for Python and stores
+both the Python interpreter and all its relevant dependencies.
 
 .. code-block:: bash
 
-    conda create -y -n pyspark_env -c conda-forge pyarrow==2.0.0 pandas==1.1.4 conda-pack==0.5.0
-    conda activate pyspark_env
-    conda pack -f -o pyspark_env.tar.gz
 
-Upload to Spark Executors
-~~~~~~~~~~~~~~~~~~~~~~~~~
+    conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas conda-pack
+    conda activate pyspark_conda_env
+    conda pack -f -o pyspark_conda_env.tar.gz
 
-Unzipping will be done by Spark when using target ``--archives`` option in spark-submit
-or setting ``spark.archives`` configuration.
+After that, you can ship it together with scripts or in the code by using the ``--archives`` option
+or the ``spark.archives`` configuration (``spark.yarn.dist.archives`` in YARN). It automatically unpacks the archive on executors.
 
-Example with ``spark-submit``:
+In the case of a ``spark-submit`` script, you can use it as follows:
 
 .. code-block:: bash
 
     export PYSPARK_DRIVER_PYTHON=python
     export PYSPARK_PYTHON=./environment/bin/python
-    spark-submit --master=... --archives pyspark_env.tar.gz#environment app.py
+    spark-submit --archives pyspark_conda_env.tar.gz#environment app.py
 
-Example using ``SparkSession.builder``:
+Note that ``PYSPARK_DRIVER_PYTHON`` above is not required for cluster modes in YARN or Kubernetes.
+
+If you’re on a regular Python shell or notebook, you can try it as shown below:
 
 .. code-block:: python
 
@@ -118,67 +116,117 @@ Example using ``SparkSession.builder``:
     from app import main
 
     os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"
-    spark = SparkSession.builder.master("...").config("spark.archives", "pyspark_env.tar.gz#environment").getOrCreate()
+    spark = SparkSession.builder.config(
+        "spark.archives",  # 'spark.yarn.dist.archives' in YARN.
+        "pyspark_conda_env.tar.gz#environment").getOrCreate()
     main(spark)
 
-Example with ``pyspark`` shell:
+For a pyspark shell:
 
 .. code-block:: bash
 
     export PYSPARK_DRIVER_PYTHON=python
     export PYSPARK_PYTHON=./environment/bin/python
-    pyspark --master=... --archives pyspark_env.tar.gz#environment
+    pyspark --archives pyspark_conda_env.tar.gz#environment
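
Before shipping the archive, you can sanity-check it locally. This is only an illustrative sketch, not part of the
page itself; it assumes nothing beyond the ``pyspark_conda_env.tar.gz`` file created above:

.. code-block:: bash

    # Unpack the relocatable environment and confirm the bundled interpreter
    # can import the packaged dependencies.
    mkdir -p environment
    tar -xzf pyspark_conda_env.tar.gz -C environment
    ./environment/bin/python -c "import pyarrow, pandas; print(pandas.__version__)"
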

-Using PEX
----------
+Using Virtualenv
+----------------
 
-`PEX <https://github.com/pantsbuild/pex>`_ is a library for generating ``.pex`` (Python EXecutable) files.
-A PEX file is a self-contained executable Python environment. It can be seen as the Python equivalent of Java uber-JARs (a.k.a. fat JARs).
+`Virtualenv <https://virtualenv.pypa.io/en/latest/>`_ is a Python tool to create isolated Python environments.
+Since Python 3.3, a subset of its features has been integrated into Python as a standard library under
+the `venv <https://docs.python.org/3/library/venv.html>`_ module. PySpark users can use virtualenv to manage
+Python dependencies in their clusters by using `venv-pack <https://jcristharif.com/venv-pack/index.html>`_
+in a similar way to conda-pack.
 
-You need to build the PEX file somewhere with all your requirements and then upload it to each Spark executor.
+A virtual environment to use on both driver and executor can be created as demonstrated below.
+It packs the current virtual environment into an archive file that contains both the Python interpreter
+and the dependencies.
 
-Using CLI to Build PEX file
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. code-block:: bash
 
-    pex pyspark==3.0.1 pyarrow==0.15.1 pandas==0.25.3 -o myarchive.pex
+    python -m venv pyspark_venv
+    source pyspark_venv/bin/activate
+    pip install pyarrow pandas venv-pack
+    venv-pack -o pyspark_venv.tar.gz
 
+You can directly pass/unpack the archive file and enable the environment on executors by leveraging
+the ``--archives`` option or the ``spark.archives`` configuration (``spark.yarn.dist.archives`` in YARN).
 
-Invoking the PEX file will by default invoke the Python interpreter. pyarrow, pandas and pyspark will be included in the PEX file.
+For ``spark-submit``, you can use it by running the command as follows. Also, notice that
+``PYSPARK_DRIVER_PYTHON`` is not necessary in Kubernetes or YARN cluster modes.
 
 .. code-block:: bash
 
-    ./myarchive.pex
-    Python 3.6.6 (default, Jan 26 2019, 16:53:05)
-    (InteractiveConsole)
-    >>> import pyarrow
-    >>> import pandas
-    >>> import pyspark
-    >>>
+    export PYSPARK_DRIVER_PYTHON=python
+    export PYSPARK_PYTHON=./environment/bin/python
+    spark-submit --archives pyspark_venv.tar.gz#environment app.py
 
-This can also be done directly with the Python API. For more information on how to build PEX files,
-please refer to `Building .pex files <https://pex.readthedocs.io/en/stable/buildingpex.html>`_
+For regular Python shells or notebooks:
 
-Upload to Spark Executors
-~~~~~~~~~~~~~~~~~~~~~~~~~
+.. code-block:: python
 
-The upload can be done by setting ``--files`` option in spark-submit or setting ``spark.files`` configuration (``spark.yarn.dist.files`` on YARN)
-and changing the ``PYSPARK_PYTHON`` environment variable to change the Python interpreter to the PEX executable on each executor.
+    import os
+    from pyspark.sql import SparkSession
+    from app import main
 
-..
-    TODO: we should also document the way on other cluster modes.
+    os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"
+    spark = SparkSession.builder.config(
+        "spark.archives",  # 'spark.yarn.dist.archives' in YARN.
+        "pyspark_venv.tar.gz#environment").getOrCreate()
+    main(spark)
 
-Example with ``spark-submit`` on YARN:
+In the case of a pyspark shell:
 
 .. code-block:: bash
 
     export PYSPARK_DRIVER_PYTHON=python
-    export PYSPARK_PYTHON=./myarchive.pex
-    spark-submit --master=yarn --deploy-mode client --files myarchive.pex app.py
+    export PYSPARK_PYTHON=./environment/bin/python
+    pyspark --archives pyspark_venv.tar.gz#environment
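
If your dependencies are already pinned in a requirements file, the same flow works. This is only an
illustrative sketch and assumes a hypothetical ``requirements.txt``:

.. code-block:: bash

    python -m venv pyspark_venv
    source pyspark_venv/bin/activate
    # Install the pinned dependencies plus venv-pack, then pack the environment.
    pip install -r requirements.txt venv-pack
    venv-pack -o pyspark_venv.tar.gz
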

+Using PEX
+---------
 
-Example using ``SparkSession.builder`` on YARN:
+PySpark can also use `PEX <https://github.com/pantsbuild/pex>`_ to ship the Python packages
+together. PEX is a tool that creates a self-contained Python environment. This is similar
+to Conda or virtualenv, but a ``.pex`` file is executable by itself.
+
+The following example creates a ``.pex`` file for the driver and executor to use.
+The file contains the Python dependencies specified with the ``pex`` command.
+
+.. code-block:: bash
+
+    pip install pyarrow pandas pex
+    pex pyspark pyarrow pandas -o pyspark_pex_env.pex
+
+This file behaves similarly to a regular Python interpreter.
+
+.. code-block:: bash
+
+    ./pyspark_pex_env.pex -c "import pandas; print(pandas.__version__)"
+    1.1.5
+
+However, a ``.pex`` file does not include a Python interpreter itself under the hood, so all
+nodes in a cluster should have the same Python interpreter installed.
+
+In order to transfer and use the ``.pex`` file in a cluster, you should ship it via the
+``spark.files`` configuration (``spark.yarn.dist.files`` in YARN) or the ``--files`` option, because it is a regular file instead
+of a directory or an archive file.
+
+For application submission, you run the commands as shown below.
+Note that ``PYSPARK_DRIVER_PYTHON`` is not needed for cluster modes in YARN or Kubernetes,
+and you may also need to set the ``PYSPARK_PYTHON`` environment variable on
+the AppMaster, e.g. ``--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./pyspark_pex_env.pex``, in YARN cluster mode.
+
+.. code-block:: bash
+
+    export PYSPARK_DRIVER_PYTHON=python
+    export PYSPARK_PYTHON=./pyspark_pex_env.pex
+    spark-submit --files pyspark_pex_env.pex app.py
+
+For regular Python shells or notebooks:
 
 .. code-block:: python
 
@@ -186,19 +234,21 @@ Example using ``SparkSession.builder`` on YARN:
     from pyspark.sql import SparkSession
     from app import main
 
-    os.environ['PYSPARK_PYTHON']="./myarchive.pex"
-    builder = SparkSession.builder
-    builder.master("yarn") \
-        .config("spark.submit.deployMode", "client") \
-        .config("spark.yarn.dist.files", "myarchive.pex")
-    spark = builder.getOrCreate()
+    os.environ['PYSPARK_PYTHON'] = "./pyspark_pex_env.pex"
+    spark = SparkSession.builder.config(
+        "spark.files",  # 'spark.yarn.dist.files' in YARN.
+        "pyspark_pex_env.pex").getOrCreate()
     main(spark)
 
-Notes
-~~~~~
+For the interactive pyspark shell, the commands are almost the same:
 
-* The Python interpreter that has been used to generate the PEX file must be available on each executor. PEX doesn't include the Python interpreter.
 
 .. code-block:: bash
 
-* In YARN cluster mode you may also need to set ``PYSPARK_PYTHON`` environment variable on the AppMaster ``--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./myarchive.pex``.
+    export PYSPARK_DRIVER_PYTHON=python
+    export PYSPARK_PYTHON=./pyspark_pex_env.pex
+    pyspark --files pyspark_pex_env.pex
 
-* An end-to-end Docker example for deploying a standalone PySpark with ``SparkSession.builder`` and PEX can be found `here <https://github.com/criteo/cluster-pack/blob/master/examples/spark-with-S3/README.md>`_ - it uses cluster-pack, a library on top of PEX that automatizes the the intermediate step of having to create & upload the PEX manually.
+An end-to-end Docker example for deploying a standalone PySpark with ``SparkSession.builder`` and PEX
+can be found `here <https://github.com/criteo/cluster-pack/blob/master/examples/spark-with-S3/README.md>`_
+- it uses cluster-pack, a library on top of PEX that automates the intermediate step of having
+to create & upload the PEX manually.
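
As a rough sketch of the YARN cluster mode case mentioned above (the exact flags depend on your cluster,
and ``app.py`` is the application script from the earlier example), the submission might look like this:

.. code-block:: bash

    # In YARN cluster mode the driver runs on the AppMaster, so point its
    # PYSPARK_PYTHON at the shipped .pex file via the AppMaster environment.
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./pyspark_pex_env.pex \
      --files pyspark_pex_env.pex \
      app.py
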

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org