HyukjinKwon opened a new pull request #30486:
URL: https://github.com/apache/spark/pull/30486


   ### What changes were proposed in this pull request?
   
   TL;DR:
   - This PR completes the support of archives in Spark itself instead of Yarn only
   - After this PR, PySpark users can use Conda to ship Python packages together, as below:
       ```bash
       conda create -y -n pyspark_env -c conda-forge pyarrow==2.0.0 pandas==1.1.4 conda-pack==0.5.0
       conda activate pyspark_env
       conda pack -f -o pyspark_env.tar.gz
       PYSPARK_DRIVER_PYTHON=python PYSPARK_PYTHON=./environment/bin/python pyspark --archives pyspark_env.tar.gz#environment
       ```
   
   
   This PR proposes to add a native `--archives` option to Spark submit, together with a `spark.archives` configuration. Currently, both are supported only in Yarn mode:
   
   ```bash
   ./bin/spark-submit --help
   ```
   
    ```
    Options:
    ...
     Spark on YARN only:
      --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
      --archives ARCHIVES         Comma separated list of archives to be extracted into the
                                  working directory of each executor.
    ```
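
   With this change, the same can be done natively in Spark submit and, equivalently, through configuration. Below is a minimal sketch of the intended usage, assuming this PR is merged; the app name and master URL are only illustrative, and the archive is the one packed in the TL;DR above:

    ```python
    from pyspark import SparkConf, SparkContext

    # A minimal sketch, assuming this PR is merged: the new spark.archives
    # configuration mirrors the --archives option. The "#environment" fragment
    # names the directory the archive is unpacked into in each executor's
    # working directory.
    conf = (
        SparkConf()
        .setAppName("archives-demo")           # illustrative
        .setMaster("local-cluster[2,1,1024]")  # illustrative
        .set("spark.archives", "pyspark_env.tar.gz#environment")
    )
    sc = SparkContext(conf=conf)
    ```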
   
   This `archives` feature is often useful when you have to ship a directory and unpack it into the executors' working directories. One example is shipping native libraries to load, e.g., via JNI (see the sketch after this paragraph). Another example is shipping Python packages together as a Conda environment.
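
   For the native-library case, a rough sketch of executor-side code once an archive is unpacked into the working directory. It uses Python's `ctypes` rather than JNI purely to illustrate the path layout; the `native_libs.tar.gz#native` archive and the library name are hypothetical:

    ```python
    import ctypes

    def call_native(_):
        # Hypothetical: an archive submitted as native_libs.tar.gz#native is
        # unpacked into each executor's working directory, so this relative
        # path resolves to the shipped shared library.
        lib = ctypes.CDLL("./native/libexample.so")  # hypothetical library name
        return lib._name

    # Given a running SparkContext `sc` (e.g., from the sketch above), the
    # function executes on executors where the archive has been unpacked.
    print(sc.parallelize(range(2), 2).map(call_native).collect())
    ```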
   
   Especially for Conda, PySpark currently does not have a nice way to ship packages that works in general; see also https://hyukjin-spark.readthedocs.io/en/stable/user_guide/python_packaging.html#using-zipped-virtual-environment (the new PySpark documentation demo for 3.1.0).
   
   The neatest way is arguably to ship a zipped Conda environment, but that currently depends on this archive feature. NOTE that we are able to use `spark.files` by relying on its undocumented behaviour of untarring `tar.gz` files, but I don't think we should document such a workaround and promote people to rely on it.
   
   Also, note that this PR does not yet target feature parity with `spark.files.overwrite`, `spark.files.useFetchCache`, etc. I documented this as an experimental feature as well.
   
   ### Why are the changes needed?
   
   To complete the feature parity, and to provide better support for shipping Python libraries together with a Conda environment.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, this makes `--archives` work in Spark itself instead of Yarn only, and adds a new configuration `spark.archives`.
   
   ### How was this patch tested?
   
   I added unit tests. I also manually tested it in standalone cluster, local-cluster, and local modes.
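
   As a rough illustration of the manual testing, here is a hypothetical check one could run in a `pyspark` shell launched as in the TL;DR above, confirming that tasks import pandas from the unpacked Conda environment rather than the system Python:

    ```python
    # Run inside a pyspark shell started with
    #   PYSPARK_PYTHON=./environment/bin/python pyspark --archives pyspark_env.tar.gz#environment
    # where `sc` is predefined; each task imports pandas from the shipped environment.
    versions = sc.parallelize(range(2), 2).map(
        lambda _: __import__("pandas").__version__
    ).collect()
    print(versions)  # expected: ['1.1.4', '1.1.4'] for the environment packed above
    ```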

