[ 
https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177760#comment-17177760
 ] 

Fabian Höring commented on SPARK-32187:
---------------------------------------

[~hyukjin.kwon]
 I started working on it. The new doc looks pretty nice! Thanks for the effort 
on this. 
 I think I can also write about py-files and zipped environments.

Here is a first (in-progress) draft. I will make it consistent across the 
examples. All links target the current doc.
 
[https://github.com/fhoering/spark/commit/843b1caa27594bc4bc3cb9637da6f8695db66fbe]
 I will be on holiday for 2 weeks, so no progress will be made during that 
time. It would be nice if you had time to have a look and give some feedback on 
the comments below.

Some considerations:

It is structured around the vectorized udf example:
 - Using PEX
 - Using a zipped virtual environment
 - Using py files
 - What about the Spark jars?

I referenced these external tools (I don't have any affiliation with them):
 - [https://github.com/pantsbuild/pex]
 - [https://conda.github.io/conda-pack/spark.html] => seems to be the only 
option for shipping conda environments for now, afaik
 - [https://jcristharif.com/venv-pack/spark.html] => it handles zipped 
virtualenvs; personally I would recommend PEX because it is self-contained, but 
I added it for completeness

I also referenced my Docker Spark standalone e2e example. I don't really want 
to promote my own stuff here, but I think it could be helpful for people to 
have something running directly, since the examples always strip some code. If 
you think it should not be there, we can remove it. I also don't mind moving 
it to the Spark repo.

Some stuff I'm not sure about:
{quote}The unzip will be done by Spark when using target ``--archives`` option 
in spark-submit 
 or setting ``spark.yarn.dist.archives`` configuration.
{quote}
It seems like there is no way to set the archives as a config param when not 
running on YARN. I checked the doc and the Spark code, so it seems 
inconsistent. Can you check or confirm?
{quote}It doesn't allow adding packages built as `Wheels 
<[https://www.python.org/dev/peps/pep-0427/]>`_ and therefore doesn't allow 
including dependencies with native code.
{quote}
I think that is the case, but we need to check to be sure the doc doesn't say 
something wrong. I can try adding a wheel and see if it works.

There is maybe one sentence to say about Docker also. Basically, what is 
described here is the lightweight Python way to do it.

> User Guide - Shipping Python Package
> ------------------------------------
>
>                 Key: SPARK-32187
>                 URL: https://issues.apache.org/jira/browse/SPARK-32187
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Documentation, PySpark
>    Affects Versions: 3.1.0
>            Reporter: Hyukjin Kwon
>            Priority: Major
>
> - Zipped file
> - Python files
> - PEX \(?\) (see also SPARK-25433)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
