[jira] [Commented] (SPARK-32187) User Guide - Shipping Python Package

2020-09-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198363#comment-17198363
 ] 

Apache Spark commented on SPARK-32187:
--

User 'fhoering' has created a pull request for this issue:
https://github.com/apache/spark/pull/29806

> User Guide - Shipping Python Package
> 
>
> Key: SPARK-32187
> URL: https://issues.apache.org/jira/browse/SPARK-32187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Fabian Höring
>Priority: Major
>
> - Zipped file
> - Python files
> - Virtualenv with Yarn
> - PEX \(?\) (see also SPARK-25433)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32187) User Guide - Shipping Python Package

2020-09-18 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198362#comment-17198362
 ] 

Fabian Höring commented on SPARK-32187:
---

[~hyukjin.kwon]

Voilà: [https://github.com/apache/spark/pull/29806]

I spent some time getting Spark 3.0.1 to work on our cluster, testing all the 
examples against it, and making the doc more concise.

I had some issues with pyspark 3.0.1 combined with the latest pyarrow and the 
latest pandas, so I pinned the versions for now to get something merged. We can 
revisit this afterwards.

Here is another recent blog post, if you are interested: 
[https://www.inovex.de/blog/isolated-virtual-environments-pyspark/]
It is all covered in the doc, I would say.
 
From my point of view it looks really good now.

 




[jira] [Commented] (SPARK-32187) User Guide - Shipping Python Package

2020-09-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198360#comment-17198360
 ] 

Apache Spark commented on SPARK-32187:
--

User 'fhoering' has created a pull request for this issue:
https://github.com/apache/spark/pull/29806




[jira] [Commented] (SPARK-32187) User Guide - Shipping Python Package

2020-09-15 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17196634#comment-17196634
 ] 

Hyukjin Kwon commented on SPARK-32187:
--

Thank you so much [~fhoering]!




[jira] [Commented] (SPARK-32187) User Guide - Shipping Python Package

2020-09-09 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17193008#comment-17193008
 ] 

Fabian Höring commented on SPARK-32187:
---

Yes, I'm back. Sorry, I had to deal with other things first.

OK. I agree with the ideas (copy-and-pastable examples that work, order of 
sections). I will take your comments into account and open a PR.

 




[jira] [Commented] (SPARK-32187) User Guide - Shipping Python Package

2020-09-08 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192607#comment-17192607
 ] 

Hyukjin Kwon commented on SPARK-32187:
--

Hey [~fhoering] are you back now :-)?




[jira] [Commented] (SPARK-32187) User Guide - Shipping Python Package

2020-08-16 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178728#comment-17178728
 ] 

Hyukjin Kwon commented on SPARK-32187:
--

The draft looks good as a start. A couple of comments from my cursory look:

- Let's make sure we have copy-and-pastable examples, and let's try to write 
the de facto standard, given that there are multiple other sites such as 
[http://alkaline-ml.com/2018-07-02-conda-spark/] and 
[https://jcristharif.com/venv-pack/spark.html].
- Let's place the section about shipping zip, egg and .py files at the top, 
and place pex and virtual environments at the bottom. Arguably it is more 
common to simply use the {{--py-files}} option or the {{spark.submit.pyFiles}} 
configuration to ship Python packages.
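
As a minimal sketch of the zip-based route mentioned above: the snippet below packs a small, hypothetical package ({{mylib}}, invented purely for illustration) into a zip and imports it straight from the archive, which is the standard Python mechanism that makes zipped packages usable once the archive is on the Python path. The {{spark_submit_args}} helper only assembles an argument list and runs nothing; the application name {{app.py}} is an assumption.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def build_package_zip(workdir):
    """Create a tiny hypothetical package and zip it; return the zip path."""
    pkg_dir = os.path.join(workdir, "src", "mylib")
    os.makedirs(pkg_dir)
    with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
        f.write("def double(x):\n    return 2 * x\n")
    zip_path = os.path.join(workdir, "mylib.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        # The package directory must sit at the root of the archive.
        zf.write(os.path.join(pkg_dir, "__init__.py"), "mylib/__init__.py")
    return zip_path


def spark_submit_args(zip_path, app="app.py"):
    """Assemble (not run) a spark-submit command that ships the zip."""
    return ["spark-submit", "--py-files", zip_path, app]


workdir = tempfile.mkdtemp()
zip_path = build_package_zip(workdir)

# Putting the zip on sys.path makes the package importable from the archive.
sys.path.insert(0, zip_path)
mylib = importlib.import_module("mylib")
result = mylib.double(21)
print(result)
print(" ".join(spark_submit_args(zip_path)))
```

The same archive path could be listed in {{spark.submit.pyFiles}} instead of being passed with {{--py-files}}.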

Let's open a PR and loop in other committers to get more reviews. Shipping 
packages is a somewhat hairy area, and there are many other committers who have 
better insight than me, in particular about other cluster managers such as 
Mesos, Kubernetes, etc.

As for referencing your own stuff, it looks fine. It's okay to mention things 
as an FYI reference.

{quote}
there is no way to set the archives as a config param when not running on YARN. 
I checked the doc and the spark code. So it seems inconsistent. Can you check 
or confirm ?
{quote}

Yes, I think that's correct, to the best of my knowledge.

SPARK-13587 was not merged, so PySpark does not support it yet. So yes, it 
would not be in the doc, at least for now.


> User Guide - Shipping Python Package
> 
>
> Key: SPARK-32187
> URL: https://issues.apache.org/jira/browse/SPARK-32187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> - Zipped file
> - Python files
> - PEX \(?\) (see also SPARK-25433)






[jira] [Commented] (SPARK-32187) User Guide - Shipping Python Package

2020-08-14 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177765#comment-17177765
 ] 

Fabian Höring commented on SPARK-32187:
---

About this ticket: https://issues.apache.org/jira/browse/SPARK-13587 and those 
settings:

spark-submit --deploy-mode cluster --master yarn \
  --py-files parallelisation_hack-0.1-py2.7.egg \
  --conf spark.pyspark.virtualenv.enabled=true \
  --conf spark.pyspark.virtualenv.type=native \
  --conf spark.pyspark.virtualenv.requirements=requirements.txt \
  --conf spark.pyspark.virtualenv.bin.path=virtualenv \
  --conf spark.pyspark.python=python3 \
  pyspark_poc_runner.py

I don't know whether they still work, but personally I would close the ticket 
and not put this in the doc. I think it is not the right way to do it, as it 
doesn't scale to 100 executors and can produce race conditions between tasks 
running on the same executor (multiple pip installs at the same time on the 
same node).

 




[jira] [Commented] (SPARK-32187) User Guide - Shipping Python Package

2020-08-14 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177760#comment-17177760
 ] 

Fabian Höring commented on SPARK-32187:
---

[~hyukjin.kwon]
 I started working on it. The new doc looks pretty nice! Thanks for the effort 
on this. 
 I think I can also write about py-files and zipped envs.

Here is a first (in-progress) draft. I will make it consistent across the 
examples. All links target the current doc.
 
[https://github.com/fhoering/spark/commit/843b1caa27594bc4bc3cb9637da6f8695db66fbe]
 I will be on holiday for 2 weeks, so there will be no progress during that 
time. It would be nice if you had time to have a look and give some feedback on 
the comments below.

Some considerations:

It is structured around the vectorized udf example:
 - Using PEX
 - Using a zipped virtual environment
 - Using py files
 - What about the Spark jars ?

I reference these external tools (I have no affiliation with any of them):
 - [https://github.com/pantsbuild/pex]
 - [https://conda.github.io/conda-pack/spark.html] => as far as I know, this 
seems to be the only option for conda for now
 - [https://jcristharif.com/venv-pack/spark.html] => it handles zipping a venv; 
personally I would recommend using pex because it is self-contained, but I 
added it for completeness

I also referenced my Docker-based Spark standalone end-to-end example. I don't 
really want to promote my own stuff here, but I think it could help people to 
have something that runs directly, since the examples always strip some code. 
If you think it should not be there, we can remove it. I also don't mind moving 
it to the Spark repo.

Some stuff I'm not sure about:
{quote}The unzip will be done by Spark when using target ``--archives`` option 
in spark-submit 
 or setting ``spark.yarn.dist.archives`` configuration.
{quote}
It seems like there is no way to set the archives as a config param when not 
running on YARN. I checked the doc and the Spark code. So it seems 
inconsistent. Can you check or confirm?
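
For context, the pattern described in the conda-pack and venv-pack pages linked above is to ship a packed environment as an archive and point the Python interpreter at the unpacked directory. The sketch below only builds the environment variables and the argument list and executes nothing. The archive name {{environment.tar.gz}}, the alias {{environment}} and the application {{app.py}} are assumptions for illustration; the {{--conf}} variant uses the YARN-specific {{spark.yarn.dist.archives}} setting, which is the limitation being asked about here.

```python
def archive_submit_command(archive, alias, app, use_conf=False):
    """Return (env, argv) for shipping a packed environment on YARN.

    Nothing is executed; all names passed in are placeholders.
    """
    # "archive#alias" asks for the archive to be unpacked under ./alias.
    spec = f"{archive}#{alias}"
    # Executors should use the interpreter inside the unpacked environment.
    env = {"PYSPARK_PYTHON": f"./{alias}/bin/python"}
    argv = ["spark-submit", "--master", "yarn"]
    if use_conf:
        # YARN-only configuration equivalent of the command-line option.
        argv += ["--conf", f"spark.yarn.dist.archives={spec}"]
    else:
        argv += ["--archives", spec]
    argv.append(app)
    return env, argv


env, argv = archive_submit_command("environment.tar.gz", "environment", "app.py")
print(env["PYSPARK_PYTHON"], " ".join(argv))
```

Keeping the option and the configuration behind one helper makes it easy to compare the two forms when checking how they behave on a given cluster manager.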
{quote}It doesn't allow to add packages built as `Wheels 
<https://www.python.org/dev/peps/pep-0427/>`_ and therefore doesn't allow 
to include dependencies with native code.
{quote}
I think this is the case, but we need to check to be sure the doc doesn't say 
something wrong. I can try adding a wheel and see if it works.
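
The statement above is still unverified at this point in the thread. As a side note for whoever tests it, a wheel's filename (per PEP 427) already indicates whether the wheel carries compiled, platform-specific code. The helper below is a minimal sketch of that check; the wheel names used are hypothetical and it says nothing about what Spark itself accepts.

```python
def wheel_tags(filename):
    """Split a wheel filename into (python_tag, abi_tag, platform_tag)."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    parts = filename[: -len(".whl")].split("-")
    # name-version[-build]-python-abi-platform: 5 or 6 components.
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename: {filename}")
    return tuple(parts[-3:])


def is_pure_python(filename):
    """True when the tags say no compiled, platform-specific code is inside."""
    _python_tag, abi_tag, platform_tag = wheel_tags(filename)
    return abi_tag == "none" and platform_tag == "any"


# Hypothetical wheel names, used only to exercise the helper.
pure = is_pure_python("mylib-0.1-py3-none-any.whl")
native = is_pure_python("mylib_native-0.1-cp38-cp38-manylinux2014_x86_64.whl")
print(pure, native)
```

A wheel tagged {{none-any}} contains only Python files, so it is the natural first candidate when testing whether a wheel can be shipped like a zip.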

It may also be worth adding a sentence about Docker. What is described here is 
essentially the lightweight, Python-native way to do it.




[jira] [Commented] (SPARK-32187) User Guide - Shipping Python Package

2020-08-07 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17173247#comment-17173247
 ] 

Hyukjin Kwon commented on SPARK-32187:
--

Thank you so much [~fhoering].




[jira] [Commented] (SPARK-32187) User Guide - Shipping Python Package

2020-08-07 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17173237#comment-17173237
 ] 

Fabian Höring commented on SPARK-32187:
---

OK. I'll have a look into that next week.




[jira] [Commented] (SPARK-32187) User Guide - Shipping Python Package

2020-08-05 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17171284#comment-17171284
 ] 

Hyukjin Kwon commented on SPARK-32187:
--

[~fhoering], I made an example at SPARK-32507 for reference. Please also see 
https://issues.apache.org/jira/browse/SPARK-31851?focusedCommentId=17171275=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17171275

Would you be able to start writing up a page about PEX?
If you're not used to shipping Python packages with zipped files or .py files, 
you can write only about PEX for now. I can file a separate JIRA for the rest 
if that's better for you.






[jira] [Commented] (SPARK-32187) User Guide - Shipping Python Package

2020-07-06 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151889#comment-17151889
 ] 

Hyukjin Kwon commented on SPARK-32187:
--

FYI, [~fhoering], I filed a JIRA here. Just to give you a bit more context, 
here's the demo I made: https://hyukjin-spark.readthedocs.io/en/latest/
I still need to do some groundwork. I will keep you updated.
