[ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Semet updated SPARK-16367:
--------------------------
    Description: 
*Rationale* 
To deploy packages written in Scala, it is recommended to build big fat jar files. This puts all dependencies in one package, so the only "cost" is the copy time needed to deploy this file on every Spark node. 

On the other hand, Python deployment is more difficult once you want to use external packages, and you don't really want to ask IT to deploy the packages into the virtualenv of each node. 

This ticket proposes to give users the ability to deploy their job as "wheel" packages. The Python community strongly advocates this way of packaging and distributing Python applications as the standard way of deploying a Python app. In other words, this is the "Pythonic way of deployment".

*Previous approaches* 
I based the current proposal on the following two tickets related to this point: 
- SPARK-6764 ("Wheel support for PySpark") 
- SPARK-13587 ("Support virtualenv in PySpark")

The first part of my proposal is to merge the two, in order to support both wheel installation and virtualenv creation. 

*Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* 
In Python, the packaging standard is now the "wheel" file format, which goes further than the good old ".egg" files. With a wheel file (".whl"), the package is already built for a given architecture. You can have several wheels for a given package version, each specific to an architecture or environment. 

For example, look at https://pypi.python.org/pypi/numpy to see all the different wheels available. 

The {{pip}} tool knows how to select the right wheel file matching the current system, and how to install the package very quickly (without compilation). In other words, a package that requires compilation of a C module, for instance "numpy", does *not* compile anything when installed from a wheel file. 

{{pypi.python.org}} already provides wheels for the major Python versions. If a wheel is not available, pip will compile the package from source anyway. Mirroring of Pypi is possible through projects such as http://doc.devpi.net/latest/ (untested) or the Pypi mirror support in Artifactory (tested personally). 

{{pip}} also makes it easy to generate the wheels of all packages used by a given project inside a "virtualenv". This is called a "wheelhouse". You can even skip the compilation entirely and retrieve the wheels directly from pypi.python.org. 

*Use Case 1: no internet connectivity* 
Here is my first proposal for a deployment workflow, for the case where the Spark cluster does not have any internet connectivity or access to a Pypi mirror. In this case the simplest way to deploy a project with several dependencies is to build and then send the complete "wheelhouse": 

- you are writing a PySpark script that keeps growing in size and dependencies. Deploying it on Spark, for example, requires building numpy or Theano and other dependencies 
- to use the "Big Fat Wheelhouse" support of PySpark, you need to turn this script into a standard Python package: 
-- write a {{requirements.txt}}. I recommend pinning all package versions. You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the {{requirements.txt}}: 
{code} 
astroid==1.4.6 # via pylint 
autopep8==1.2.4 
click==6.6 # via pip-tools 
colorama==0.3.7 # via pylint 
enum34==1.1.6 # via hypothesis 
findspark==1.0.0 # via spark-testing-base 
first==2.0.1 # via pip-tools 
hypothesis==3.4.0 # via spark-testing-base 
lazy-object-proxy==1.2.2 # via astroid 
linecache2==1.0.0 # via traceback2 
pbr==1.10.0 
pep8==1.7.0 # via autopep8 
pip-tools==1.6.5 
py==1.4.31 # via pytest 
pyflakes==1.2.3 
pylint==1.5.6 
pytest==2.9.2 # via spark-testing-base 
six==1.10.0 # via astroid, pip-tools, pylint, unittest2 
spark-testing-base==0.0.7.post2 
traceback2==1.4.0 # via unittest2 
unittest2==1.1.0 # via spark-testing-base 
wheel==0.29.0 
wrapt==1.10.8 # via astroid 
{code} 
-- write a {{setup.py}} with your entry points and packages. Use [PBR|http://docs.openstack.org/developer/pbr/]; it makes the job of maintaining a {{setup.py}} file really easy (a minimal sketch is shown below, after this list) 
-- create a virtualenv if not already in one: 
{code} 
virtualenv env 
{code} 
-- work in your environment, define the requirements you need in {{requirements.txt}}, and do all the {{pip install}} you need 
- create the wheelhouse for your current project: 
{code} 
pip install wheel 
pip wheel . --wheel-dir wheelhouse 
{code} 
This can take some time, but at the end you will have all the .whl files required *for your current system* in a {{wheelhouse}} directory. 
- zip it into a {{wheelhouse.zip}}. 

Note that your own package (for instance 'my_package') can be built into a wheel as well, and thus installed by {{pip}} automatically. 
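
For illustration, a minimal PBR-based {{setup.py}} could look like the following (the package name 'my_package' and its metadata live in {{setup.cfg}}; both are only examples, not part of this proposal): 
{code} 
# setup.py -- minimal when using PBR: all metadata (name, version, entry
# points) lives in setup.cfg. "my_package" is only an example name.
from setuptools import setup

setup(
    setup_requires=["pbr"],
    pbr=True,
)
{code} 
Running {{pip wheel .}} on such a project produces a wheel for your own package that lands in the wheelhouse next to the third-party wheels. 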

Now comes the time to submit the project: 
{code} 
bin/spark-submit --master master --deploy-mode client --files 
/path/to/virtualenv/requirements.txt,/path/to/virtualenv/wheelhouse.zip --conf 
"spark.pyspark.virtualenv.enabled=true" ~/path/to/launcher_script.py 
{code} 

You can see that: 
- no extra argument is added to the command line. All configuration goes through the {{--conf}} argument (this has been taken directly from SPARK-13587). According to the history of the Spark source code, I guess the goal is to simplify the maintenance of the various command line interfaces by avoiding too many specific arguments. 
- the wheelhouse deployment is triggered by the {{\-\-conf "spark.pyspark.virtualenv.enabled=true"}} argument. The {{requirements.txt}} and {{wheelhouse.zip}} are copied through {{--files}}. The names of both files can be changed through {{\-\-conf}} arguments; I guess with proper documentation this should not be a problem. 
- you still need to provide the paths to {{requirements.txt}} and {{wheelhouse.zip}} (they will be automatically copied to each node). This is important since it allows the {{pip install}} running on each node to pick only the wheels it needs. For example, if you have a package compiled for both 32 bits and 64 bits, you will have 2 wheels, and on each node {{pip}} will select only the right one. 
- I have chosen to keep the script at the end of the command line, but to me it is just a launcher script; it can be just a few lines: 
{code} 
#!/usr/bin/env python 

from mypackage import run 
run() 
{code} 
- on each node, a new virtualenv is created *at each deployment*. This has a cost, but not a big one, since {{pip install}} will only install wheels; no compilation or internet connection is required. The command for installing the wheels on each node will look like the following (a sketch of this per-node bootstrap is given after the list): 
{code} 
pip install --no-index --find-links=/path/to/node/wheelhouse -r 
requirements.txt 
{code} 
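
For illustration only, here is a rough sketch in Python of what this per-node bootstrap could do; the function and path names are hypothetical and do not come from the actual patch: 
{code} 
# Hypothetical sketch of the per-node virtualenv bootstrap described above;
# the real integration would live in the PySpark launch code.
import os
import subprocess
import tempfile


def bootstrap_virtualenv(requirements_txt, wheelhouse_dir):
    """Create a throwaway virtualenv and install the shipped wheels into it."""
    env_dir = tempfile.mkdtemp(prefix="pyspark-virtualenv-")
    subprocess.check_call(["virtualenv", env_dir])
    pip = os.path.join(env_dir, "bin", "pip")
    # Offline install: only the wheels found in the wheelhouse are considered.
    subprocess.check_call([
        pip, "install", "--no-index",
        "--find-links", wheelhouse_dir,
        "-r", requirements_txt,
    ])
    # The worker would then run the job with the virtualenv's interpreter.
    return os.path.join(env_dir, "bin", "python")
{code} 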

*advantages* 
- quick installation, since there is no compilation 
- works without Internet connectivity; no need to mess with the corporate proxy or set up a local mirror of Pypi 
- package version isolation (two Spark jobs can depend on two different versions of a given library) 

*disadvantages* 
- creating a virtualenv at each execution takes time; not that much, but still a few seconds 
- it also takes disk space 
- slightly more complex to set up than sending a simple Python script, but that simple workflow is not lost 
- support for heterogeneous Spark nodes (e.g. 32 bits and 64 bits) is possible, but one has to send all wheel flavours and ensure pip is able to install in every environment. The complexity of this task is in the hands of the developer and no longer of the IT staff! (IMHO, this is an advantage) 

*Use Case 2: the Spark cluster has access to Pypi or a mirror of Pypi* 

This is the more elegant situation. Each node of the Spark cluster can install the dependencies of your project on its own, from the wheels provided by Pypi. Your internal dependencies and your job project can come as independent wheel files as well. In this case the workflow is much simpler: 

- turn your project into a Python package 
- write {{requirements.txt}} and {{setup.py}} like in Use Case 1 
- create the wheels with {{pip wheel}}. But now we will not send *ALL* the dependencies, only the ones that are not on Pypi (the current job project, other internal dependencies, etc.) 
- there is no need to create a full wheelhouse. You can still ship the wheels either with {{--py-files}} (they will be automatically installed) or inside a wheelhouse archive named {{wheelhouse.zip}} (see the sketch after this list) 
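
As a sketch of that last option, the internal wheels (built for example with {{pip wheel --no-deps}}) can be packed into {{wheelhouse.zip}} with a few lines of Python; the {{dist/}} directory used here is just an example location for the locally built wheels: 
{code} 
# Pack only the internal/project wheels into wheelhouse.zip; the public
# dependencies will be resolved from the Pypi mirror at install time.
import glob
import os
import zipfile

with zipfile.ZipFile("wheelhouse.zip", "w") as archive:
    for whl in glob.glob("dist/*.whl"):
        archive.write(whl, arcname=os.path.basename(whl))
{code} 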

Deployment then becomes: 
{code} 
bin/spark-submit --master master --deploy-mode client --files 
/path/to/project/requirements.txt --py-files 
/path/to/project/internal_dependency_1.whl,/path/to/project/internal_dependency_2.whl,/path/to/project/current_project.whl
 --conf "spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.index_url=http://pypi.mycompany.com/"; 
~/path/to/launcher_script.py 
{code} 

or with a wheelhouse that only contains internal dependencies and current 
project wheels: 

{code} 
bin/spark-submit --master master --deploy-mode client --files 
/path/to/project/requirements.txt,/path/to/project/wheelhouse.zip --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.index_url=http://pypi.mycompany.com/"; 
~/path/to/launcher_script.py 
{code} 

or if you want to use the official Pypi, or have configured {{pip.conf}} to hit the internal Pypi mirror (see the documentation below): 

{code} 
bin/spark-submit --master master --deploy-mode client --files 
/path/to/project/requirements.txt,/path/to/project/wheelhouse.zip --conf 
"spark.pyspark.virtualenv.enabled=true" ~/path/to/launcher_script.py 
{code} 

On each node, the deployment will be done with a command such as: 
{code} 
pip install --index-url http://pypi.mycompany.com 
--find-links=/path/to/node/wheelhouse -r requirements.txt 
{code} 

Notes: 

- {{\-\-conf "spark.pyspark.virtualenv.index_url=http://pypi.mycompany.com/"}} allows specifying a Pypi mirror, for example a mirror internal to your company network. If not provided, the default Pypi index (pypi.python.org) will be used 
- to send a wheelhouse, use {{\-\-files}}. To send individual wheels, use {{\-\-py-files}}; with the latter, all wheels will be installed. For a cluster with multiple architectures, prepare all the needed wheels for every architecture and use a wheelhouse archive; this lets {{pip}} choose the right wheel automatically. 

*code submission* 
I have already started working on this point, beginning with merging the 2 merge requests [#5408|https://github.com/apache/spark/pull/5408] and [#13599|https://github.com/apache/spark/pull/13599]. 
I'll upload a patch ASAP for review. 
I see two major open questions: 
- I don't know YARN or Mesos that well, so I might need some help for the final integration 
- the documentation should be carefully crafted so users are not lost in all these concepts 

I really think having this "wheelhouse" support will help with using, maintaining, and evolving Python scripts on Spark. Python has a rich set of mature libraries, and Spark should do everything it can to help developers easily access and use them in their everyday jobs. 

*Important notes about complex packages such as numpy* 

Numpy is the kind of package that takes several minutes to build, and we want to avoid having every node rebuild it each time. Pypi provides several precompiled wheels, but it may happen that none of them is right for your platform or for the platform of your cluster. 

Wheels are *not* cached for pip versions < 7.0. From pip 7.0 onwards, wheels are automatically cached when built (if a build is needed), so the first installation might take some time, but subsequent installations are straightforward.

On most of my machines, numpy installs without any compilation thanks to wheels.

*Certificates* 

pip does not use the system SSL certificates. If you use a local Pypi mirror behind HTTPS with an internal certificate, you'll have to configure pip accordingly with the following content in {{~/.pip/pip.conf}}: 
{code} 
[global] 
cert = /path/to/your/internal/certificates.pem 
{code} 

The first build might take some time, but pip will automatically cache the wheel for your system in {{~/.cache/pip/wheels}}. You can of course recreate the wheel with {{pip wheel}}, or find it in {{~/.cache/pip/wheels}}. You can use {{pip -v install numpy}} to see where it has placed the wheel in the cache. 
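
For instance, a quick way to list the numpy wheels pip has cached (the cache layout uses hashed subdirectories, hence the recursive walk): 
{code} 
# List the numpy wheels pip has cached under ~/.cache/pip/wheels.
import fnmatch
import os

cache = os.path.expanduser("~/.cache/pip/wheels")
for root, _dirs, files in os.walk(cache):
    for name in fnmatch.filter(files, "numpy-*.whl"):
        print(os.path.join(root, name))
{code} 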

If you use Artifactory, you can upload your wheels to a local, central cache that can be shared across all your slaves. See [this documentation|https://www.jfrog.com/confluence/display/RTF/PyPI+Repositories#PyPIRepositories-LocalRepositories] to see how this works. This way, you can insert wheels into this local cache and they will be seen as if they had been uploaded to the official repository (the local cache and the remote cache can be "merged" into a virtual repository with Artifactory). 

*Using an internal Pypi mirror by default* 
Ask your IT to update the {{~/.pip/pip.conf}} of each node so it points to the internal mirror by default: 
{code} 
[global] 
; Low timeout 
timeout = 20 
index-url = https://<user>:<pass>@pypi.mycompany.org/ 
{code} 

Now there is no more need to specify {{\-\-conf "spark.pyspark.virtualenv.index_url=http://pypi.mycompany.com/"}} in your spark-submit command line. 

Note: this will not work when installing a package with the {{python setup.py install}} syntax. In that case you need to update {{~/.pypirc}} and use the {{-r}} argument. That syntax is not used by spark-submit.

> Wheelhouse Support for PySpark
> ------------------------------
>
>                 Key: SPARK-16367
>                 URL: https://issues.apache.org/jira/browse/SPARK-16367
>             Project: Spark
>          Issue Type: New Feature
>          Components: Deploy, PySpark
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Semet
>              Labels: newbie, python, python-wheel, wheelhouse
>   Original Estimate: 168h
>  Remaining Estimate: 168h



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
