[ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375005#comment-15375005 ]
Semet edited comment on SPARK-16367 at 7/13/16 1:48 PM:
--------------------------------------------------------
I have sent a design doc and pull request. They are deliberately based on [#13599|https://github.com/apache/spark/pull/13599] and the [design doc from SPARK-13587|https://docs.google.com/document/d/1MpURTPv0xLvIWhcJdkc5lDMWYBRJ4zAQ69rP2WA8-TM/edit?usp=sharing].

Pull Request: [#14180|https://github.com/apache/spark/pull/14180]
Design Doc: [Wheel and Virtualenv support|https://docs.google.com/document/d/1oXN7c2xE42-MHhuGqt_i7oeIjAwBI0E9phoNuSfR5Bs/edit?usp=sharing]


was (Author: gae...@xeberon.net):
I have sent a design doc and pull request. They are deliberately based on [#13599|https://github.com/apache/spark/pull/13599] and the [design doc from SPARK-13587|https://docs.google.com/document/d/1MpURTPv0xLvIWhcJdkc5lDMWYBRJ4zAQ69rP2WA8-TM/edit?usp=sharing].

Pull Request: [#14180|https://github.com/apache/spark/pull/14180]

> Wheelhouse Support for PySpark
> ------------------------------
>
>                 Key: SPARK-16367
>                 URL: https://issues.apache.org/jira/browse/SPARK-16367
>             Project: Spark
>          Issue Type: New Feature
>          Components: Deploy, PySpark
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Semet
>              Labels: newbie, python, python-wheel, wheelhouse
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> *Rationale*
> To deploy packages written in Scala, the recommendation is to build big fat jar files. This puts all dependencies into one package, so the only "cost" is the time needed to copy this file to every Spark node.
> On the other hand, Python deployment is more difficult once you want to use external packages, and you don't really want to ask the IT team to deploy the packages into the virtualenv of each node.
>
> *Previous approaches*
> I based the current proposal on the following two issues related to this point:
> - SPARK-6764 ("Wheel support for PySpark")
> - SPARK-13587 ("Support virtualenv in PySpark")
> The first part of my proposal is to merge both, in order to support wheel installation and virtualenv creation.
>
> *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark*
> In Python, the packaging standard is now the "wheel" file format, which goes further than the good old ".egg" files. With a wheel file (".whl"), the package is already built for a given architecture. You can have several wheels for a given package version, each specific to an architecture or environment. For example, look at https://pypi.python.org/pypi/numpy to see all the different wheels available.
> The {{pip}} tool knows how to select the right wheel file matching the current system, and how to install the package very quickly (without compilation). In other words, a package that requires compilation of a C module, for instance "numpy", does *not* compile anything when it is installed from a wheel file.
> {{pypi.python.org}} already provides wheels for the major Python versions. If a wheel is not available, pip will compile the package from source anyway. Mirroring of PyPI is possible through projects such as http://doc.devpi.net/latest/ (untested) or the PyPI mirror support of Artifactory (tested personally).
> {{pip}} also provides the ability to easily generate the wheels of all the packages used by a given project inside a "virtualenv". This is called a "wheelhouse". You can even skip the compilation and retrieve the wheels directly from pypi.python.org.
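> For illustration, here is a minimal sketch of how the wheel selection works (it assumes the third-party {{packaging}} library is installed; it is not part of this proposal): it prints the platform/ABI tags that the running interpreter accepts, which is what {{pip}} matches against the tags encoded in each .whl file name.
> {code}
> # Sketch only: list the tags this interpreter accepts, which is how pip
> # picks the matching .whl out of a wheelhouse.
> # Assumes the third-party "packaging" library is installed.
> from packaging.tags import sys_tags
>
> for tag in list(sys_tags())[:5]:
>     # e.g. "cp27-cp27mu-manylinux1_x86_64" on 64-bit Linux CPython 2.7
>     print("{}-{}-{}".format(tag.interpreter, tag.abi, tag.platform))
> {code}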
> *Use Case 1: no internet connectivity*
> Here is my first proposal for a deployment workflow, for the case where the Spark cluster does not have any internet connectivity or access to a PyPI mirror. In this case the simplest way to deploy a project with several dependencies is to build and then send the complete "wheelhouse":
> - you are writing a PySpark script that keeps growing in size and dependencies. Deploying it on Spark, for example, requires building numpy or Theano and other dependencies
> - to use the "Big Fat Wheelhouse" support of PySpark, you need to turn this script into a standard Python package:
> -- write a {{requirements.txt}}. I recommend pinning all package versions. You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the {{requirements.txt}}:
> {code}
> astroid==1.4.6 # via pylint
> autopep8==1.2.4
> click==6.6 # via pip-tools
> colorama==0.3.7 # via pylint
> enum34==1.1.6 # via hypothesis
> findspark==1.0.0 # via spark-testing-base
> first==2.0.1 # via pip-tools
> hypothesis==3.4.0 # via spark-testing-base
> lazy-object-proxy==1.2.2 # via astroid
> linecache2==1.0.0 # via traceback2
> pbr==1.10.0
> pep8==1.7.0 # via autopep8
> pip-tools==1.6.5
> py==1.4.31 # via pytest
> pyflakes==1.2.3
> pylint==1.5.6
> pytest==2.9.2 # via spark-testing-base
> six==1.10.0 # via astroid, pip-tools, pylint, unittest2
> spark-testing-base==0.0.7.post2
> traceback2==1.4.0 # via unittest2
> unittest2==1.1.0 # via spark-testing-base
> wheel==0.29.0
> wrapt==1.10.8 # via astroid
> {code}
> -- write a {{setup.py}} with some entry points or packages. Use [PBR|http://docs.openstack.org/developer/pbr/]: it makes the job of maintaining a {{setup.py}} file really easy (see the minimal sketch right after this list)
> -- create a virtualenv if you are not already in one:
> {code}
> virtualenv env
> {code}
> -- work in your environment, define the requirements you need in {{requirements.txt}}, and do all the {{pip install}} you need
> - create the wheelhouse for your current project:
> {code}
> pip install wheel
> pip wheel . --wheel-dir wheelhouse
> {code}
> This can take some time, but at the end you have all the .whl files required *for your current system* in a {{wheelhouse}} directory.
> - zip it into a {{wheelhouse.zip}}.
> Note that your own package (for instance 'my_package') can also be built as a wheel and thus be installed by {{pip}} automatically.
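> As announced above, a minimal sketch of what such a PBR-based {{setup.py}} looks like (the actual metadata, entry points and package list live in a companion {{setup.cfg}}; the names here are only an example):
> {code}
> # setup.py -- minimal PBR layout (sketch). With PBR, name, version,
> # requirements and entry points are declared in setup.cfg, not here.
> import setuptools
>
> setuptools.setup(
>     setup_requires=['pbr'],
>     pbr=True,
> )
> {code}
> Running {{pip wheel .}} on such a package produces the 'my_package' wheel together with the wheels of its dependencies.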
> Now comes the time to submit the project:
> {code}
> bin/spark-submit --master master --deploy-mode client --files /path/to/virtualenv/requirements.txt,/path/to/virtualenv/wheelhouse.zip --conf "spark.pyspark.virtualenv.enabled=true" ~/path/to/launcher_script.py
> {code}
> You can see that:
> - no extra argument is added to the command line. All configuration goes through the {{--conf}} argument (this has been taken directly from SPARK-13587). According to the history of the Spark source code, I guess the goal is to simplify the maintenance of the various command line interfaces by avoiding too many specific arguments
> - the wheelhouse deployment is triggered by the {{\-\-conf "spark.pyspark.virtualenv.enabled=true"}} argument. The {{requirements.txt}} and {{wheelhouse.zip}} are copied through {{--files}}. The names of both files can be changed through {{\-\-conf}} arguments. I guess with proper documentation this should not be a problem
> - you still need to give the paths to {{requirements.txt}} and {{wheelhouse.zip}} (they will be automatically copied to each node). This is important since it allows {{pip install}}, running on each node, to pick only the wheels it needs. For example, if you have a package compiled for both 32 bits and 64 bits, you will have 2 wheels, and on each node {{pip}} will select only the right one
> - I have chosen to keep the script at the end of the command line, but for me it is just a launcher script; it can be just a few lines:
> {code}
> #!/usr/bin/env python
> from mypackage import run
> run()
> {code}
> - on each node, a new virtualenv is created *at each deployment*. This has a cost, but not that much, since {{pip install}} will only install wheels; no compilation nor internet connection will be required. The command line for installing the wheels on each node will look like:
> {code}
> pip install --no-index --find-links=/path/to/node/wheelhouse -r requirements.txt
> {code}
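> To make this per-node behaviour concrete, here is a rough, hypothetical sketch of the bootstrap step (illustrative only, this is not the code of the pull request; the function and path names are made up):
> {code}
> # Hypothetical sketch of the per-node bootstrap: create a throwaway
> # virtualenv and install the shipped wheelhouse into it, fully offline.
> import os
> import subprocess
>
> def bootstrap_virtualenv(workdir, wheelhouse_dir, requirements):
>     env_dir = os.path.join(workdir, "pyspark_venv")
>     subprocess.check_call(["virtualenv", env_dir])
>     pip = os.path.join(env_dir, "bin", "pip")
>     # --no-index: only the wheels from the unpacked wheelhouse are used.
>     subprocess.check_call([pip, "install", "--no-index",
>                            "--find-links", wheelhouse_dir,
>                            "-r", requirements])
>     # Python interpreter the executor would use afterwards.
>     return os.path.join(env_dir, "bin", "python")
> {code}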
> *advantages*
> - quick installation, since there is no compilation
> - works without Internet connectivity: no need to mess with the corporate proxy or to require a local mirror of PyPI
> - package version isolation (two Spark jobs can depend on two different versions of a given library)
> *disadvantages*
> - creating a virtualenv at each execution takes time; not that much, but it can still take some seconds
> - and some disk space
> - slightly more complex to set up than sending a simple Python script, but that feature is not lost
> - support of heterogeneous Spark nodes (ex: 32 bits, 64 bits) is possible, but one has to send all wheel flavours and ensure pip is able to install in every environment. The complexity of this task is in the hands of the developer and no longer of the IT persons! (IMHO, this is an advantage)
>
> *Use Case 2: the Spark cluster has access to PyPI or a mirror of PyPI*
> This is the more elegant situation. The Spark cluster (each node) can install the dependencies of your project on its own, from the wheels provided by PyPI. Your internal dependencies and your job project can come as independent wheel files as well. In this case the workflow is much simpler:
> - turn your project into a Python package
> - write {{requirements.txt}} and {{setup.py}} like in Use Case 1
> - create the wheels with {{pip wheel}}. But now we will not send *ALL* the dependencies, only the ones that are not on PyPI (the current job project, other internal dependencies, etc.)
> - no need to create a wheelhouse. You can still copy the wheels either with {{--py-files}} (they will be automatically installed) or inside a wheelhouse archive named {{wheelhouse.zip}}
> Deployment becomes:
> {code}
> bin/spark-submit --master master --deploy-mode client --files /path/to/project/requirements.txt --py-files /path/to/project/internal_dependency_1.whl,/path/to/project/internal_dependency_2.whl,/path/to/project/current_project.whl --conf "spark.pyspark.virtualenv.enabled=true" --conf "spark.pyspark.virtualenv.index_url=http://pypi.mycompany.com/" ~/path/to/launcher_script.py
> {code}
> or with a wheelhouse that only contains the internal dependencies and the current project's wheels:
> {code}
> bin/spark-submit --master master --deploy-mode client --files /path/to/project/requirements.txt,/path/to/project/wheelhouse.zip --conf "spark.pyspark.virtualenv.enabled=true" --conf "spark.pyspark.virtualenv.index_url=http://pypi.mycompany.com/" ~/path/to/launcher_script.py
> {code}
> or, if you want to use the official PyPI or have configured {{pip.conf}} to hit the internal PyPI mirror (see the documentation below):
> {code}
> bin/spark-submit --master master --deploy-mode client --files /path/to/project/requirements.txt,/path/to/project/wheelhouse.zip --conf "spark.pyspark.virtualenv.enabled=true" ~/path/to/launcher_script.py
> {code}
> On each node, the deployment will be done with a command such as:
> {code}
> pip install --index-url http://pypi.mycompany.com --find-links=/path/to/node/wheelhouse -r requirements.txt
> {code}
> Notes:
> - {{\-\-conf "spark.pyspark.virtualenv.index_url=http://pypi.mycompany.com/"}} allows specifying a PyPI mirror, for example a mirror internal to your company network. If it is not provided, the default PyPI index (pypi.python.org) will be used
> - to send a wheelhouse, use {{\-\-files}}. To send individual wheels, use {{\-\-py-files}}. With the latter, all wheels will be installed. For a multi-architecture cluster, prepare all the needed wheels for every architecture and use a wheelhouse archive; this allows {{pip}} to choose the right version of each wheel automatically.
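> To summarize how the two use cases differ on the node side, here is a small, purely illustrative sketch of how the proposed {{spark.pyspark.virtualenv.*}} settings could be mapped to the {{pip}} invocation (the function name and the {{offline}} flag are hypothetical, not part of the pull request):
> {code}
> # Hypothetical sketch: assemble the node-side "pip install" command from
> # the proposed configuration. Names and defaults are illustrative only.
> def build_pip_command(requirements, wheelhouse_dir=None, index_url=None,
>                       offline=False):
>     cmd = ["pip", "install", "-r", requirements]
>     if offline:
>         # Use Case 1: no connectivity, everything comes from the wheelhouse.
>         cmd.append("--no-index")
>     elif index_url:
>         # Use Case 2: missing dependencies are fetched from the mirror given
>         # by spark.pyspark.virtualenv.index_url.
>         cmd += ["--index-url", index_url]
>     # Otherwise pip falls back to its default index or ~/.pip/pip.conf.
>     if wheelhouse_dir:
>         cmd += ["--find-links", wheelhouse_dir]
>     return cmd
> {code}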
> *code submission*
> I have already started working on this point, starting by merging the 2 pull requests [#5408|https://github.com/apache/spark/pull/5408] and [#13599|https://github.com/apache/spark/pull/13599]. I'll upload a patch ASAP for review.
> I see two major open questions:
> - I don't know YARN or Mesos that well, so I might need some help for the final integration
> - the documentation should really be carefully crafted so users are not lost in all these concepts
> I really think having this "wheelhouse" support will help using, maintaining, and evolving Python scripts on Spark. Python has a rich set of mature libraries, and Spark should do anything it can to help developers easily access and use them in their everyday job.
>
> *Important notes about complex packages such as numpy*
> Numpy is the kind of package that takes several minutes to install when it is built from source, and we want to avoid having all nodes compile it each time. PyPI provides several precompiled wheels, but it may happen that the wheels are not right for your platform or the platform of your cluster.
> Wheels are *not* cached for pip versions < 7.0. From pip 7.0 onwards, wheels are automatically cached when built (if needed), so the first installation might take some time, but afterwards the installation will be straightforward. On most of my machines, numpy is installed without any compilation thanks to wheels.
>
> *Certificate*
> pip does not use the system SSL certificates. If you use a local PyPI mirror behind HTTPS with an internal certificate, you'll have to set up pip correctly with the following content in {{~/.pip/pip.conf}}:
> {code}
> [global]
> cert = /path/to/your/internal/certificates.pem
> {code}
> The first build might take some time, but pip will automatically cache the wheel for your system in {{~/.cache/pip/wheels}}. You can of course recreate the wheel with {{pip wheel}} or find it in {{~/.cache/pip/wheels}}. You can use {{pip -v install numpy}} to see where pip has placed the wheel in its cache.
> If you use Artifactory, you can upload your wheels to a local, central cache that can be shared across all your slaves. See [this documentation|https://www.jfrog.com/confluence/display/RTF/PyPI+Repositories#PyPIRepositories-LocalRepositories] for how this works. This way, you can insert wheels into this local cache and they will be seen as if they had been uploaded to the official repository (a local cache and a remote cache can be "merged" into a virtual repository with Artifactory).
>
> *Set use of internal pypi mirror*
> Ask your IT to update the {{~/.pip/pip.conf}} of each node to point by default to the internal mirror:
> {code}
> [global]
> ; Low timeout
> timeout = 20
> index-url = https://<user>:<pass>@pypi.mycompany.org/
> {code}
> Now there is no more need to specify {{\-\-conf "spark.pyspark.virtualenv.index_url=http://pypi.mycompany.com/"}} in your spark-submit command line.
> Note: this will not work when installing a package with the {{python setup.py install}} syntax. In that case you need to update {{~/.pypirc}} and use the {{-r}} argument. This syntax is not used by spark-submit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org