[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark
[ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Semet updated SPARK-16367:
--
Description:

*Rationale*

The recommended way to deploy packages written in Scala is to build big fat jar files. All dependencies ship in a single package, so the only "cost" is the time needed to copy that file to every Spark node. Python deployment, on the other hand, becomes difficult as soon as you want to use external packages, and you do not really want to involve IT to install those packages into the virtualenv of every node.

This ticket proposes to let users deploy their jobs as "wheel" packages. The Python community strongly advocates this way of packaging and distributing Python applications as the standard one; in other words, it is the "Pythonic way of deployment".

*Previous approaches*

This proposal builds on the following two tickets:
- SPARK-6764 ("Wheel support for PySpark")
- SPARK-13587 ("Support virtualenv in PySpark")

The first part of my proposal is to merge them, in order to support both wheel installation and virtualenv creation.

*Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark*

In Python, the packaging standard is now the "wheel" file format, which goes further than the good old ".egg" files. With a wheel file (".whl"), the package is already built for a given architecture. A single package version can have several wheels, each specific to an architecture or environment; see for example all the wheel variants available for https://pypi.python.org/pypi/numpy. {{pip}} knows how to select the right wheel file for the current system and installs the package very quickly, without any compilation. In other words, a package that requires compiling a C module, for instance "numpy", does *not* compile anything when installed from a wheel file. {{pypi.python.org}} already provides wheels for the major Python versions; if no wheel is available, pip falls back to building the package from source. Pypi can be mirrored with projects such as http://doc.devpi.net/latest/ (untested) or the Pypi mirror support of Artifactory (tested personally), for example in companies with unusual internet proxy settings or when you want to isolate your Spark cluster from the web.

{{pip}} also makes it easy to build the wheels of all packages used by a given project inside a "virtualenv". The result is called a "wheelhouse". You can even skip the compilation step entirely and retrieve the wheels directly from pypi.python.org.

*Use Case 1: no internet connectivity*

Here is my first proposed deployment workflow, for the case where the Spark cluster has no internet connectivity and no access to a Pypi mirror. In this case, the simplest way to deploy a project with several dependencies is to build and then ship the complete "wheelhouse":

- You are writing a PySpark script that keeps growing in size and number of dependencies; deploying it on Spark would, for example, require building numpy or Theano and other dependencies on every node.
- To use the "Big Fat Wheelhouse" support of PySpark, you need to turn this script into a standard Python package:
-- Write a {{requirements.txt}}. I recommend pinning every package version.
You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the {{requirements.txt}} (a usage sketch is given after this description):
{code}
astroid==1.4.6            # via pylint
autopep8==1.2.4
click==6.6                # via pip-tools
colorama==0.3.7           # via pylint
enum34==1.1.6             # via hypothesis
findspark==1.0.0          # via spark-testing-base
first==2.0.1              # via pip-tools
hypothesis==3.4.0         # via spark-testing-base
lazy-object-proxy==1.2.2  # via astroid
linecache2==1.0.0         # via traceback2
pbr==1.10.0
pep8==1.7.0               # via autopep8
pip-tools==1.6.5
py==1.4.31                # via pytest
pyflakes==1.2.3
pylint==1.5.6
pytest==2.9.2             # via spark-testing-base
six==1.10.0               # via astroid, pip-tools, pylint, unittest2
spark-testing-base==0.0.7.post2
traceback2==1.4.0         # via unittest2
unittest2==1.1.0          # via spark-testing-base
wheel==0.29.0
wrapt==1.10.8             # via astroid
{code}
-- Write a {{setup.py}} with some entry points or packages. Use [PBR|http://docs.openstack.org/developer/pbr/]; it makes maintaining a {{setup.py}} file really easy (a minimal sketch is shown below, after the submit example).
-- Create a virtualenv if you are not already inside one:
{code}
virtualenv env
{code}
-- Work in your environment, declare the requirements you need in {{requirements.txt}}, and run all the {{pip install}} commands you need.
- Create the wheelhouse for your current project:
{code}
pip install wheel
pip wheel . --wheel-dir wheelhouse
{code}
This can take some time, but at the end you have all the .whl files required *for your current system* in a {{wheelhouse}} directory.
- Zip it into a {{wheelhouse.zip}}. Note that your own package (for instance 'my_package') is also built into a wheel and will therefore be installed by {{pip}} automatically.

Now comes the time to submit the project:
{code}
bin/spark-submit --master master --deploy-mode client \
    --files /path/to/virtualenv/requirements.txt,/path/to/virtualenv/wheelhouse.zip \
    --conf "spark.pyspark.virtualenv.enabled=true" \
    ~/path/to/launcher_script.py
{code}
You can see that:
- no extra argument is added on the command line; all configuration goes through {{--conf}}
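For completeness, the pip-tools workflow mentioned above typically looks like the following sketch. It assumes your direct dependencies are listed in a {{requirements.in}} file (pip-tools' default input; the file name and content here are illustrative):
{code}
# Install pip-tools inside the virtualenv, then pin every transitive
# dependency into requirements.txt (this produces the "# via ..."
# annotations shown in the example above).
pip install pip-tools
pip-compile requirements.in --output-file requirements.txt
{code}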
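As promised above, here is a minimal {{setup.py}} of the kind this packaging step assumes when using PBR. It is only a sketch: with PBR, the actual metadata (package name, version, packages, entry points) lives in a {{setup.cfg}} file next to it, which is not shown here:
{code}
# Minimal setup.py for a PBR-managed project: all metadata (name,
# version, packages, console_scripts entry points) is read from
# setup.cfg at build time, so this file rarely needs to change.
from setuptools import setup

setup(
    setup_requires=['pbr'],
    pbr=True,
)
{code}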
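The zip step itself is not spelled out above; one possible way to do it (paths illustrative) is to archive the wheels at the top level of the zip, so the extracted directory can later serve as a local wheel index (e.g. with pip's {{--find-links}}):
{code}
# Archive every wheel produced by "pip wheel" into wheelhouse.zip,
# ready to be shipped alongside requirements.txt via --files.
cd wheelhouse
zip -r ../wheelhouse.zip .
cd ..
{code}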
[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark
[ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semet updated SPARK-16367: -- Description: *Rational* Is it recommended, in order to deploying Scala packages written in Scala, to build big fat jar files. This allows to have all dependencies on one package so the only "cost" is copy time to deploy this file on every Spark Node. On the other hand, Python deployment is more difficult once you want to use external packages, and you don't really want to mess with the IT to deploy the packages on the virtualenv of each nodes. This ticket proposes to allow users the ability to deploy their job as "Wheels" packages. The Python community is strongly advocating to promote this way of packaging and distributing Python application as a "standard way of deploying Python App". In other word, this is the "Pythonic Way of Deployment". *Previous approaches* I based the current proposal over the two following bugs related to this point: - SPARK-6764 ("Wheel support for PySpark") - SPARK-13587("Support virtualenv in PySpark") First part of my proposal was to merge, in order to support wheels install and virtualenv creation *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* In Python, the packaging standard is now the "wheels" file format, which goes further that good old ".egg" files. With a wheel file (".whl"), the package is already prepared for a given architecture. You can have several wheels for a given package version, each specific to an architecture, or environment. For example, look at https://pypi.python.org/pypi/numpy all the different version of Wheel available. The {{pip}} tools knows how to select the right wheel file matching the current system, and how to install this package in a light speed (without compilation). Said otherwise, package that requires compilation of a C module, for instance "numpy", does *not* compile anything when installing from wheel file. {{pypi.pypthon.org}} already provided wheels for major python version. It the wheel is not available, pip will compile it from source anyway. Mirroring of Pypi is possible through projects such as http://doc.devpi.net/latest/ (untested) or the Pypi mirror support on Artifactory (tested personnally). {{pip}} also provides the ability to generate easily all wheels of all packages used for a given project which is inside a "virtualenv". This is called "wheelhouse". You can even don't mess with this compilation and retrieve it directly from pypi.python.org. *Use Case 1: no internet connectivity* Here my first proposal for a deployment workflow, in the case where the Spark cluster does not have any internet connectivity or access to a Pypi mirror. In this case the simplest way to deploy a project with several dependencies is to build and then send to complete "wheelhouse": - you are writing a PySpark script that increase in term of size and dependencies. Deploying on Spark for example requires to build numpy or Theano and other dependencies - to use "Big Fat Wheelhouse" support of Pyspark, you need to turn his script into a standard Python package: -- write a {{requirements.txt}}. I recommend to specify all package version. 
You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the requirements.txt {code} astroid==1.4.6 # via pylint autopep8==1.2.4 click==6.6 # via pip-tools colorama==0.3.7 # via pylint enum34==1.1.6 # via hypothesis findspark==1.0.0 # via spark-testing-base first==2.0.1 # via pip-tools hypothesis==3.4.0 # via spark-testing-base lazy-object-proxy==1.2.2 # via astroid linecache2==1.0.0 # via traceback2 pbr==1.10.0 pep8==1.7.0 # via autopep8 pip-tools==1.6.5 py==1.4.31 # via pytest pyflakes==1.2.3 pylint==1.5.6 pytest==2.9.2 # via spark-testing-base six==1.10.0 # via astroid, pip-tools, pylint, unittest2 spark-testing-base==0.0.7.post2 traceback2==1.4.0 # via unittest2 unittest2==1.1.0 # via spark-testing-base wheel==0.29.0 wrapt==1.10.8 # via astroid {code} -- write a setup.py with some entry points or package. Use [PBR|http://docs.openstack.org/developer/pbr/] it makes the jobs of maitaining a setup.py files really easy -- create a virtualenv if not already in one: {code} virtualenv env {code} -- Work on your environment, define the requirement you need in {{requirements.txt}}, do all the {{pip install}} you need. - create the wheelhouse for your current project {code} pip install wheelhouse pip wheel . --wheel-dir wheelhouse {code} This can take some times, but at the end you have all the .whl required *for your current system* in a directory {{wheelhouse}}. - zip it into a {{wheelhouse.zip}}. Note that you can have your own package (for instance 'my_package') be generated into a wheel and so installed by {{pip}} automatically. Now comes the time to submit the project: {code} bin/spark-submit
[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark
[ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semet updated SPARK-16367: -- Description: *Rational* Is it recommended, in order to deploying Scala packages written in Scala, to build big fat jar files. This allows to have all dependencies on one package so the only "cost" is copy time to deploy this file on every Spark Node. On the other hand, Python deployment is more difficult once you want to use external packages, and you don't really want to mess with the IT to deploy the packages on the virtualenv of each nodes. *Previous approaches* I based the current proposal over the two following bugs related to this point: - SPARK-6764 ("Wheel support for PySpark") - SPARK-13587("Support virtualenv in PySpark") First part of my proposal was to merge, in order to support wheels install and virtualenv creation *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* In Python, the packaging standard is now the "wheels" file format, which goes further that good old ".egg" files. With a wheel file (".whl"), the package is already prepared for a given architecture. You can have several wheels for a given package version, each specific to an architecture, or environment. For example, look at https://pypi.python.org/pypi/numpy all the different version of Wheel available. The {{pip}} tools knows how to select the right wheel file matching the current system, and how to install this package in a light speed (without compilation). Said otherwise, package that requires compilation of a C module, for instance "numpy", does *not* compile anything when installing from wheel file. {{pypi.pypthon.org}} already provided wheels for major python version. It the wheel is not available, pip will compile it from source anyway. Mirroring of Pypi is possible through projects such as http://doc.devpi.net/latest/ (untested) or the Pypi mirror support on Artifactory (tested personnally). {{pip}} also provides the ability to generate easily all wheels of all packages used for a given project which is inside a "virtualenv". This is called "wheelhouse". You can even don't mess with this compilation and retrieve it directly from pypi.python.org. *Use Case 1: no internet connectivity* Here my first proposal for a deployment workflow, in the case where the Spark cluster does not have any internet connectivity or access to a Pypi mirror. In this case the simplest way to deploy a project with several dependencies is to build and then send to complete "wheelhouse": - you are writing a PySpark script that increase in term of size and dependencies. Deploying on Spark for example requires to build numpy or Theano and other dependencies - to use "Big Fat Wheelhouse" support of Pyspark, you need to turn his script into a standard Python package: -- write a {{requirements.txt}}. I recommend to specify all package version. 
You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the requirements.txt {code} astroid==1.4.6 # via pylint autopep8==1.2.4 click==6.6 # via pip-tools colorama==0.3.7 # via pylint enum34==1.1.6 # via hypothesis findspark==1.0.0 # via spark-testing-base first==2.0.1 # via pip-tools hypothesis==3.4.0 # via spark-testing-base lazy-object-proxy==1.2.2 # via astroid linecache2==1.0.0 # via traceback2 pbr==1.10.0 pep8==1.7.0 # via autopep8 pip-tools==1.6.5 py==1.4.31 # via pytest pyflakes==1.2.3 pylint==1.5.6 pytest==2.9.2 # via spark-testing-base six==1.10.0 # via astroid, pip-tools, pylint, unittest2 spark-testing-base==0.0.7.post2 traceback2==1.4.0 # via unittest2 unittest2==1.1.0 # via spark-testing-base wheel==0.29.0 wrapt==1.10.8 # via astroid {code} -- write a setup.py with some entry points or package. Use [PBR|http://docs.openstack.org/developer/pbr/] it makes the jobs of maitaining a setup.py files really easy -- create a virtualenv if not already in one: {code} virtualenv env {code} -- Work on your environment, define the requirement you need in {{requirements.txt}}, do all the {{pip install}} you need. - create the wheelhouse for your current project {code} pip install wheelhouse pip wheel . --wheel-dir wheelhouse {code} This can take some times, but at the end you have all the .whl required *for your current system* in a directory {{wheelhouse}}. - zip it into a {{wheelhouse.zip}}. Note that you can have your own package (for instance 'my_package') be generated into a wheel and so installed by {{pip}} automatically. Now comes the time to submit the project: {code} bin/spark-submit --master master --deploy-mode client --files /path/to/virtualenv/requirements.txt,/path/to/virtualenv/wheelhouse.zip --conf "spark.pyspark.virtualenv.enabled=true" ~/path/to/launcher_script.py {code} You can see that: - no extra argument is add in the command line. All configuration goes through {{--conf}}
[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark
[ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semet updated SPARK-16367: -- Description: *Rational* Is it recommended, in order to deploying Scala packages written in Scala, to build big fat jar files. This allows to have all dependencies on one package so the only "cost" is copy time to deploy this file on every Spark Node. On the other hand, Python deployment is more difficult once you want to use external packages, and you don't really want to mess with the IT to deploy the packages on the virtualenv of each nodes. *Previous approaches* I based the current proposal over the two following bugs related to this point: - SPARK-6764 ("Wheel support for PySpark") - SPARK-13587("Support virtualenv in PySpark") First part of my proposal was to merge, in order to support wheels install and virtualenv creation *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* In Python, the packaging standard is now the "wheels" file format, which goes further that good old ".egg" files. With a wheel file (".whl"), the package is already prepared for a given architecture. You can have several wheels for a given package version, each specific to an architecture, or environment. For example, look at https://pypi.python.org/pypi/numpy all the different version of Wheel available. The {{pip}} tools knows how to select the right wheel file matching the current system, and how to install this package in a light speed (without compilation). Said otherwise, package that requires compilation of a C module, for instance "numpy", does *not* compile anything when installing from wheel file. {{pypi.pypthon.org}} already provided wheels for major python version. It the wheel is not available, pip will compile it from source anyway. Mirroring of Pypi is possible through projects such as http://doc.devpi.net/latest/ (untested) or the Pypi mirror support on Artifactory (tested personnally). {{pip}} also provides the ability to generate easily all wheels of all packages used for a given project which is inside a "virtualenv". This is called "wheelhouse". You can even don't mess with this compilation and retrieve it directly from pypi.python.org. *Use Case 1: no internet connectivity* Here my first proposal for a deployment workflow, in the case where the Spark cluster does not have any internet connectivity or access to a Pypi mirror. In this case the simplest way to deploy a project with several dependencies is to build and then send to complete "wheelhouse": - you are writing a PySpark script that increase in term of size and dependencies. Deploying on Spark for example requires to build numpy or Theano and other dependencies - to use "Big Fat Wheelhouse" support of Pyspark, you need to turn his script into a standard Python package: -- write a {{requirements.txt}}. I recommend to specify all package version. 
You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the requirements.txt {code} astroid==1.4.6 # via pylint autopep8==1.2.4 click==6.6 # via pip-tools colorama==0.3.7 # via pylint enum34==1.1.6 # via hypothesis findspark==1.0.0 # via spark-testing-base first==2.0.1 # via pip-tools hypothesis==3.4.0 # via spark-testing-base lazy-object-proxy==1.2.2 # via astroid linecache2==1.0.0 # via traceback2 pbr==1.10.0 pep8==1.7.0 # via autopep8 pip-tools==1.6.5 py==1.4.31 # via pytest pyflakes==1.2.3 pylint==1.5.6 pytest==2.9.2 # via spark-testing-base six==1.10.0 # via astroid, pip-tools, pylint, unittest2 spark-testing-base==0.0.7.post2 traceback2==1.4.0 # via unittest2 unittest2==1.1.0 # via spark-testing-base wheel==0.29.0 wrapt==1.10.8 # via astroid {code} -- write a setup.py with some entry points or package. Use [PBR|http://docs.openstack.org/developer/pbr/] it makes the jobs of maitaining a setup.py files really easy -- create a virtualenv if not already in one: {code} virtualenv env {code} -- Work on your environment, define the requirement you need in {{requirements.txt}}, do all the {{pip install}} you need. - create the wheelhouse for your current project {code} pip install wheelhouse pip wheel . --wheel-dir wheelhouse {code} This can take some times, but at the end you have all the .whl required *for your current system* in a directory {{wheelhouse}}. - zip it into a {{wheelhouse.zip}}. Note that you can have your own package (for instance 'my_package') be generated into a wheel and so installed by {{pip}} automatically. Now comes the time to submit the project: {code} bin/spark-submit --master master --deploy-mode client --files /path/to/virtualenv/requirements.txt,/path/to/virtualenv/wheelhouse.zip --conf "spark.pyspark.virtualenv.enabled=true" ~/path/to/launcher_script.py {code} You can see that: - no extra argument is add in the command line. All configuration goes through {{--conf}}
[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark
[ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semet updated SPARK-16367: -- Description: *Rational* Is it recommended, in order to deploying Scala packages written in Scala, to build big fat jar files. This allows to have all dependencies on one package so the only "cost" is copy time to deploy this file on every Spark Node. On the other hand, Python deployment is more difficult once you want to use external packages, and you don't really want to mess with the IT to deploy the packages on the virtualenv of each nodes. *Previous approaches* I based the current proposal over the two following bugs related to this point: - SPARK-6764 ("Wheel support for PySpark") - SPARK-13587("Support virtualenv in PySpark") First part of my proposal was to merge, in order to support wheels install and virtualenv creation *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* In Python, the packaging standard is now the "wheels" file format, which goes further that good old ".egg" files. With a wheel file (".whl"), the package is already prepared for a given architecture. You can have several wheels for a given package version, each specific to an architecture, or environment. For example, look at https://pypi.python.org/pypi/numpy all the different version of Wheel available. The {{pip}} tools knows how to select the right wheel file matching the current system, and how to install this package in a light speed (without compilation). Said otherwise, package that requires compilation of a C module, for instance "numpy", does *not* compile anything when installing from wheel file. {{pypi.pypthon.org}} already provided wheels for major python version. It the wheel is not available, pip will compile it from source anyway. Mirroring of Pypi is possible through projects such as http://doc.devpi.net/latest/ (untested) or the Pypi mirror support on Artifactory (tested personnally), for example in companies with a weird internet proxy settings or if you want to protect your spark cluster from the web. {{pip}} also provides the ability to generate easily all wheels of all packages used for a given project which is inside a "virtualenv". This is called "wheelhouse". You can even don't mess with this compilation and retrieve it directly from pypi.python.org. *Use Case 1: no internet connectivity* Here my first proposal for a deployment workflow, in the case where the Spark cluster does not have any internet connectivity or access to a Pypi mirror. In this case the simplest way to deploy a project with several dependencies is to build and then send to complete "wheelhouse": - you are writing a PySpark script that increase in term of size and dependencies. Deploying on Spark for example requires to build numpy or Theano and other dependencies - to use "Big Fat Wheelhouse" support of Pyspark, you need to turn his script into a standard Python package: -- write a {{requirements.txt}}. I recommend to specify all package version. 
You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the requirements.txt {code} astroid==1.4.6# via pylint autopep8==1.2.4 click==6.6# via pip-tools colorama==0.3.7 # via pylint enum34==1.1.6 # via hypothesis findspark==1.0.0 # via spark-testing-base first==2.0.1 # via pip-tools hypothesis==3.4.0 # via spark-testing-base lazy-object-proxy==1.2.2 # via astroid linecache2==1.0.0 # via traceback2 pbr==1.10.0 pep8==1.7.0 # via autopep8 pip-tools==1.6.5 py==1.4.31# via pytest pyflakes==1.2.3 pylint==1.5.6 pytest==2.9.2 # via spark-testing-base six==1.10.0 # via astroid, pip-tools, pylint, unittest2 spark-testing-base==0.0.7.post2 traceback2==1.4.0 # via unittest2 unittest2==1.1.0 # via spark-testing-base wheel==0.29.0 wrapt==1.10.8 # via astroid {code} -- write a setup.py with some entry points or package. Use [PBR|http://docs.openstack.org/developer/pbr/] it makes the jobs of maitaining a setup.py files really easy -- create a virtualenv if not already in one: {code} virtualenv env {code} -- Work on your environment, define the requirement you need in {{requirements.txt}}, do all the {{pip install}} you need. - create the wheelhouse for your current project {code} pip install wheelhouse pip wheel . --wheel-dir wheelhouse {code} This can take some times, but at the end you have all the .whl required *for your current system* in a directory {{wheelhouse}}. - zip it into a {{wheelhouse.zip}}. Note that you can have your own package (for instance 'my_package') be generated into a wheel and so installed by {{pip}} automatically. Now comes the time to submit the project: {code} bin/spark-submit --master master --deploy-mode client --files
[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark
[ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semet updated SPARK-16367: -- Description: *Rational* Is it recommended, in order to deploying Scala packages written in Scala, to build big fat jar files. This allows to have all dependencies on one package so the only "cost" is copy time to deploy this file on every Spark Node. On the other hand, Python deployment is more difficult once you want to use external packages, and you don't really want to mess with the IT to deploy the packages on the virtualenv of each nodes. *Previous approaches* I based the current proposal over the two following bugs related to this point: - SPARK-6764 ("Wheel support for PySpark") - SPARK-13587("Support virtualenv in PySpark") First part of my proposal was to merge, in order to support wheels install and virtualenv creation *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* In Python, the packaging standard is now the "wheels" file format, which goes further that good old ".egg" files. With a wheel file (".whl"), the package is already prepared for a given architecture. You can have several wheels for a given package version, each specific to an architecture, or environment. For example, look at https://pypi.python.org/pypi/numpy all the different version of Wheel available. The {{pip}} tools knows how to select the right wheel file matching the current system, and how to install this package in a light speed (without compilation). Said otherwise, package that requires compilation of a C module, for instance "numpy", does *not* compile anything when installing from wheel file. {{pypi.pypthon.org}} already provided wheels for major python version. It the wheel is not available, pip will compile it from source anyway. Mirroring of Pypi is possible through projects such as http://doc.devpi.net/latest/ (untested) or the Pypi mirror support on Artifactory (tested personnally), for example in companies with a weird internet proxy settings or if you want to protect your spark cluster from the web. {{pip}} also provides the ability to generate easily all wheels of all packages used for a given project which is inside a "virtualenv". This is called "wheelhouse". You can even don't mess with this compilation and retrieve it directly from pypi.python.org. *Use Case 1: no internet connectivity* Here my first proposal for a deployment workflow, in the case where the Spark cluster does not have any internet connectivity or access to a Pypi mirror. In this case the simplest way to deploy a project with several dependencies is to build and then send to complete "wheelhouse": - you are writing a PySpark script that increase in term of size and dependencies. Deploying on Spark for example requires to build numpy or Theano and other dependencies - to use "Big Fat Wheelhouse" support of Pyspark, you need to turn his script into a standard Python package: -- write a {{requirements.txt}}. I recommend to specify all package version. 
You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the requirements.txt {code} astroid==1.4.6# via pylint autopep8==1.2.4 click==6.6# via pip-tools colorama==0.3.7 # via pylint enum34==1.1.6 # via hypothesis findspark==1.0.0 # via spark-testing-base first==2.0.1 # via pip-tools hypothesis==3.4.0 # via spark-testing-base lazy-object-proxy==1.2.2 # via astroid linecache2==1.0.0 # via traceback2 pbr==1.10.0 pep8==1.7.0 # via autopep8 pip-tools==1.6.5 py==1.4.31# via pytest pyflakes==1.2.3 pylint==1.5.6 pytest==2.9.2 # via spark-testing-base six==1.10.0 # via astroid, pip-tools, pylint, unittest2 spark-testing-base==0.0.7.post2 traceback2==1.4.0 # via unittest2 unittest2==1.1.0 # via spark-testing-base wheel==0.29.0 wrapt==1.10.8 # via astroid {code} -- write a setup.py with some entry points or package. Use [PBR|http://docs.openstack.org/developer/pbr/] it makes the jobs of maitaining a setup.py files really easy -- create a virtualenv if not already in one: {code} virtualenv env {code} -- Work on your environment, define the requirement you need in {{requirements.txt}}, do all the {{pip install}} you need. - create the wheelhouse for your current project {code} pip install wheelhouse pip wheel . --wheel-dir wheelhouse {code} This can take some times, but at the end you have all the .whl required *for your current system* in a directory {{wheelhouse}}. - zip it into a {{wheelhouse.zip}}. Note that you can have your own package (for instance 'my_package') be generated into a wheel and so installed by {{pip}} automatically. Now comes the time to submit the project: {code} bin/spark-submit --master master --deploy-mode client --files
[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark
[ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semet updated SPARK-16367: -- Description: *Rational* Is it recommended, in order to deploying Scala packages written in Scala, to build big fat jar files. This allows to have all dependencies on one package so the only "cost" is copy time to deploy this file on every Spark Node. On the other hand, Python deployment is more difficult once you want to use external packages, and you don't really want to mess with the IT to deploy the packages on the virtualenv of each nodes. *Previous approaches* I based the current proposal over the two following bugs related to this point: - SPARK-6764 ("Wheel support for PySpark") - SPARK-13587("Support virtualenv in PySpark") First part of my proposal was to merge, in order to support wheels install and virtualenv creation *Virtualenv, wheel suppoer and "Uber Fat Wheelhouse" for PySpark* In Python, the packaging standard is now the "wheels" file format, which goes further that good old ".egg" files. With a wheel file (".whl"), the package is already prepared for a given architecture. You can have several wheels for a given package version, each specific to an architecture, or environment. For example, look at https://pypi.python.org/pypi/numpy all the different version of Wheel available. The {{pip}} tools knows how to select the right wheel file matching the current system, and how to install this package in a light speed (without compilation). Said otherwise, package that requires compilation of a C module, for instance "numpy", does *not* compile anything when installing from wheel file. {{pypi.pypthon.org}} already provided wheels for major python version. It the wheel is not available, pip will compile it from source anyway. Mirroring of Pypi is possible through projects such as http://doc.devpi.net/latest/ (untested) or the Pypi mirror support on Artifactory (tested personnally). {{pip}} also provides the ability to generate easily all wheels of all packages used for a given project which is inside a "virtualenv". This is called "wheelhouse". You can even don't mess with this compilation and retrieve it directly from pypi.python.org. *Use Case 1: no internet connectivity* Here my first proposal for a deployment workflow, in the case where the Spark cluster does not have any internet connectivity or access to a Pypi mirror. In this case the simplest way to deploy a project with several dependencies is to build and then send to complete "wheelhouse": - you are writing a PySpark script that increase in term of size and dependencies. Deploying on Spark for example requires to build numpy or Theano and other dependencies - to use "Big Fat Wheelhouse" support of Pyspark, you need to turn his script into a standard Python package: -- write a {{requirements.txt}}. I recommend to specify all package version. 
You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the requirements.txt {code} astroid==1.4.6# via pylint autopep8==1.2.4 click==6.6# via pip-tools colorama==0.3.7 # via pylint enum34==1.1.6 # via hypothesis findspark==1.0.0 # via spark-testing-base first==2.0.1 # via pip-tools hypothesis==3.4.0 # via spark-testing-base lazy-object-proxy==1.2.2 # via astroid linecache2==1.0.0 # via traceback2 pbr==1.10.0 pep8==1.7.0 # via autopep8 pip-tools==1.6.5 py==1.4.31# via pytest pyflakes==1.2.3 pylint==1.5.6 pytest==2.9.2 # via spark-testing-base six==1.10.0 # via astroid, pip-tools, pylint, unittest2 spark-testing-base==0.0.7.post2 traceback2==1.4.0 # via unittest2 unittest2==1.1.0 # via spark-testing-base wheel==0.29.0 wrapt==1.10.8 # via astroid {code} -- write a setup.py with some entry points or package. Use [PBR|http://docs.openstack.org/developer/pbr/] it makes the jobs of maitaining a setup.py files really easy -- create a virtualenv if not already in one: {code} virtualenv env {code} -- Work on your environment, define the requirement you need in {{requirements.txt}}, do all the {{pip install}} you need. - create the wheelhouse for your current project {code} pip install wheelhouse pip wheel . --wheel-dir wheelhouse {code} This can take some times, but at the end you have all the .whl required *for your current system* in a directory {{wheelhouse}}. - zip it into a {{wheelhouse.zip}}. Note that you can have your own package (for instance 'my_package') be generated into a wheel and so installed by {{pip}} automatically. Now comes the time to submit the project: {code} bin/spark-submit --master master --deploy-mode client --files /path/to/virtualenv/requirements.txt,/path/to/virtualenv/wheelhouse.zip --conf "spark.pyspark.virtualenv.enabled=true" ~/path/to/launcher_script.py {code}
[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark
[ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semet updated SPARK-16367: -- Description: *Rational* Is it recommended, in order to deploying Scala packages written in Scala, to build big fat jar files. This allows to have all dependencies on one package so the only "cost" is copy time to deploy this file on every Spark Node. On the other hand, Python deployment is more difficult once you want to use external packages, and you don't really want to mess with the IT to deploy the packages on the virtualenv of each nodes. *Previous approaches* I based the current proposal over the two following bugs related to this point: - SPARK-6764 ("Wheel support for PySpark") - SPARK-13587("Support virtualenv in PySpark") First part of my proposal was to merge, in order to support wheels install and virtualenv creation *Virtualenv, wheel suppoer and "Uber Fat Wheelhouse" for PySpark* In Python, the packaging standard is now the "wheels" file format, which goes further that good old ".egg" files. With a wheel file (".whl"), the package is already prepared for a given architecture. You can have several wheels for a given package version, each specific to an architecture, or environment. For example, look at https://pypi.python.org/pypi/numpy all the different version of Wheel available. The {{pip}} tools knows how to select the right wheel file matching the current system, and how to install this package in a light speed (without compilation). Said otherwise, package that requires compilation of a C module, for instance "numpy", does *not* compile anything when installing from wheel file. {{pypi.pypthon.org}} already provided wheels for major python version. It the wheel is not available, pip will compile it from source anyway. Mirroring of Pypi is possible through projects such as http://doc.devpi.net/latest/ (untested) or the Pypi mirror support on Artifactory (tested personnally). {{pip}} also provides the ability to generate easily all wheels of all packages used for a given project which is inside a "virtualenv". This is called "wheelhouse". You can even don't mess with this compilation and retrieve it directly from pypi.python.org. *Use Case 1: no internet connectivity* Here my first proposal for a deployment workflow, in the case where the Spark cluster does not have any internet connectivity or access to a Pypi mirror. In this case the simplest way to deploy a project with several dependencies is to build and then send to complete "wheelhouse": - you are writing a PySpark script that increase in term of size and dependencies. Deploying on Spark for example requires to build numpy or Theano and other dependencies - to use "Big Fat Wheelhouse" support of Pyspark, you need to turn his script into a standard Python package: -- write a {{requirements.txt}}. I recommend to specify all package version. 
You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the requirements.txt {code} astroid==1.4.6# via pylint autopep8==1.2.4 click==6.6# via pip-tools colorama==0.3.7 # via pylint enum34==1.1.6 # via hypothesis findspark==1.0.0 # via spark-testing-base first==2.0.1 # via pip-tools hypothesis==3.4.0 # via spark-testing-base lazy-object-proxy==1.2.2 # via astroid linecache2==1.0.0 # via traceback2 pbr==1.10.0 pep8==1.7.0 # via autopep8 pip-tools==1.6.5 py==1.4.31# via pytest pyflakes==1.2.3 pylint==1.5.6 pytest==2.9.2 # via spark-testing-base six==1.10.0 # via astroid, pip-tools, pylint, unittest2 spark-testing-base==0.0.7.post2 traceback2==1.4.0 # via unittest2 unittest2==1.1.0 # via spark-testing-base wheel==0.29.0 wrapt==1.10.8 # via astroid {code} -- write a setup.py with some entry points or package. Use [PBR|http://docs.openstack.org/developer/pbr/] it makes the jobs of maitaining a setup.py files really easy -- create a virtualenv if not already in one: {code} virtualenv env {code} -- Work on your environment, define the requirement you need in {{requirements.txt}}, do all the {{pip install}} you need. - create the wheelhouse for your current project {code} pip install wheelhouse pip wheel . --wheel-dir wheelhouse {code} This can take some times, but at the end you have all the .whl required *for your current system* in a directory {{wheelhouse}}. - zip it into a {{wheelhouse.zip}}. Note that you can have your own package (for instance 'my_package') be generated into a wheel and so installed by {{pip}} automatically. Now comes the time to submit the project: {code} bin/spark-submit --master master --deploy-mode client --files /path/to/virtualenv/requirements.txt,/path/to/virtualenv/wheelhouse.zip --conf "spark.pyspark.virtualenv.enabled=true" ~/path/to/launcher_script.py {code}
[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark
[ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semet updated SPARK-16367: -- Description: *Rational* Is it recommended, in order to deploying Scala packages written in Scala, to build big fat jar files. This allows to have all dependencies on one package so the only "cost" is copy time to deploy this file on every Spark Node. On the other hand, Python deployment is more difficult once you want to use external packages, and you don't really want to mess with the IT to deploy the packages on the virtualenv of each nodes. *Previous approaches* I based the current proposal over the two following bugs related to this point: - SPARK-6764 ("Wheel support for PySpark") - SPARK-13587("Support virtualenv in PySpark") First part of my proposal was to merge, in order to support wheels install and virtualenv creation *Virtualenv, wheel suppoer and "Uber Fat Wheelhouse" for PySpark* In Python, the packaging standard is now the "wheels" file format, which goes further that good old ".egg" files. With a wheel file (".whl"), the package is already prepared for a given architecture. You can have several wheels for a given package version, each specific to an architecture, or environment. For example, look at https://pypi.python.org/pypi/numpy all the different version of Wheel available. The {{pip}} tools knows how to select the right wheel file matching the current system, and how to install this package in a light speed (without compilation). Said otherwise, package that requires compilation of a C module, for instance "numpy", does *not* compile anything when installing from wheel file. {{pypi.pypthon.org}} already provided wheels for major python version. It the wheel is not available, pip will compile it from source anyway. Mirroring of Pypi is possible through projects such as http://doc.devpi.net/latest/ (untested) or the Pypi mirror support on Artifactory (tested personnally). {{pip}} also provides the ability to generate easily all wheels of all packages used for a given project which is inside a "virtualenv". This is called "wheelhouse". You can even don't mess with this compilation and retrieve it directly from pypi.python.org. *Use Case 1: no internet connectivity* Here my first proposal for a deployment workflow, in the case where the Spark cluster does not have any internet connectivity or access to a Pypi mirror. In this case the simplest way to deploy a project with several dependencies is to build and then send to complete "wheelhouse": - you are writing a PySpark script that increase in term of size and dependencies. Deploying on Spark for example requires to build numpy or Theano and other dependencies - to use "Big Fat Wheelhouse" support of Pyspark, you need to turn his script into a standard Python package: -- write a {{requirements.txt}}. I recommend to specify all package version. 
You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the requirements.txt {code} astroid==1.4.6# via pylint autopep8==1.2.4 click==6.6# via pip-tools colorama==0.3.7 # via pylint enum34==1.1.6 # via hypothesis findspark==1.0.0 # via spark-testing-base first==2.0.1 # via pip-tools hypothesis==3.4.0 # via spark-testing-base lazy-object-proxy==1.2.2 # via astroid linecache2==1.0.0 # via traceback2 pbr==1.10.0 pep8==1.7.0 # via autopep8 pip-tools==1.6.5 py==1.4.31# via pytest pyflakes==1.2.3 pylint==1.5.6 pytest==2.9.2 # via spark-testing-base six==1.10.0 # via astroid, pip-tools, pylint, unittest2 spark-testing-base==0.0.7.post2 traceback2==1.4.0 # via unittest2 unittest2==1.1.0 # via spark-testing-base wheel==0.29.0 wrapt==1.10.8 # via astroid {code} -- write a setup.py with some entry points or package. Use [PBR|http://docs.openstack.org/developer/pbr/] it makes the jobs of maitaining a setup.py files really easy -- create a virtualenv if not already in one: {code} virtualenv env {code} -- Work on your environment, define the requirement you need in {{requirements.txt}}, do all the {{pip install}} you need. - create the wheelhouse for your current project {code} pip install wheelhouse pip wheel . --wheel-dir wheelhouse {code} This can take some times, but at the end you have all the .whl required *for your current system* in a directory {{wheelhouse}}. - zip it into a {{wheelhouse.zip}}. Note that you can have your own package (for instance 'my_package') be generated into a wheel and so installed by {{pip}} automatically. Now comes the time to submit the project: {code} bin/spark-submit --master master --deploy-mode client --files /path/to/virtualenv/requirements.txt,/path/to/virtualenv/wheelhouse.zip --conf "spark.pyspark.virtualenv.enabled=true" ~/path/to/launcher_script.py {code}
[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark
[ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semet updated SPARK-16367: -- Description: *Rational* Is it recommended, in order to deploying Scala packages written in Scala, to build big fat jar files. This allows to have all dependencies on one package so the only "cost" is copy time to deploy this file on every Spark Node. On the other hand, Python deployment is more difficult once you want to use external packages, and you don't really want to mess with the IT to deploy the packages on the virtualenv of each nodes. *Previous approaches* I based the current proposal over the two following bugs related to this point: - SPARK-6764 ("Wheel support for PySpark") - SPARK-13587("Support virtualenv in PySpark") First part of my proposal was to merge, in order to support wheels install and virtualenv creation *Virtualenv, wheel suppoer and "Uber Fat Wheelhouse" for PySpark* In Python, the packaging standard is now the "wheels" file format, which goes further that good old ".egg" files. With a wheel file (".whl"), the package is already prepared for a given architecture. You can have several wheels for a given package version, each specific to an architecture, or environment. For example, look at https://pypi.python.org/pypi/numpy all the different version of Wheel available. The {{pip}} tools knows how to select the right wheel file matching the current system, and how to install this package in a light speed (without compilation). Said otherwise, package that requires compilation of a C module, for instance "numpy", does *not* compile anything when installing from wheel file. {{pypi.pypthon.org}} already provided wheels for major python version. It the wheel is not available, pip will compile it from source anyway. Mirroring of Pypi is possible through projects such as http://doc.devpi.net/latest/ (untested) or the Pypi mirror support on Artifactory (tested personnally). {{pip}} also provides the ability to generate easily all wheels of all packages used for a given project which is inside a "virtualenv". This is called "wheelhouse". You can even don't mess with this compilation and retrieve it directly from pypi.python.org. *Use Case 1: no internet connectivity* Here my first proposal for a deployment workflow, in the case where the Spark cluster does not have any internet connectivity or access to a Pypi mirror. In this case the simplest way to deploy a project with several dependencies is to build and then send to complete "wheelhouse": - you are writing a PySpark script that increase in term of size and dependencies. Deploying on Spark for example requires to build numpy or Theano and other dependencies - to use "Big Fat Wheelhouse" support of Pyspark, you need to turn his script into a standard Python package: -- write a {{requirements.txt}}. I recommend to specify all package version. 
You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the requirements.txt {code} astroid==1.4.6# via pylint autopep8==1.2.4 click==6.6# via pip-tools colorama==0.3.7 # via pylint enum34==1.1.6 # via hypothesis findspark==1.0.0 # via spark-testing-base first==2.0.1 # via pip-tools hypothesis==3.4.0 # via spark-testing-base lazy-object-proxy==1.2.2 # via astroid linecache2==1.0.0 # via traceback2 pbr==1.10.0 pep8==1.7.0 # via autopep8 pip-tools==1.6.5 py==1.4.31# via pytest pyflakes==1.2.3 pylint==1.5.6 pytest==2.9.2 # via spark-testing-base six==1.10.0 # via astroid, pip-tools, pylint, unittest2 spark-testing-base==0.0.7.post2 traceback2==1.4.0 # via unittest2 unittest2==1.1.0 # via spark-testing-base wheel==0.29.0 wrapt==1.10.8 # via astroid {code} -- write a setup.py with some entry points or package. Use [PBR|http://docs.openstack.org/developer/pbr/] it makes the jobs of maitaining a setup.py files really easy -- create a virtualenv if not already in one: {code} virtualenv env {code} -- Work on your environment, define the requirement you need in {{requirements.txt}}, do all the {{pip install}} you need. - create the wheelhouse for your current project {code} pip install wheelhouse pip wheel . --wheel-dir wheelhouse {code} This can take some times, but at the end you have all the .whl required *for your current system* in a directory {{wheelhouse}}. - zip it into a {{wheelhouse.zip}}. Note that you can have your own package (for instance 'my_package') be generated into a wheel and so installed by {{pip}} automatically. Now comes the time to submit the project: {code} bin/spark-submit --master master --deploy-mode client --files /path/to/virtualenv/requirements.txt,/path/to/virtualenv/wheelhouse.zip --conf "spark.pyspark.virtualenv.enabled=true" ~/path/to/launcher_script.py {code}
[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark
[ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semet updated SPARK-16367: -- Description: *Rational* Is it recommended, in order to deploying Scala packages written in Scala, to build big fat jar files. This allows to have all dependencies on one package so the only "cost" is copy time to deploy this file on every Spark Node. On the other hand, Python deployment is more difficult once you want to use external packages, and you don't really want to mess with the IT to deploy the packages on the virtualenv of each nodes. *Previous approaches* I based the current proposal over the two following bugs related to this point: - SPARK-6764 ("Wheel support for PySpark") - SPARK-13587("Support virtualenv in PySpark") First part of my proposal was to merge, in order to support wheels install and virtualenv creation *Virtualenv, wheel suppoer and "Uber Fat Wheelhouse" for PySpark* In Python, the packaging standard is now the "wheels" file format, which goes further that good old ".egg" files. With a wheel file (".whl"), the package is already prepared for a given architecture. You can have several wheels for a given package version, each specific to an architecture, or environment. For example, look at https://pypi.python.org/pypi/numpy all the different version of Wheel available. The {{pip}} tools knows how to select the right wheel file matching the current system, and how to install this package in a light speed (without compilation). Said otherwise, package that requires compilation of a C module, for instance "numpy", does *not* compile anything when installing from wheel file. {{pypi.pypthon.org}} already provided wheels for major python version. It the wheel is not available, pip will compile it from source anyway. Mirroring of Pypi is possible through projects such as http://doc.devpi.net/latest/ (untested) or the Pypi mirror support on Artifactory (tested personnally). {{pip}} also provides the ability to generate easily all wheels of all packages used for a given project which is inside a "virtualenv". This is called "wheelhouse". You can even don't mess with this compilation and retrieve it directly from pypi.python.org. *Use Case 1: no internet connectivity* Here my first proposal for a deployment workflow, in the case where the Spark cluster does not have any internet connectivity or access to a Pypi mirror. In this case the simplest way to deploy a project with several dependencies is to build and then send to complete "wheelhouse": - you are writing a PySpark script that increase in term of size and dependencies. Deploying on Spark for example requires to build numpy or Theano and other dependencies - to use "Big Fat Wheelhouse" support of Pyspark, you need to turn his script into a standard Python package: -- write a {{requirements.txt}}. I recommend to specify all package version. 
You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the requirements.txt {code} astroid==1.4.6# via pylint autopep8==1.2.4 click==6.6# via pip-tools colorama==0.3.7 # via pylint enum34==1.1.6 # via hypothesis findspark==1.0.0 # via spark-testing-base first==2.0.1 # via pip-tools hypothesis==3.4.0 # via spark-testing-base lazy-object-proxy==1.2.2 # via astroid linecache2==1.0.0 # via traceback2 pbr==1.10.0 pep8==1.7.0 # via autopep8 pip-tools==1.6.5 py==1.4.31# via pytest pyflakes==1.2.3 pylint==1.5.6 pytest==2.9.2 # via spark-testing-base six==1.10.0 # via astroid, pip-tools, pylint, unittest2 spark-testing-base==0.0.7.post2 traceback2==1.4.0 # via unittest2 unittest2==1.1.0 # via spark-testing-base wheel==0.29.0 wrapt==1.10.8 # via astroid {code} -- write a setup.py with some entry points or package. Use [PBR|http://docs.openstack.org/developer/pbr/] it makes the jobs of maitaining a setup.py files really easy -- create a virtualenv if not already in one: {code} virtualenv env {code} -- Work on your environment, define the requirement you need in {{requirements.txt}}, do all the {{pip install}} you need. - create the wheelhouse for your current project {code} pip install wheelhouse pip wheel . --wheel-dir wheelhouse {code} This can take some times, but at the end you have all the .whl required *for your current system* in a directory {{wheelhouse}}. - zip it into a {{wheelhouse.zip}}. Note that you can have your own package (for instance 'my_package') be generated into a wheel and so installed by {{pip}} automatically. Now comes the time to submit the project: {code} bin/spark-submit --master master --deploy-mode client --files /path/to/virtualenv/requirements.txt,/path/to/virtualenv/wheelhouse.zip --conf "spark.pyspark.virtualenv.enabled=true" ~/path/to/launcher_script.py {code}
[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark
[ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semet updated SPARK-16367: -- Description: *Rational* Is it recommended, in order to deploying Scala packages written in Scala, to build big fat jar files. This allows to have all dependencies on one package so the only "cost" is copy time to deploy this file on every Spark Node. On the other hand, Python deployment is more difficult once you want to use external packages, and you don't really want to mess with the IT to deploy the packages on the virtualenv of each nodes. *Previous approaches* I based the current proposal over the two following bugs related to this point: - SPARK-6764 ("Wheel support for PySpark") - SPARK-13587("Support virtualenv in PySpark") First part of my proposal was to merge, in order to support wheels install and virtualenv creation *Virtualenv, wheel suppoer and "Uber Fat Wheelhouse" for PySpark* In Python, the packaging standard is now the "wheels" file format, which goes further that good old ".egg" files. With a wheel file (".whl"), the package is already prepared for a given architecture. You can have several wheels for a given package version, each specific to an architecture, or environment. For example, look at https://pypi.python.org/pypi/numpy all the different version of Wheel available. The {{pip}} tools knows how to select the right wheel file matching the current system, and how to install this package in a light speed (without compilation). Said otherwise, package that requires compilation of a C module, for instance "numpy", does *not* compile anything when installing from wheel file. {{pypi.pypthon.org}} already provided wheels for major python version. It the wheel is not available, pip will compile it from source anyway. Mirroring of Pypi is possible through projects such as http://doc.devpi.net/latest/ (untested) or the Pypi mirror support on Artifactory (tested personnally). {{pip}} also provides the ability to generate easily all wheels of all packages used for a given project which is inside a "virtualenv". This is called "wheelhouse". You can even don't mess with this compilation and retrieve it directly from pypi.python.org. *Use Case 1: no internet connectivity* Here my first proposal for a deployment workflow, in the case where the Spark cluster does not have any internet connectivity or access to a Pypi mirror. In this case the simplest way to deploy a project with several dependencies is to build and then send to complete "wheelhouse": - you are writing a PySpark script that increase in term of size and dependencies. Deploying on Spark for example requires to build numpy or Theano and other dependencies - to use "Big Fat Wheelhouse" support of Pyspark, you need to turn his script into a standard Python package: -- write a {{requirements.txt}}. I recommend to specify all package version. 
You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the requirements.txt {code} astroid==1.4.6# via pylint autopep8==1.2.4 click==6.6# via pip-tools colorama==0.3.7 # via pylint enum34==1.1.6 # via hypothesis findspark==1.0.0 # via spark-testing-base first==2.0.1 # via pip-tools hypothesis==3.4.0 # via spark-testing-base lazy-object-proxy==1.2.2 # via astroid linecache2==1.0.0 # via traceback2 pbr==1.10.0 pep8==1.7.0 # via autopep8 pip-tools==1.6.5 py==1.4.31# via pytest pyflakes==1.2.3 pylint==1.5.6 pytest==2.9.2 # via spark-testing-base six==1.10.0 # via astroid, pip-tools, pylint, unittest2 spark-testing-base==0.0.7.post2 traceback2==1.4.0 # via unittest2 unittest2==1.1.0 # via spark-testing-base wheel==0.29.0 wrapt==1.10.8 # via astroid {code} -- write a setup.py with some entry points or package. Use [PBR|http://docs.openstack.org/developer/pbr/] it makes the jobs of maitaining a setup.py files really easy -- create a virtualenv if not already in one: {code} virtualenv env {code} -- Work on your environment, define the requirement you need in {{requirements.txt}}, do all the {{pip install}} you need. - create the wheelhouse for your current project {code} pip install wheelhouse pip wheel . --wheel-dir wheelhouse {code} This can take some times, but at the end you have all the .whl required *for your current system* in a directory {{wheelhouse}}. - zip it into a {{wheelhouse.zip}}. Note that you can have your own package (for instance 'my_package') be generated into a wheel and so installed by {{pip}} automatically. Now comes the time to submit the project: {code} bin/spark-submit --master master --deploy-mode client --files /path/to/virtualenv/requirements.txt,/path/to/virtualenv/wheelhouse.zip --conf "spark.pyspark.virtualenv.enabled=true" ~/path/to/launcher_script.py {code}
[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark
[ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semet updated SPARK-16367: -- Labels: newbie wh (was: newbie)
> Wheelhouse Support for PySpark
> --
>
> Key: SPARK-16367
> URL: https://issues.apache.org/jira/browse/SPARK-16367
> Project: Spark
> Issue Type: New Feature
> Components: Deploy, PySpark
> Affects Versions: 1.6.1, 1.6.2, 2.0.0
> Reporter: Semet
> Labels: newbie, python, python-wheel, wheelhouse
> Original Estimate: 168h
> Remaining Estimate: 168h
[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark
[ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semet updated SPARK-16367: -- Issue Type: New Feature (was: Improvement)
[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark
[ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semet updated SPARK-16367: -- Description:
*Rationale*
The recommended way to deploy packages written in Scala is to build a big fat jar file. All dependencies end up in a single artifact, so the only "cost" is the time needed to copy that file to every Spark node. Python deployment, on the other hand, becomes much harder as soon as you use external packages, because you do not really want to involve IT to install those packages in the virtualenv of each node.
*Previous approaches*
I based the current proposal on the two following issues:
- SPARK-6764 ("Wheel support for PySpark")
- SPARK-13587 ("Support virtualenv in PySpark")
So here is my proposal:
*Uber Fat Wheelhouse for Python Deployment*
In Python, the packaging standard is now "wheels", which go further than the good old ".egg" files. With a wheel file (".whl"), the package is already built for a given architecture; a single package version can have several wheels, each specific to an architecture or environment. {{pip}} knows how to select the wheel matching the current system and installs it very quickly. In other words, a package that requires compiling a C module (numpy, for instance) does *not* compile anything when it is installed from a wheel file.
{{pip}} also makes it easy to generate the wheels of every package used by a given project (inside a "virtualenv"). This is called a "wheelhouse". You can even skip the compilation step entirely and retrieve the wheels directly from pypi.python.org.
*Developer workflow*
Here is, more concretely, what the proposal looks like from a developer's point of view:
- you are writing a PySpark script that keeps growing in size and dependencies; deploying it on Spark currently requires building numpy or Theano and other dependencies on the nodes
- to use the "Big Fat Wheelhouse" support of PySpark, you turn this script into a standard Python package:
-- write a {{requirements.txt}}. I recommend pinning every package version; you can use [pip-tools|https://github.com/nvie/pip-tools] to maintain it:
{code}
astroid==1.4.6            # via pylint
autopep8==1.2.4
click==6.6                # via pip-tools
colorama==0.3.7           # via pylint
enum34==1.1.6             # via hypothesis
findspark==1.0.0          # via spark-testing-base
first==2.0.1              # via pip-tools
hypothesis==3.4.0         # via spark-testing-base
lazy-object-proxy==1.2.2  # via astroid
linecache2==1.0.0         # via traceback2
pbr==1.10.0
pep8==1.7.0               # via autopep8
pip-tools==1.6.5
py==1.4.31                # via pytest
pyflakes==1.2.3
pylint==1.5.6
pytest==2.9.2             # via spark-testing-base
six==1.10.0               # via astroid, pip-tools, pylint, unittest2
spark-testing-base==0.0.7.post2
traceback2==1.4.0         # via unittest2
unittest2==1.1.0          # via spark-testing-base
wheel==0.29.0
wrapt==1.10.8             # via astroid
{code}
-- write a {{setup.py}} with some entry points or packages. [PBR|http://docs.openstack.org/developer/pbr/] makes maintaining a {{setup.py}} file really easy
-- create a virtualenv if you are not already in one:
{code}
virtualenv env
{code}
-- work in this environment, declare the requirements you need in {{requirements.txt}}, and run all the {{pip install}} commands you need
- build the wheelhouse for your current project:
{code}
pip install wheel
pip wheel . --wheel-dir wheelhouse
{code}
- zip it into a {{wheelhouse.zip}} (a consolidated sketch of these build steps follows below). Note that your own package (for instance 'my_package') is also built into a wheel, so {{pip}} will install it automatically.
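As a minimal end-to-end sketch of the build steps above (assuming the project root contains the {{setup.py}} and the pinned {{requirements.txt}}; the exact layout expected inside {{wheelhouse.zip}} by the proposed PySpark support is still open, so the zip layout shown here is only an assumption):
{code}
# build everything in an isolated virtualenv
virtualenv env && . env/bin/activate
pip install wheel pip-tools

# optional: (re)generate the pinned requirements.txt from a requirements.in
pip-compile requirements.in

# build wheels for the project itself and for all pinned requirements
pip wheel . --wheel-dir wheelhouse
pip wheel -r requirements.txt --wheel-dir wheelhouse

# archive the wheelhouse so it can be shipped to the cluster
cd wheelhouse && zip -r ../wheelhouse.zip . && cd ..
{code}
Since the wheels are built *for the current system*, this should be run on a machine matching the cluster nodes (or the wheels fetched from pypi.python.org for the target platform).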
Now comes the time to submit the project:
{code}
bin/spark-submit --master master --deploy-mode client \
  --conf "spark.pyspark.virtualenv.enabled=true" \
  --conf "spark.pyspark.virtualenv.type=native" \
  --conf "spark.pyspark.virtualenv.requirements=/path/to/virtualenv/requirements.txt" \
  --conf "spark.pyspark.virtualenv.bin.path=virtualenv" \
  --conf "spark.pyspark.virtualenv.wheelhouse=/path/to/virtualenv/wheelhouse.zip" \
  ~/path/to/launcher_script.py
{code}
You can see that:
- no extra argument is added to the command line; all configuration goes through {{--conf}} arguments (this is taken directly from SPARK-13587). Judging from the history of the Spark source code, I guess the goal is to simplify the maintenance of the various command-line interfaces by avoiding too many specific arguments
- the command line is admittedly pretty complex; with proper documentation this should not be a problem
- you still need to give the paths to {{requirements.txt}} and {{wheelhouse.zip}} (they will be copied automatically to each node). This matters because it allows {{pip install}}, running on each node, to pick only the wheels that node needs. For example, if a package is compiled both for 32-bit and for 64-bit systems, the wheelhouse contains one wheel per architecture and each node installs the one matching its own platform.
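For illustration only (this is not necessarily the exact command the PySpark patch would run on the executors, and {{my_package}} is just a placeholder name), the node-side install conceptually amounts to pointing {{pip}} at the unpacked wheelhouse and letting it resolve against the local platform:
{code}
# hypothetical per-node install inside the job's virtualenv
unzip wheelhouse.zip -d wheelhouse
pip install --no-index --find-links=wheelhouse -r requirements.txt my_package
{code}
A wheelhouse built for several platforms simply holds one file per platform tag (for example a {{numpy-1.11.1-cp27-cp27mu-manylinux1_x86_64.whl}} next to a {{linux_i686}} build), and {{pip}} picks the one matching the interpreter and architecture it runs on; {{--no-index}} guarantees nothing is fetched from the internet.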