GitHub user holdenk opened a pull request: https://github.com/apache/spark/pull/15659
[WIP][SPARK-1267][SPARK-18129] Allow PySpark to be pip installed

## What changes were proposed in this pull request?

This PR aims to provide a pip-installable PySpark package. It does a bunch of work to copy the JARs over and package them with the Python code (to prevent the challenges that come from mixing different versions of the Python code with different versions of the JARs). It does not currently publish to PyPI, but that is the natural follow-up (SPARK-18129).

Done:
* pip installable on conda [manually tested]
* setup.py install on a non-pip-managed system (RHEL) with YARN [manually tested]
* Automated testing of this (virtualenv)
* Packaging and signing with release-build*

Possible follow-up work:
* release-build update to publish to PyPI (SPARK-18129) - figure out who owns the pyspark package name on production PyPI (is it someone within the project, should we ask PyPI, or should we publish under a different name such as ApachePySpark?)
* Windows support and/or testing (SPARK-18136)
* Investigate the details of wheel caching and see if we can avoid cleaning the wheel cache during our tests
* Consider how we want to number our dev/snapshot versions

Explicitly out of scope:
* Using pip-installed PySpark to start a standalone cluster
* Using pip-installed PySpark for non-Python Spark programs

*I've done some work to test release-build locally, but as a non-committer I've only been able to do local testing.

## How was this patch tested?

Automated testing with virtualenv; manual testing with conda, a system-wide install, and YARN integration. The release-build changes were tested locally as a non-committer (no testing of uploading artifacts to the Apache staging websites).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/holdenk/spark SPARK-1267-pip-install-pyspark

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15659.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #15659

----
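The commit log below traces the packaging approach: since MANIFEST.in and setup.py cannot refer to files above the project root, the JARs and launcher scripts are symlinked into the package tree and declared as extra packages (pyspark.jars, pyspark.bin) with their own package_dir entries, because plain package_data does not cope well with nested directories. As a rough illustration of that technique (not the actual file from this PR; the deps/jars and deps/bin symlink paths, the version string, and the extras pins are assumptions), a minimal setup.py might look like:

```python
# Hypothetical sketch of the packaging technique described in the commits
# below; paths, version, and extras pins are illustrative assumptions.
from setuptools import setup

setup(
    name='pyspark',
    version='2.1.0.dev0',  # assumed; dev/snapshot numbering is an open question above
    packages=['pyspark', 'pyspark.jars', 'pyspark.bin'],
    package_dir={
        # deps/jars and deps/bin stand in for symlinks created before building
        # the sdist, since setup.py cannot reach above the project root directly.
        'pyspark.jars': 'deps/jars',
        'pyspark.bin': 'deps/bin',
    },
    package_data={
        'pyspark.jars': ['*.jar'],
        'pyspark.bin': ['*'],
    },
    install_requires=['py4j==0.10.4'],
    extras_require={'mllib': ['numpy>=1.7']},
)
```

Shipping the JARs as a real package (pyspark.jars) rather than as package_data sidesteps the nested-directory limitation the 2016-10-19 commit mentions.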
commit 7763f3c6d28a3246b40a849150746a220e03a112 | Juliet Hougland <jul...@cloudera.com> | 2016-04-14T14:11:37Z
    Adds setup.py
commit 30debc7e6fa3a502d7991d2dee9cf48a69d92168 | Juliet Hougland <jul...@cloudera.com> | 2016-04-14T16:31:01Z
    Fix spacing.
commit 5155531fce49a0915d6a2187d9adaffc70bfa3f3 | Juliet Hougland <n...@myemail.com> | 2016-10-12T05:54:36Z
    Update py4j dependency. Add mllib to extras_require, fix some indentation.
commit 2f0bf9b89db9a3a9362b73f2130a2c779fb01a76 | Juliet Hougland <n...@myemail.com> | 2016-10-12T06:03:22Z
    Adds MANIFEST.in file.
commit 4c00b989c27bfe883775677cd1d8dfb930c42a51 | Holden Karau <hol...@us.ibm.com> | 2016-10-12T16:44:16Z
    Merge branch 'master' into SPARK-1267-pip-install-pyspark
commit 7ff8d0f465360463d1cd3b503d1d5d8aded7e88f | Holden Karau <hol...@us.ibm.com> | 2016-10-12T17:02:53Z
    Start working towards post-2.0 pip-installable PySpark (including the list of jars, fixing the extras_require declaration, etc.)
commit 610b9752d33a37c261327536bb581bef20d46fd1 | Holden Karau <hol...@us.ibm.com> | 2016-10-16T18:09:17Z
    Merge branch 'master' into SPARK-1267-pip-install-pyspark
commit cb2e06d2e31e113dc29f5212fc9e05ba7d87fa8d | Holden Karau <hol...@us.ibm.com> | 2016-10-16T18:47:52Z
    MANIFEST and setup can't refer to things above the root of the project, so create symlinks so we can package the JARs with it
commit 01f791db9c10378e01321855d33047785ef643b6 | Holden Karau <hol...@us.ibm.com> | 2016-10-18T15:47:04Z
    Merge branch 'master' into SPARK-1267-pip-install-pyspark
commit e2e4d1c9f42522db6ec981e6d650855a58150897 | Holden Karau <hol...@us.ibm.com> | 2016-10-18T16:14:48Z
    Keep the symlink
commit fb15d7e3e6b3be7c8c69d776649f4d556656f3f0 | Holden Karau <hol...@us.ibm.com> | 2016-10-18T17:38:40Z
    Some progress; we need to use sdist, but that is OK
commit aab7ee4fcd3bb4825a91f5c5a9baace9944c68d0 | Holden Karau <hol...@us.ibm.com> | 2016-10-18T20:47:14Z
    Re-enable cleanup
commit 5a5762001946959fbcc96f8daf1510166ad5665e | Holden Karau <hol...@us.ibm.com> | 2016-10-19T14:13:50Z
    Try to provide a clear error message when pip installed directly; fix symlink farm issue; fix scripts issue. TODO: fix SPARK_HOME and find out why JARs aren't ending up in the install
commit 646aa231cc8646b7bde3ec0df455bd64ec48eb00 | Holden Karau <hol...@us.ibm.com> | 2016-10-19T22:56:01Z
    Add two scripts
commit 36c9d45e741929d301ef54dadf33ae56a464f479 | Holden Karau <hol...@us.ibm.com> | 2016-10-19T23:45:18Z
    package_data doesn't work so well with nested directories, so instead add pyspark.bin and pyspark.jars packages and set their package dirs as desired; make the Spark scripts check whether they are in a pip-installed environment, and if SPARK_HOME is unset resolve it with Python [otherwise use the current behaviour]
commit a78754b778c28fe406ac8c60ede7dbea076a19a1 | Holden Karau <hol...@us.ibm.com> | 2016-10-20T00:07:15Z
    Use copyfile; also check for the jars dir
commit 955e92b556b2af3f22acd78e8b800a44d900cb31 | Holden Karau <hol...@us.ibm.com> | 2016-10-20T00:17:26Z
    Check if pip installed when finding the shell file
commit 2d88a40c3c6236715b9fbe3af49dafb0999ccf00 | Holden Karau <hol...@us.ibm.com> | 2016-10-20T00:19:40Z
    Check if the jars dir exists rather than the RELEASE file
commit 9e5c5328e42a462b0f76a2ebad989dfa5b5dcdd5 | Holden Karau <hol...@us.ibm.com> | 2016-10-23T15:52:48Z
    Start working a bit on the docs
commit be7eadd1af3bc26e952f732d1fb4433bc6dd94e3 | Holden Karau <hol...@us.ibm.com> | 2016-10-23T15:53:27Z
    Merge branch 'master' into SPARK-1267-pip-install-pyspark
commit 07d384982caa069e96cc2ac64b9faa9dc19ddc00 | Holden Karau <hol...@us.ibm.com> | 2016-10-23T21:22:59Z
    Try to include the pyspark zip file for YARN use
commit 11b5fa85cbaed0866455a28e88f7868428c36219 | Holden Karau <hol...@us.ibm.com> | 2016-10-23T23:46:28Z
    Copy the pyspark zip for use in YARN cluster mode
commit 8791f829469f163ff195647d6250bee6f53d0dc4 | Holden Karau <hol...@us.ibm.com> | 2016-10-24T12:56:06Z
    Start adding scripts to test pip installability
commit 92837a3a561cf96746c795d11aa60c2e82e6fa2d | Holden Karau <hol...@us.ibm.com> | 2016-10-24T13:40:05Z
    Works on YARN, works with spark-submit; still need to fix the import-based Spark home finder
commit 6947a855f5567eba80b6c3a9cfe97a3fc53fe863 | Holden Karau <hol...@us.ibm.com> | 2016-10-24T14:00:00Z
    Start updating find-spark-home to be available in many cases.
commit 944160cabbaa96ed00a3d6ff4b7ddff9d29d204a | Holden Karau <hol...@us.ibm.com> | 2016-10-24T14:08:51Z
    Switch to find_spark_home.py
commit 5bf0746dea5db4421a6ae8edc96de6d567f460e3 | Holden Karau <hol...@us.ibm.com> | 2016-10-24T14:09:03Z
    Move it under pyspark
commit 435f8427a6ca5bdfae25ba439822e44b7fd4eff4 | Holden Karau <hol...@us.ibm.com> | 2016-10-24T14:13:12Z
    Update to py4j 0.10.4 in the deps; also switch how we copy find_spark_home.py around
commit 27ca27eda451cc4edbdb1811bef4c07bdafc98ef | Holden Karau <hol...@us.ibm.com> | 2016-10-24T14:16:59Z
    Update the Java gateway to use the _find_spark_home function; add a quick sanity-check file
commit df126cf219b9367792e9a25b7d3493b7a060daee | Holden Karau <hol...@us.ibm.com> | 2016-10-24T14:45:23Z
    Lint fixes
----
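Several of the later commits revolve around resolving SPARK_HOME for pip installs: the launcher scripts check whether they are running from a pip-installed environment and, if SPARK_HOME is unset, resolve it with Python via a _find_spark_home helper. A minimal sketch of that idea, assuming the jars directory inside the installed pyspark package marks a usable Spark home (the exact paths and error handling in the PR may differ):

```python
#!/usr/bin/env python
# Sketch of the SPARK_HOME-resolution idea from the commits above; the
# details (paths checked, error handling) are assumptions, not the PR's code.
from __future__ import print_function

import os
import sys


def _find_spark_home():
    """Return SPARK_HOME, inferring it from the pyspark install if unset."""
    if "SPARK_HOME" in os.environ:
        return os.environ["SPARK_HOME"]
    try:
        import pyspark
    except ImportError:
        print("Could not find pyspark; set SPARK_HOME explicitly.", file=sys.stderr)
        sys.exit(-1)
    # In a pip install, the jars/ and bin/ directories live inside the
    # pyspark package itself (see the packaging sketch earlier).
    candidate = os.path.dirname(os.path.realpath(pyspark.__file__))
    if os.path.isdir(os.path.join(candidate, "jars")):
        return candidate
    print("Could not find a valid SPARK_HOME; set it explicitly.", file=sys.stderr)
    sys.exit(-1)


if __name__ == "__main__":
    # A shell launcher could capture this, e.g.:
    #   export SPARK_HOME=$(python find_spark_home.py)
    print(_find_spark_home())
```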