GitHub user holdenk opened a pull request: https://github.com/apache/spark/pull/15659
[WIP][SPARK-1267][SPARK-18129] Allow PySpark to be pip installed

## What changes were proposed in this pull request?

This PR aims to provide a pip-installable PySpark package. It does a bunch of work to copy the JARs over and package them with the Python code (to prevent the challenges that come from mixing different versions of the Python code with different versions of the JARs). It does not currently publish to PyPI, but that is the natural follow-up (SPARK-18129).

Done:
* pip installable on conda [manually tested]
* setup.py install on a non-pip-managed system (RHEL) with YARN [manually tested]
* Automated testing of this (virtualenv)
* Packaging and signing with release-build*

Possible follow-up work:
* release-build update to publish to PyPI (SPARK-18129) - figure out who owns the pyspark package name on production PyPI (is it someone within the project, should we ask PyPI, or should we publish under a different name such as ApachePySpark?)
* Windows support and/or testing (SPARK-18136)
* Investigate the details of wheel caching and see if we can avoid cleaning the wheel cache during our tests
* Consider how we want to number our dev/snapshot versions

Explicitly out of scope:
* Using pip-installed PySpark to start a standalone cluster
* Using pip-installed PySpark for non-Python Spark programs

*I've done some work to test release-build locally, but as a non-committer I've only been able to do local testing.

## How was this patch tested?

Automated testing with virtualenv; manual testing with conda, a system-wide install, and YARN integration. The release-build changes were tested locally as a non-committer (no testing of uploading artifacts to the Apache staging websites).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/holdenk/spark SPARK-1267-pip-install-pyspark

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15659.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #15659

----
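The commit log below traces the packaging approach: since MANIFEST.in and setup.py cannot refer to files above the project root, the JARs and launcher scripts are symlinked into the package tree and declared as extra packages (pyspark.jars, pyspark.bin) with their own package_dir entries, because plain package_data does not cope well with nested directories. As a rough illustration of that technique (not the actual file from this PR; the deps/jars and deps/bin symlink paths, the version string, and the extras pins are assumptions), a minimal setup.py might look like:

```python
# Hypothetical sketch of the packaging technique described in the commits
# below; paths, version, and extras pins are illustrative assumptions.
from setuptools import setup

setup(
    name='pyspark',
    version='2.1.0.dev0',  # assumed; dev/snapshot numbering is an open question above
    packages=['pyspark', 'pyspark.jars', 'pyspark.bin'],
    package_dir={
        # deps/jars and deps/bin stand in for symlinks created before building
        # the sdist, since setup.py cannot reach above the project root directly.
        'pyspark.jars': 'deps/jars',
        'pyspark.bin': 'deps/bin',
    },
    package_data={
        'pyspark.jars': ['*.jar'],
        'pyspark.bin': ['*'],
    },
    install_requires=['py4j==0.10.4'],
    extras_require={'mllib': ['numpy>=1.7']},
)
```

Shipping the JARs as a real package (pyspark.jars) rather than as package_data sidesteps the nested-directory limitation the 2016-10-19 commit mentions.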
commit 7763f3c6d28a3246b40a849150746a220e03a112 | Juliet Hougland <jul...@cloudera.com> | 2016-04-14T14:11:37Z
    Adds setup.py
commit 30debc7e6fa3a502d7991d2dee9cf48a69d92168 | Juliet Hougland <jul...@cloudera.com> | 2016-04-14T16:31:01Z
    Fix spacing.
commit 5155531fce49a0915d6a2187d9adaffc70bfa3f3 | Juliet Hougland <n...@myemail.com> | 2016-10-12T05:54:36Z
    Update py4j dependency. Add mllib to extras_require, fix some indentation.
commit 2f0bf9b89db9a3a9362b73f2130a2c779fb01a76 | Juliet Hougland <n...@myemail.com> | 2016-10-12T06:03:22Z
    Adds MANIFEST.in file.
commit 4c00b989c27bfe883775677cd1d8dfb930c42a51 | Holden Karau <hol...@us.ibm.com> | 2016-10-12T16:44:16Z
    Merge branch 'master' into SPARK-1267-pip-install-pyspark
commit 7ff8d0f465360463d1cd3b503d1d5d8aded7e88f | Holden Karau <hol...@us.ibm.com> | 2016-10-12T17:02:53Z
    Start working towards post-2.0 pip-installable PySpark (including the list of jars, fixing the extras_require declaration, etc.)
commit 610b9752d33a37c261327536bb581bef20d46fd1 | Holden Karau <hol...@us.ibm.com> | 2016-10-16T18:09:17Z
    Merge branch 'master' into SPARK-1267-pip-install-pyspark
commit cb2e06d2e31e113dc29f5212fc9e05ba7d87fa8d | Holden Karau <hol...@us.ibm.com> | 2016-10-16T18:47:52Z
    MANIFEST and setup can't refer to things above the root of the project, so create symlinks so we can package the JARs with it
commit 01f791db9c10378e01321855d33047785ef643b6 | Holden Karau <hol...@us.ibm.com> | 2016-10-18T15:47:04Z
    Merge branch 'master' into SPARK-1267-pip-install-pyspark
commit e2e4d1c9f42522db6ec981e6d650855a58150897 | Holden Karau <hol...@us.ibm.com> | 2016-10-18T16:14:48Z
    Keep the symlink
commit fb15d7e3e6b3be7c8c69d776649f4d556656f3f0 | Holden Karau <hol...@us.ibm.com> | 2016-10-18T17:38:40Z
    Some progress; we need to use sdist, but that is OK
commit aab7ee4fcd3bb4825a91f5c5a9baace9944c68d0 | Holden Karau <hol...@us.ibm.com> | 2016-10-18T20:47:14Z
    Re-enable cleanup
commit 5a5762001946959fbcc96f8daf1510166ad5665e | Holden Karau <hol...@us.ibm.com> | 2016-10-19T14:13:50Z
    Try to provide a clear error message when pip installed directly; fix symlink farm issue; fix scripts issue. TODO: fix SPARK_HOME and find out why JARs aren't ending up in the install
commit 646aa231cc8646b7bde3ec0df455bd64ec48eb00 | Holden Karau <hol...@us.ibm.com> | 2016-10-19T22:56:01Z
    Add two scripts
commit 36c9d45e741929d301ef54dadf33ae56a464f479 | Holden Karau <hol...@us.ibm.com> | 2016-10-19T23:45:18Z
    package_data doesn't work so well with nested directories, so instead add pyspark.bin and pyspark.jars packages and set their package dirs as desired; make the Spark scripts check whether they are in a pip-installed environment, and if SPARK_HOME is unset resolve it with Python [otherwise use the current behaviour]
commit a78754b778c28fe406ac8c60ede7dbea076a19a1 | Holden Karau <hol...@us.ibm.com> | 2016-10-20T00:07:15Z
    Use copyfile; also check for the jars dir
commit 955e92b556b2af3f22acd78e8b800a44d900cb31 | Holden Karau <hol...@us.ibm.com> | 2016-10-20T00:17:26Z
    Check if pip installed when finding the shell file
commit 2d88a40c3c6236715b9fbe3af49dafb0999ccf00 | Holden Karau <hol...@us.ibm.com> | 2016-10-20T00:19:40Z
    Check if the jars dir exists rather than the RELEASE file
commit 9e5c5328e42a462b0f76a2ebad989dfa5b5dcdd5 | Holden Karau <hol...@us.ibm.com> | 2016-10-23T15:52:48Z
    Start working a bit on the docs
commit be7eadd1af3bc26e952f732d1fb4433bc6dd94e3 | Holden Karau <hol...@us.ibm.com> | 2016-10-23T15:53:27Z
    Merge branch 'master' into SPARK-1267-pip-install-pyspark
commit 07d384982caa069e96cc2ac64b9faa9dc19ddc00 | Holden Karau <hol...@us.ibm.com> | 2016-10-23T21:22:59Z
    Try to include the pyspark zip file for YARN use
commit 11b5fa85cbaed0866455a28e88f7868428c36219 | Holden Karau <hol...@us.ibm.com> | 2016-10-23T23:46:28Z
    Copy the pyspark zip for use in YARN cluster mode
commit 8791f829469f163ff195647d6250bee6f53d0dc4 | Holden Karau <hol...@us.ibm.com> | 2016-10-24T12:56:06Z
    Start adding scripts to test pip installability
commit 92837a3a561cf96746c795d11aa60c2e82e6fa2d | Holden Karau <hol...@us.ibm.com> | 2016-10-24T13:40:05Z
    Works on YARN, works with spark-submit; still need to fix the import-based Spark home finder
commit 6947a855f5567eba80b6c3a9cfe97a3fc53fe863 | Holden Karau <hol...@us.ibm.com> | 2016-10-24T14:00:00Z
    Start updating find-spark-home to be available in many cases.
commit 944160cabbaa96ed00a3d6ff4b7ddff9d29d204a | Holden Karau <hol...@us.ibm.com> | 2016-10-24T14:08:51Z
    Switch to find_spark_home.py
commit 5bf0746dea5db4421a6ae8edc96de6d567f460e3 | Holden Karau <hol...@us.ibm.com> | 2016-10-24T14:09:03Z
    Move it under pyspark
commit 435f8427a6ca5bdfae25ba439822e44b7fd4eff4 | Holden Karau <hol...@us.ibm.com> | 2016-10-24T14:13:12Z
    Update to py4j 0.10.4 in the deps; also switch how we copy find_spark_home.py around
commit 27ca27eda451cc4edbdb1811bef4c07bdafc98ef | Holden Karau <hol...@us.ibm.com> | 2016-10-24T14:16:59Z
    Update the Java gateway to use the _find_spark_home function; add a quick sanity-check file
commit df126cf219b9367792e9a25b7d3493b7a060daee | Holden Karau <hol...@us.ibm.com> | 2016-10-24T14:45:23Z
    Lint fixes
----
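Several of the later commits revolve around resolving SPARK_HOME for pip installs: the launcher scripts check whether they are running from a pip-installed environment and, if SPARK_HOME is unset, resolve it with Python via a _find_spark_home helper. A minimal sketch of that idea, assuming the jars directory inside the installed pyspark package marks a usable Spark home (the exact paths and error handling in the PR may differ):

```python
#!/usr/bin/env python
# Sketch of the SPARK_HOME-resolution idea from the commits above; the
# details (paths checked, error handling) are assumptions, not the PR's code.
from __future__ import print_function

import os
import sys


def _find_spark_home():
    """Return SPARK_HOME, inferring it from the pyspark install if unset."""
    if "SPARK_HOME" in os.environ:
        return os.environ["SPARK_HOME"]
    try:
        import pyspark
    except ImportError:
        print("Could not find pyspark; set SPARK_HOME explicitly.", file=sys.stderr)
        sys.exit(-1)
    # In a pip install, the jars/ and bin/ directories live inside the
    # pyspark package itself (see the packaging sketch earlier).
    candidate = os.path.dirname(os.path.realpath(pyspark.__file__))
    if os.path.isdir(os.path.join(candidate, "jars")):
        return candidate
    print("Could not find a valid SPARK_HOME; set it explicitly.", file=sys.stderr)
    sys.exit(-1)


if __name__ == "__main__":
    # A shell launcher could capture this, e.g.:
    #   export SPARK_HOME=$(python find_spark_home.py)
    print(_find_spark_home())
```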