Hi PySpark Developers,

Cloudpickle is a core part of PySpark; it was originally copied from (and improved upon) PiCloud's code. Since then, other projects have found cloudpickle useful, and a fork <https://github.com/cloudpipe/cloudpickle> is now maintained as its own library <https://pypi.python.org/pypi/cloudpickle> (with better test coverage and, I understand, the resulting bug fixes). We've had a few PRs backporting fixes from the cloudpickle project into Spark's local copy of cloudpickle - how would people feel about moving to an explicit (pinned) dependency on cloudpickle?
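For concreteness, here's a rough sketch of what the pinned dependency could look like in setup.py - the version number below is purely illustrative, not a proposal for any specific release:

    # Sketch only: the exact pin is illustrative, not a recommendation.
    from setuptools import setup

    setup(
        name='pyspark',
        # ... existing metadata ...
        install_requires=[
            # Pin an exact version so serialization behavior doesn't
            # change underneath users when cloudpickle releases.
            'cloudpickle==0.2.1',
        ],
    )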
We could add cloudpickle to setup.py (as sketched above) and to a requirements.txt file for users who prefer not to do a system installation of PySpark. Py4J may be an even simpler case: we currently ship a zip of Py4J in our repo, but we could instead require a pinned version. While we do depend on a lot of Py4J internal APIs, pinning the version should be sufficient to ensure functionality (and would simplify the update process).

Cheers,

Holden :)

--
Twitter: https://twitter.com/holdenkarau