Hi PySpark Developers,

Cloudpickle is a core part of PySpark; it was originally copied from (and
improved upon) PiCloud's code. Since then, other projects have found
cloudpickle useful, and a fork <https://github.com/cloudpipe/cloudpickle>
is now maintained as its own library
<https://pypi.python.org/pypi/cloudpickle> (with better test coverage and,
as I understand it, the bug fixes that come with it). We've had a few PRs
backporting fixes from the cloudpickle project into Spark's local copy -
how would people feel about moving to an explicit (pinned) dependency on
cloudpickle instead?

We could add cloudpickle to setup.py, and also to a requirements.txt file
for users who prefer not to do a system installation of PySpark.
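
For concreteness, a rough sketch of what that could look like in setup.py
(the package name is real, the version numbers below are placeholders, not
a proposal for which versions to pin):

    # setup.py -- sketch only; version specifiers are illustrative.
    from setuptools import setup

    setup(
        name="pyspark",
        # ... existing metadata elided ...
        install_requires=[
            # Compatible-range pin: we only use cloudpickle's public
            # API, so patch releases can flow in while major breakage
            # is guarded against.
            "cloudpickle>=0.2.2,<0.3",
        ],
    )

The requirements.txt entry for non-pip installs would just repeat the same
specifier.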

Py4J may be an even simpler case: we currently ship a zip of py4j in our
repo, but could instead require a pinned version. While we do depend on a
lot of py4j internal APIs, version pinning should be sufficient to ensure
functionality (and would simplify the update process).
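
Since we reach into py4j internals, an exact pin (rather than a compatible
range) seems like the safer choice; a sketch, again with placeholder
version numbers:

    # In the same setup.py sketched above -- versions are placeholders.
    install_requires=[
        # Exact pin: PySpark touches py4j internal APIs, so even
        # patch-level drift could break us; upgrades become a
        # deliberate one-line version bump plus a test run.
        "py4j==0.10.4",
        "cloudpickle>=0.2.2,<0.3",
    ],

Upgrading py4j would then mean bumping that one line, rather than checking
a new zip into the repo.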

Cheers,

Holden :)

-- 
Twitter: https://twitter.com/holdenkarau
