With any dependency update (or refactoring of existing code), I always ask
this question: what's the benefit? In this case the benefit seems to be
reducing the effort spent on backports. Do you know how often we have
needed to do those?


On Tue, Feb 14, 2017 at 12:01 AM, Holden Karau <hol...@pigscanfly.ca> wrote:

> Hi PySpark Developers,
>
> Cloudpickle is a core part of PySpark, and was originally copied from
> (and improved upon) picloud. Since then other projects have found
> cloudpickle useful, and a fork of cloudpickle
> <https://github.com/cloudpipe/cloudpickle> was created and is now
> maintained as its own library <https://pypi.python.org/pypi/cloudpickle>
> (with better test coverage and, I understand, resulting bug fixes). We've
> had a few PRs backporting fixes from the cloudpickle project into Spark's
> local copy of cloudpickle - how would people feel about moving to an
> explicit (pinned) dependency on cloudpickle?
>
> We could add cloudpickle to setup.py, and to a requirements.txt file for
> users who prefer not to do a system installation of PySpark (a sketch of
> what this might look like follows below the quoted message).
>
> Py4J is perhaps an even simpler case: we currently have a zip of py4j in
> our repo, but could instead require a pinned version. While we do depend
> on a lot of py4j internal APIs, pinning to an exact version should be
> sufficient to ensure functionality (and would simplify the update
> process). A second sketch below illustrates the import side of this.
>
> Cheers,
>
> Holden :)
>
> --
> Twitter: https://twitter.com/holdenkarau
>
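
For concreteness, here is a minimal sketch of the setup.py /
requirements.txt change proposed above for cloudpickle, assuming
setuptools; the pinned version number is a placeholder, not something
decided in the thread:

    # setup.py (sketch): declare cloudpickle as a pinned runtime dependency
    # instead of vendoring a copy of it inside the Spark repo.
    from setuptools import setup

    setup(
        name="pyspark",
        version="2.2.0.dev0",   # placeholder version
        packages=["pyspark"],
        install_requires=[
            # An exact pin, since PySpark relies on cloudpickle internals;
            # the version number is a placeholder.
            "cloudpickle==0.2.2",
        ],
    )

A requirements.txt for users running PySpark from a source checkout would
carry the same pinned line:

    cloudpickle==0.2.2

After such a switch, PySpark code would import the library directly
(import cloudpickle, then cloudpickle.dumps / cloudpickle.loads, which are
part of its public API) rather than the vendored pyspark.cloudpickle
module.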
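
Similarly for py4j, the install_requires pin would mirror the one above
(py4j==0.10.4, again a placeholder), and imports would resolve to the
installed package instead of the zip bundled in the repo. The snippet
below is only a standalone illustration and assumes a Py4J gateway server
is already running on the JVM side, which PySpark normally starts itself:

    # Sketch: with py4j installed from PyPI (pinned), PySpark imports it
    # like any other library; JavaGateway is part of py4j's public API.
    from py4j.java_gateway import JavaGateway

    gateway = JavaGateway()   # connect to a JVM gateway server on localhost
    jvm = gateway.jvm         # entry point for calling into the JVM
    print(jvm.java.lang.System.currentTimeMillis())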
