Re: [PYTHON][DISCUSS] Moving to cloudpickle and/or Py4J as dependencies?

2017-02-13 Thread Reynold Xin
With any dependency update (or refactoring of existing code), I always ask
this question: what's the benefit? In this case it looks like the benefit
is to reduce efforts in backports. Do you know how often we needed to do
those?


On Tue, Feb 14, 2017 at 12:01 AM, Holden Karau  wrote:

> Hi PySpark Developers,
>
> Cloudpickle is a core part of PySpark, and was originally copied from (and
> improved on) picloud. Since then other projects have found cloudpickle
> useful, and a fork of cloudpickle was created that is now maintained as its
> own library (with better test coverage and resulting bug fixes, I
> understand). We've had a few
> PRs backporting fixes from the cloudpickle project into Spark's local copy
> of cloudpickle - how would people feel about moving to taking an explicit
> (pinned) dependency on cloudpickle?
>
> We could add cloudpickle to the setup.py and a requirements.txt file for
> users who prefer not to do a system installation of PySpark.
>
> Py4J is maybe an even simpler case: we currently have a zip of py4j in our
> repo but could instead require a pinned version. While we do depend on a
> lot of py4j internal APIs, version pinning should be sufficient to ensure
> functionality (and simplify the update process).
>
> Cheers,
>
> Holden :)
>
> --
> Twitter: https://twitter.com/holdenkarau
>
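
As a rough sketch of what the proposal above could look like in practice
(illustrative only - the version numbers and metadata below are assumptions,
not what Spark actually ships), pinning both packages in setup.py might be
as simple as:

    # Hypothetical sketch, not Spark's actual setup.py: cloudpickle and Py4J
    # declared as pinned install-time dependencies.
    from setuptools import setup

    setup(
        name='pyspark',
        version='2.2.0.dev0',      # illustrative
        packages=['pyspark'],
        install_requires=[
            'py4j==0.10.4',        # pinned: PySpark relies on py4j internals
            'cloudpickle==0.2.1',  # pinned: serialization behavior must match
        ],
    )

A requirements.txt for users who prefer not to do a system installation
would just repeat the same two pinned lines (py4j==0.10.4,
cloudpickle==0.2.1).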


Re: [PYTHON][DISCUSS] Moving to cloudpickle and/or Py4J as dependencies?

2017-02-13 Thread Holden Karau
It's a good question. Py4J seems to have been updated 5 times in 2016, and
each update is a bit involved (from a review point of view, verifying the
zip file contents is somewhat tedious).

cloudpickle is a bit harder to quantify, since we can have changes to
cloudpickle which aren't correctly tagged as backports from the fork (and
these can take a while to review, since we don't always catch them right
away as being backports).

Another difficulty with looking at backports is that since our review
process for PySpark has historically been on the slow side, changes
benefiting systems like dask or IPython parallel were not backported to
Spark unless they caused serious errors.

I think the key benefits are better test coverage of the forked version of
cloudpickle, more standardized packaging of dependencies, and simpler
dependency updates, which reduce the friction of picking up improvements
from related projects' work - Python serialization really isn't our secret
sauce.

If I'm missing any substantial benefits or costs I'd love to know :)
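
To make the vendored-copy vs. explicit-dependency distinction concrete,
here is a minimal sketch (not existing Spark code; the helper name is made
up) of how PySpark-side code could prefer an externally installed, pinned
cloudpickle and fall back to the bundled copy:

    # Minimal sketch, assuming the externally packaged cloudpickle is
    # preferred and pyspark's bundled cloudpickle module stays as a fallback.
    try:
        import cloudpickle                # external, pinned dependency
    except ImportError:
        from pyspark import cloudpickle   # vendored copy shipped with Spark

    def serialize_closure(func):
        # Hypothetical helper: pickle a function the way closures are
        # shipped to executors.
        return cloudpickle.dumps(func)

Either import path exposes a compatible dumps() interface, which is what
makes swapping the vendored copy for the pinned package low-friction.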

On Mon, Feb 13, 2017 at 3:03 PM, Reynold Xin  wrote:

> With any dependency update (or refactoring of existing code), I always ask
> this question: what's the benefit? In this case it looks like the benefit
> is to reduce efforts in backports. Do you know how often we needed to do
> those?
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: [PYTHON][DISCUSS] Moving to cloudpickle and/or Py4J as dependencies?

2017-02-14 Thread Maciej Szymkiewicz
I don't have any strong views, so just to highlight possible issues:

  * Based on different issues I've seen, there is a substantial number of
    users who depend on system-wide Python installations. As far as I am
    aware, neither Py4j nor cloudpickle is present in the standard system
    repositories for Debian or Red Hat derivatives.
  * Assuming that Spark is committed to supporting Python 2 beyond its end
    of life, we have to be sure that any external dependency has the same
    policy.
  * Py4j is missing from the default Anaconda channel. Not a big issue,
    just a small annoyance.
  * External dependencies with pinned versions add some overhead to
    development across versions (effectively we may need a separate env for
    each major Spark release). I've seen small inconsistencies in PySpark
    behavior with different Py4j versions, so this is not completely
    hypothetical (see the version-check sketch after this list).
  * Adding possible version conflicts. It is probably not a big risk, but
    something to consider (for example in a Blaze + Dask + PySpark
    combination).
  * Adding another party the user has to trust.
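
A minimal sketch of the version-check idea referenced above (editorial and
illustrative only; it assumes py4j exposes its version string via
py4j.version, and the expected version is made up):

    # Minimal sketch: warn when the installed Py4J differs from the version
    # PySpark was tested against. The version string below is illustrative.
    import warnings

    from py4j.version import __version__ as py4j_version

    EXPECTED_PY4J = "0.10.4"

    if py4j_version != EXPECTED_PY4J:
        warnings.warn(
            "PySpark expects Py4J %s but found %s installed; behavior may "
            "differ." % (EXPECTED_PY4J, py4j_version)
        )

Emitting a warning rather than failing outright keeps minor version drift
usable while making the behavioral inconsistencies mentioned above easier
to diagnose.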


On 02/14/2017 12:22 AM, Holden Karau wrote:
> It's a good question. Py4J seems to have been updated 5 times in 2016,
> and each update is a bit involved (from a review point of view,
> verifying the zip file contents is somewhat tedious).
>
> cloudpickle is a bit harder to quantify, since we can have changes to
> cloudpickle which aren't correctly tagged as backports from the fork
> (and these can take a while to review, since we don't always catch
> them right away as being backports).
>
> Another difficulty with looking at backports is that since our review
> process for PySpark has historically been on the slow side, changes
> benefiting systems like dask or IPython parallel were not backported
> to Spark unless they caused serious errors.
>
> I think the key benefits are better test coverage of the forked
> version of cloudpickle, more standardized packaging of dependencies,
> and simpler dependency updates, which reduce the friction of picking
> up improvements from related projects' work - Python serialization
> really isn't our secret sauce.
>
> If I'm missing any substantial benefits or costs I'd love to know :)

-- 
Maciej Szymkiewicz