I agree with Robert on this one. With the exception of DillCoder, it might be reasonable to conditionally support both. (On a related note, I only see one use of DillCoder; do we really need that coder?)
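
To make "conditionally support both" concrete, a minimal sketch of the kind of facade I have in mind (the module-level switch is purely hypothetical - in practice it would presumably be driven by a pipeline option, and this is not actual Beam code):

    import cloudpickle
    import dill

    _PICKLER = 'dill'  # hypothetical switch; set to 'cloudpickle' to opt in

    def dumps(obj):
        # Serialize with whichever pickler is configured.
        if _PICKLER == 'cloudpickle':
            return cloudpickle.dumps(obj)
        return dill.dumps(obj)

    def loads(data):
        # Deserialize with the matching pickler so behavior stays symmetric
        # for objects that rely on pickler-specific reducers.
        if _PICKLER == 'cloudpickle':
            return cloudpickle.loads(data)
        return dill.loads(data)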
On Mon, May 3, 2021 at 5:01 AM Kenneth Knowles <[email protected]> wrote:

> My 2 contradictory cents from even further back in the peanut gallery:
>
> - Pickle/dill/cloudpickle/etc are most suitable for transmission, not storage, so changing is a lesser breaking change. But there still might be streaming pipelines that are using it and cannot be updated.

I believe the problem arises from transmission between a client and a worker when they are not using compatible libraries. There is much history here (https://github.com/uqfoundation/dill/issues/341), but I do not think dill is necessarily at fault. I think we will need to ensure that compatible libraries are used in both environments.

> - A serialization library with an unstable/breaking serialization format is not production-ready. If open version ranges are unsafe, that is an indication that it is not ready for use.

It was a bug, and it happened twice, according to the maintainer (see the issue above). I am not sure it would be much better with a different library.

> - We should use whatever the rest of the world uses. That is more important than either of the above two points.

Both are similarly popular. Looks like dill is a bit (50%) more popular (https://libraries.io/pypi/dill vs https://libraries.io/pypi/cloudpickle) and has a more recent release.

> Kenn
>
> On Sat, May 1, 2021 at 5:15 AM Jarek Potiuk <[email protected]> wrote:
>
>> Just my 2 cents from the user's perspective.
>>
>> In Airflow, the narrow version limits of `dill` caused some problems with dependencies. We had to add some exceptions in our process for that:
>> https://github.com/apache/airflow/blob/master/Dockerfile#L246
>> https://github.com/apache/airflow/blob/master/Dockerfile.ci#L271
>> - so the problem is largely solved for now, but if dill were used by any other library it could be a problem. I imagine cloudpickle is more frequently used than dill, so it might become a problem if those dependencies are narrowly defined.
>>
>> Currently, cloudpickle for Airflow is already pulled in by Dask's "distributed" library (but they have just >= limits there):
>>
>> distributed==2.19.0
>>   - click [required: >=6.6, installed: 7.1.2]
>>   - cloudpickle [required: >=1.3.0, installed: 1.4.1]
>>   - dask [required: >=2.9.0, installed: 2021.4.1]
>>     - cloudpickle [required: >=1.1.1, installed: 1.4.1]
>>     - fsspec [required: >=0.6.0, installed: 2021.4.0]
>>
>> However, I have a better idea - why don't you simply vendor in either `dill` or `cloudpickle` (I am not sure which one is best)?
>>
>> Since you are not planning to upgrade it often (that's the whole point of narrow versioning), you can have the best of both worlds - a stable version used on both client and server, AND you would not be limiting others.
>>
>> J.
>>
>> On Fri, Apr 30, 2021 at 9:42 PM Stephan Hoyer <[email protected]> wrote:
>>
>>> Glad to hear this is something you're open to and in fact have already considered :)
>>>
>>> I may give implementing this a try, though I'm not familiar with how configuration options are managed in Beam, so that may be easier for a core developer to deal with.
>>>
>>> On Fri, Apr 30, 2021 at 10:58 AM Robert Bradshaw <[email protected]> wrote:
>>>
>>>> As I've mentioned before, I would be in favor of moving to cloudpickle, first as an option and, if that works out well, as the default. In particular, pickling functions from the main session in a hermetic (if sometimes slightly bulkier) way, as opposed to the main-session pickling gymnastics, is far preferable (especially for interactive use).
>>>>
>>>> Versioning is an issue in general, and a tradeoff between the overhead of re-building the worker every time (either custom containers or at runtime) vs. risking different versions, and we could possibly do better on both fronts more generally (as well as making this tradeoff clear). Fair point that cloudpickle is less likely to just work with pinning. On the other hand, cloudpickle looks fairly mature/stable at this point, so hopefully it wouldn't be too hard to keep our containers close to head. If there is an error, we could consider catching it and raising a more explicit message about the versions things were pickled vs. unpickled with.
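>>>>
>>>> Roughly, that might look like this (a sketch only - it assumes the submission-side pickler version was somehow recorded and shipped to the worker, which we don't do today):
>>>>
>>>>     import cloudpickle
>>>>
>>>>     def loads_or_explain(data, client_version=None):
>>>>         # client_version would need to be captured at submission time
>>>>         # and shipped with the pipeline; here it is a plain argument.
>>>>         try:
>>>>             return cloudpickle.loads(data)
>>>>         except Exception as exc:
>>>>             raise RuntimeError(
>>>>                 'Unpickling failed: payload pickled with cloudpickle %s, '
>>>>                 'unpickled with %s. A pickler version mismatch between '
>>>>                 'submission and execution environments is a likely cause.'
>>>>                 % (client_version or 'unknown', cloudpickle.__version__)
>>>>             ) from exc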
>>>> I would welcome as a first step a PR that conditionally allows the use of CloudPickle in the place of Dill (with the exception of DillCoder - there should of course probably be a separate CloudPickleCoder).
>>>>
>>>> On Fri, Apr 30, 2021 at 10:17 AM Valentyn Tymofieiev <[email protected]> wrote:
>>>> >
>>>> > On Fri, Apr 30, 2021 at 9:53 AM Brian Hulette <[email protected]> wrote:
>>>> >>
>>>> >> > I think with cloudpickle we will not be able to have a tight range.
>>>> >>
>>>> >> If cloudpickle is backwards compatible, we should be able to just keep an upper bound in setup.py [1] synced up with a pinned version in base_image_requirements.txt [2], right?
>>>> >
>>>> > With an upper bound only, the dependency resolver could still downgrade the pickler on the runner's side; ideally we should be detecting that.
>>>> >
>>>> > Also, if we ever depend on newer functionality, we would add a lower bound as well, which (for that particular Beam release) makes it a tight bound, so potentially a friction point.
>>>> >
>>>> >> > We could solve this problem by passing the version of the pickler used at job submission
>>>> >>
>>>> >> A bit of a digression, but it may be worth considering something more general here, for a couple of reasons:
>>>> >> - I've had a similar concern for the Beam DataFrame API. Our goal is for it to match the behavior of the pandas version used at construction time, but we could get into some surprising edge cases if the version of pandas used to compute partial results in the SDK harness is different.
>>>> >> - Occasionally we have Dataflow customers report NameErrors/AttributeErrors that can be attributed to a dependency mismatch. It would be nice to proactively warn about this.
>>>> >>
>>>> >> That being said, I imagine it would be hard to do something truly general, since every dependency will have different compatibility guarantees.
>>>> >
>>>> > I think it should be considered a best practice to have matching dependencies on the job submission and execution sides. We can:
>>>> > 1) consider sending a manifest of all locally installed dependencies to the runner and verify on the runner's side that critical dependencies are compatible (a rough sketch follows below);
>>>> > 2) help make it easier to ensure the dependencies match:
>>>> >    - leverage the container prebuilding workflow to construct the runner's container on the SDK side, with knowledge of the locally-installed dependency versions;
>>>> >    - document how to launch a pipeline from the SDK container, especially for pipelines using a custom container. This would guarantee an exact match of dependencies, and can also prevent Python minor version mismatches. Some runners can make this easier with features like Dataflow Flex Templates.
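>>>> >
>>>> > For (1), roughly (a sketch only - the set of "critical" packages and the exact-equality policy are placeholders, and importlib.metadata needs Python 3.8+):
>>>> >
>>>> >     import importlib.metadata
>>>> >     import logging
>>>> >
>>>> >     CRITICAL_PACKAGES = ['dill', 'cloudpickle', 'pandas']  # placeholder list
>>>> >
>>>> >     def submission_manifest():
>>>> >         # Collected on the user's machine at job submission time.
>>>> >         manifest = {}
>>>> >         for name in CRITICAL_PACKAGES:
>>>> >             try:
>>>> >                 manifest[name] = importlib.metadata.version(name)
>>>> >             except importlib.metadata.PackageNotFoundError:
>>>> >                 pass  # not installed locally, nothing to compare
>>>> >         return manifest
>>>> >
>>>> >     def verify_manifest(manifest):
>>>> >         # Runs on the runner. Exact equality is the strictest policy;
>>>> >         # a real check might allow compatible ranges instead.
>>>> >         for name, client_version in manifest.items():
>>>> >             try:
>>>> >                 runner_version = importlib.metadata.version(name)
>>>> >             except importlib.metadata.PackageNotFoundError:
>>>> >                 runner_version = None
>>>> >             if runner_version != client_version:
>>>> >                 logging.warning('%s: submission side has %s, runner has %s',
>>>> >                                 name, client_version, runner_version)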
>>>> >
>>>> >> [1] https://github.com/apache/beam/blob/master/sdks/python/setup.py
>>>> >> [2] https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt
>>>> >>
>>>> >> On Fri, Apr 30, 2021 at 9:34 AM Valentyn Tymofieiev <[email protected]> wrote:
>>>> >>>
>>>> >>> Hi Stephan,
>>>> >>>
>>>> >>> Thanks for reaching out. We first considered switching to cloudpickle when adding Python 3 support [1], and there is a tracking issue [2]. We were able to fix or work around the missing Py3 features in dill, although some are still not working for us [3].
>>>> >>>
>>>> >>> I agree that Beam can and should support cloudpickle as a pickler. Practically, we can make cloudpickle the default pickler starting from a particular Python version; for example, we are planning to add Python 3.9 support, and we can try to make cloudpickle the default pickler for that version to avoid breaking users while ironing out rough edges.
>>>> >>>
>>>> >>> My main concern is client-server version compatibility of the pickler. When the SDK creates the job representation, it serializes the objects using the pickler installed on the user's machine. When the SDK deserializes the objects on the runner side, it uses the pickler installed on the runner - for example, a dill version installed in the docker container provided by Beam or Dataflow. We have been burned in the past by having an open version bound for the pickler in Beam's requirements: the client side would pick the newest version, but the runner container would have a somewhat older version, either because the container did not have the new version, or because some pipeline dependency wanted to downgrade dill. The older version of the pickler did not correctly deserialize new pickles. I suspect cloudpickle may have the same problem. A solution was to have a very tight version range for the pickler in the SDK's requirements [4]. Given that dill is not a popular dependency, the tight range did not create much friction for Beam users. I think with cloudpickle we will not be able to have a tight range. We could solve this problem by passing the version of the pickler used at job submission, and having a check on the runner to make sure that the client version is not newer than the runner's version. Additionally, we should make sure cloudpickle is backwards compatible (a newer version can deserialize objects created by an older version).
>>>> >>>
>>>> >>> [1] https://lists.apache.org/thread.html/d431664a3fc1039faa01c10e2075659288aec5961c7b4b59d9f7b889%40%3Cdev.beam.apache.org%3E
>>>> >>> [2] https://issues.apache.org/jira/browse/BEAM-8123
>>>> >>> [3] https://github.com/uqfoundation/dill/issues/300#issuecomment-525409202
>>>> >>> [4] https://github.com/apache/beam/blob/master/sdks/python/setup.py#L138-L143
>>>> >>>
>>>> >>> On Thu, Apr 29, 2021 at 8:04 PM Stephan Hoyer <[email protected]> wrote:
>>>> >>>>
>>>> >>>> cloudpickle [1] and dill [2] are two Python packages that implement extensions of Python's pickle protocol for arbitrary objects. Beam currently uses dill, but I'm wondering if we could consider additionally or alternatively using cloudpickle.
>>>> >>>>
>>>> >>>> Overall, cloudpickle seems to be the more popular choice for extended pickle support in distributed computing in Python, e.g., it's used by Spark, Dask and joblib.
>>>> >>>>
>>>> >>>> One of the major differences between cloudpickle and dill is how they handle pickling global variables (such as Python modules) that are referred to by a function:
>>>> >>>> - Dill doesn't serialize globals. If you want to save globals, you need to call dill.dump_session(). This is what the "save_main_session" flag does in Beam.
>>>> >>>> - Cloudpickle takes a different approach. It introspects which global variables are used by a function, and creates a closure around the serialized function that contains only these variables.
>>>> >>>>
>>>> >>>> The cloudpickle approach results in larger serialized functions, but it's also much more robust, because the required globals are included by default. In contrast, with dill, one either needs to save all globals or none. This is a repeated pain point for Beam Python users [3]:
>>>> >>>> - Saving all globals can be overly aggressive, particularly in notebooks, where users may have incidentally created large objects.
>>>> >>>> - Alternatively, users can avoid using global variables entirely, but this makes defining ad-hoc pipelines very awkward. Mapped-over functions need to be imported from other modules, or need to have their imports defined inside the function itself.
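>>>> >>>>
>>>> >>>> To make the difference concrete, here is a toy example (unpickling in a fresh process is what a Beam worker effectively does; the names are mine, not Beam's):
>>>> >>>>
>>>> >>>>     import cloudpickle
>>>> >>>>
>>>> >>>>     OFFSET = 10  # a module-level global referenced by the function below
>>>> >>>>
>>>> >>>>     def add_offset(x):
>>>> >>>>         return x + OFFSET
>>>> >>>>
>>>> >>>>     # cloudpickle sees that add_offset refers to OFFSET and captures its
>>>> >>>>     # value alongside the function, so plain pickle.loads(blob) works in
>>>> >>>>     # a fresh process that never defined OFFSET.
>>>> >>>>     blob = cloudpickle.dumps(add_offset)
>>>> >>>>
>>>> >>>>     # dill.dumps(add_offset) by default does not capture OFFSET: calling
>>>> >>>>     # the unpickled function in another process raises NameError unless
>>>> >>>>     # the session was saved with dill.dump_session() (save_main_session).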
>>>> >>>>
>>>> >>>> I'd love to see an option to use cloudpickle in Beam instead of dill, and to consider switching over entirely. Cloudpickle would allow Beam users to write readable code in the way they expect, without needing to worry about the confusing and potentially problematic "save_main_session" flag.
>>>> >>>>
>>>> >>>> Any thoughts?
>>>> >>>>
>>>> >>>> Cheers,
>>>> >>>> Stephan
>>>> >>>>
>>>> >>>> [1] https://github.com/cloudpipe/cloudpickle
>>>> >>>> [2] https://github.com/uqfoundation/dill
>>>> >>>> [3] https://cloud.google.com/dataflow/docs/resources/faq#how_do_i_handle_nameerrors
>>
>> --
>> +48 660 796 129
