Glad to hear this is something you're open to and in fact have already considered :)

I may give implementing this a try, though I'm not familiar with how
configuration options are managed in Beam, so that may be easier for a core
developer to deal with.
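For what it's worth, the naive shape I had in mind is a small indirection
layer, roughly like the sketch below. All of the names here (set_pickler,
the idea of a pickle_library option) are invented, since I don't know how
Beam's options actually plumb through to the pickler module; the coder side
would presumably still need the separate CloudPickleCoder Robert mentions
below:

    # Hypothetical sketch only; not Beam's actual internals.
    import base64

    import cloudpickle
    import dill

    _PICKLERS = {"dill": dill, "cloudpickle": cloudpickle}
    _active = dill  # keep current behavior as the default

    def set_pickler(name):
        """Select the backend, e.g. from a pipeline option at submission."""
        global _active
        _active = _PICKLERS[name]

    def dumps(obj):
        # Callers see the same encoded-bytes contract with either backend.
        return base64.b64encode(_active.dumps(obj))

    def loads(encoded):
        return _active.loads(base64.b64decode(encoded))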
On Fri, Apr 30, 2021 at 10:58 AM Robert Bradshaw <rober...@google.com> wrote:
> As I've mentioned before, I would be in favor of moving to cloudpickle,
> first as an option, and if that works out well, as the default. In
> particular, pickling functions from the main session in a hermetic (if
> sometimes slightly bulkier) way, as opposed to the main-session pickling
> gymnastics, is far preferable (especially for interactive use).
>
> Versioning is an issue in general, and a tradeoff between the overheads
> of re-building the worker every time (either custom containers or at
> runtime) vs. risking different versions, and we could possibly do better
> more generally on both fronts (as well as making this tradeoff clear).
> Fair point that cloudpickle is less likely to just work with pinning. On
> the other hand, cloudpickle looks fairly mature/stable at this point, so
> hopefully it wouldn't be too hard to keep our containers close to head.
> If there is an error, we could consider catching it and raising a more
> explicit message about the versions things were pickled vs. unpickled
> with.
>
> I would welcome as a first step a PR that conditionally allows the use
> of cloudpickle in place of dill (with the exception of DillCoder; there
> should of course probably be a separate CloudPickleCoder).
>
> On Fri, Apr 30, 2021 at 10:17 AM Valentyn Tymofieiev
> <valen...@google.com> wrote:
> >
> > On Fri, Apr 30, 2021 at 9:53 AM Brian Hulette <bhule...@google.com> wrote:
> >>
> >> > I think with cloudpickle we will not be able to have a tight range.
> >>
> >> If cloudpickle is backwards compatible, we should be able to just
> >> keep an upper bound in setup.py [1] synced up with a pinned version
> >> in base_image_requirements.txt [2], right?
> >
> > With an upper bound only, the dependency resolver could still
> > downgrade the pickler on the runner's side; ideally we should be
> > detecting that.
> >
> > Also, if we ever depend on newer functionality, we would add a lower
> > bound as well, which (for that particular Beam release) makes it a
> > tight bound, so potentially a friction point.
> >
> >> > We could solve this problem by passing the version of pickler used
> >> > at job submission
> >>
> >> A bit of a digression, but it may be worth considering something more
> >> general here, for a couple of reasons:
> >> - I've had a similar concern for the Beam DataFrame API. Our goal is
> >> for it to match the behavior of the pandas version used at
> >> construction time, but we could get into some surprising edge cases
> >> if the version of pandas used to compute partial results in the SDK
> >> harness is different.
> >> - Occasionally we have Dataflow customers report
> >> NameErrors/AttributeErrors that can be attributed to a dependency
> >> mismatch. It would be nice to proactively warn about this.
> >>
> >> That being said, I imagine it would be hard to do something truly
> >> general, since every dependency will have different compatibility
> >> guarantees.
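Chiming in inline: to make the "proactively warn" idea concrete, I imagine
something like the following, recorded at submission time and checked in the
worker. This is only a sketch: CRITICAL_PACKAGES and both function names are
invented, it glosses over how the snapshot travels with the job, and
importlib.metadata assumes Python 3.8+:

    import logging
    from importlib import metadata

    CRITICAL_PACKAGES = ["dill", "cloudpickle", "numpy", "pandas"]

    def snapshot_versions():
        """Run at job submission; ship the result alongside the pipeline."""
        versions = {}
        for name in CRITICAL_PACKAGES:
            try:
                versions[name] = metadata.version(name)
            except metadata.PackageNotFoundError:
                pass  # not installed locally, nothing to compare later
        return versions

    def warn_on_mismatch(submitted):
        """Run in the SDK harness against the submission-time snapshot."""
        for name, client_version in submitted.items():
            try:
                worker_version = metadata.version(name)
            except metadata.PackageNotFoundError:
                worker_version = None
            if worker_version != client_version:
                logging.warning(
                    "%s was %s at submission but is %s on the worker; this "
                    "mismatch can cause NameErrors/AttributeErrors.",
                    name, client_version, worker_version)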
> > I think it should be considered a best practice to have matching
> > dependencies on the job submission and execution sides. We can:
> > 1) consider sending a manifest of all locally installed dependencies
> > to the runner and verifying on the runner's side that critical
> > dependencies are compatible.
> > 2) help make it easier to ensure the dependencies match:
> > - leverage the container prebuilding workflow to construct the
> > runner's container on the SDK side, with knowledge of locally
> > installed dependency versions.
> > - document how to launch a pipeline from the SDK container, especially
> > for pipelines using a custom container. This would guarantee an exact
> > match of dependencies, and can also prevent Python minor version
> > mismatches. Some runners can make this easier with features like
> > Dataflow Flex Templates.
> >
> >> [1] https://github.com/apache/beam/blob/master/sdks/python/setup.py
> >> [2] https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt
> >>
> >> On Fri, Apr 30, 2021 at 9:34 AM Valentyn Tymofieiev
> >> <valen...@google.com> wrote:
> >>>
> >>> Hi Stephan,
> >>>
> >>> Thanks for reaching out. We first considered switching to
> >>> cloudpickle when adding Python 3 support [1], and there is a
> >>> tracking issue [2]. We were able to fix or work around missing Py3
> >>> features in dill, although some are still not working for us [3].
> >>> I agree that Beam can and should support cloudpickle as a pickler.
> >>> Practically, we can make cloudpickle the default pickler starting
> >>> from a particular Python version; for example, we are planning to
> >>> add Python 3.9 support, and we can try to make cloudpickle the
> >>> default pickler for that version, to avoid breaking users while
> >>> ironing out rough edges.
> >>>
> >>> My main concern is client-server version compatibility of the
> >>> pickler. When the SDK creates the job representation, it serializes
> >>> the objects using the pickler installed on the user's machine. When
> >>> the SDK deserializes the objects on the runner side, it uses the
> >>> pickler installed on the runner, which can be, for example, a dill
> >>> version installed in the docker container provided by Beam or
> >>> Dataflow. We have been burned in the past by having an open version
> >>> bound for the pickler in Beam's requirements: the client side would
> >>> pick the newest version, but the runner container would have a
> >>> somewhat older version, either because the container did not have
> >>> the new version, or because some pipeline dependency wanted to
> >>> downgrade dill. The older version of the pickler did not correctly
> >>> deserialize new pickles. I suspect cloudpickle may have the same
> >>> problem. A solution was to have a very tight version range for the
> >>> pickler in the SDK's requirements [4]. Given that dill is not a
> >>> popular dependency, the tight range did not create much friction
> >>> for Beam users. I think with cloudpickle we will not be able to
> >>> have a tight range. We could solve this problem by passing the
> >>> version of the pickler used at job submission, and having a check
> >>> on the runner to make sure that the client version is not newer
> >>> than the runner's version. Additionally, we should make sure
> >>> cloudpickle is backwards compatible (a newer version can
> >>> deserialize objects created by an older version).
> >>>
> >>> [1] https://lists.apache.org/thread.html/d431664a3fc1039faa01c10e2075659288aec5961c7b4b59d9f7b889%40%3Cdev.beam.apache.org%3E
> >>> [2] https://issues.apache.org/jira/browse/BEAM-8123
> >>> [3] https://github.com/uqfoundation/dill/issues/300#issuecomment-525409202
> >>> [4] https://github.com/apache/beam/blob/master/sdks/python/setup.py#L138-L143
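The narrower pickler-version check Valentyn describes could be as small as
the following on the runner side. Again just a sketch: how client_version
ships with the job is hand-waved, and it assumes the third-party "packaging"
module is available for version comparison:

    import dill
    from packaging.version import Version

    def check_pickler_version(client_version,
                              runner_version=dill.__version__):
        """Refuse to unpickle if the submitting client had a newer pickler."""
        if Version(client_version) > Version(runner_version):
            raise RuntimeError(
                "Pipeline was pickled with dill %s but the runner has %s; "
                "an older pickler may not read newer pickles. Pin the "
                "pickler or rebuild the worker container."
                % (client_version, runner_version))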
> >>> On Thu, Apr 29, 2021 at 8:04 PM Stephan Hoyer <sho...@google.com> wrote:
> >>>>
> >>>> cloudpickle [1] and dill [2] are two Python packages that implement
> >>>> extensions of Python's pickle protocol for arbitrary objects. Beam
> >>>> currently uses dill, but I'm wondering if we could consider
> >>>> additionally or alternatively using cloudpickle.
> >>>>
> >>>> Overall, cloudpickle seems to be the more popular choice for
> >>>> extended pickle support in distributed computing in Python; e.g.,
> >>>> it's used by Spark, Dask and joblib.
> >>>>
> >>>> One of the major differences between cloudpickle and dill is how
> >>>> they handle pickling global variables (such as Python modules)
> >>>> that are referred to by a function:
> >>>> - dill doesn't serialize globals. If you want to save globals, you
> >>>> need to call dill.dump_session(). This is what the
> >>>> "save_main_session" flag does in Beam.
> >>>> - cloudpickle takes a different approach. It introspects which
> >>>> global variables are used by a function, and creates a closure
> >>>> around the serialized function that contains only these variables.
> >>>>
> >>>> The cloudpickle approach results in larger serialized functions,
> >>>> but it's also much more robust, because the required globals are
> >>>> included by default. In contrast, with dill, one either needs to
> >>>> save all globals or none. This is a repeated pain point for Beam
> >>>> Python users [3]:
> >>>> - Saving all globals can be overly aggressive, particularly in
> >>>> notebooks where users may have incidentally created large objects.
> >>>> - Alternatively, users can avoid using global variables entirely,
> >>>> but this makes defining ad-hoc pipelines very awkward. Mapped-over
> >>>> functions need to be imported from other modules, or need to have
> >>>> their imports defined inside the function itself.
> >>>>
> >>>> I'd love to see an option to use cloudpickle in Beam instead of
> >>>> dill, and to consider switching over entirely. cloudpickle would
> >>>> allow Beam users to write readable code in the way they expect,
> >>>> without needing to worry about the confusing and potentially
> >>>> problematic "save_main_session" flag.
> >>>>
> >>>> Any thoughts?
> >>>>
> >>>> Cheers,
> >>>> Stephan
> >>>>
> >>>> [1] https://github.com/cloudpipe/cloudpickle
> >>>> [2] https://github.com/uqfoundation/dill
> >>>> [3] https://cloud.google.com/dataflow/docs/resources/faq#how_do_i_handle_nameerrors
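Finally, for anyone who wants to see the globals difference concretely, here
is a minimal repro of the failure mode being described. It assumes the
snippet runs as a script (so the function lives in __main__ and both
libraries pickle it by value, with dill at its default settings), and it
simulates a fresh worker by deleting the global before unpickling:

    import cloudpickle
    import dill

    SCALE = 10

    def times_scale(x):
        return x * SCALE

    dill_bytes = dill.dumps(times_scale)       # code only; SCALE stays behind
    cp_bytes = cloudpickle.dumps(times_scale)  # code plus the SCALE it uses

    del SCALE  # pretend we are now on a worker that never defined SCALE

    print(cloudpickle.loads(cp_bytes)(3))  # 30: the global traveled along
    try:
        dill.loads(dill_bytes)(3)
    except NameError as err:
        print("dill:", err)  # the save_main_session pain point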