On Fri, Apr 30, 2021 at 9:53 AM Brian Hulette <[email protected]> wrote:

> > I think with cloudpickle we will not be able have a tight range.
>
> If cloudpickle is backwards compatible, we should be able to just keep an
> upper bound in setup.py [1] synced up with a pinned version in
> base_image_requirements.txt [2], right?
>

With an upper bound only, the dependency resolver could still downgrade the
pickler on the runner's side; ideally, we should detect that.

Also, if we ever depend on newer functionality, we would add a lower bound
as well, which (for that particular Beam release) makes it a tight bound,
and hence a potential friction point.
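
For illustration only (the version numbers below are placeholders, not the
actual pins in setup.py), the two bound styles look roughly like this:

    # Illustrative requirement specifiers; not Beam's actual pins.
    install_requires = [
        # Tight range: the client and the runner end up on essentially the
        # same pickler release, avoiding pickle-format skew at the cost of
        # friction when another dependency wants a different version.
        "dill>=0.3.1.1,<0.3.2",
        # Upper bound only: more flexible, but the resolver can still leave
        # an older pickler on the runner than the one used at submission.
        "cloudpickle<2.0",
    ]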


>
> > We could solve this problem by passing the version of pickler used at
> job submission
>
> A bit of a digression, but it may be worth considering something more
> general here, for a couple of reasons:
> - I've had a similar concern for the Beam DataFrame API. Our goal is for
> it to match the behavior of the pandas version used at construction time,
> but we could get into some surprising edge cases if the version of pandas
> used to compute partial results in the SDK harness is different.
> - Occasionally we have Dataflow customers report
> NameErrors/AttributeErrors that can be attributed to a dependency mismatch.
> It would be nice to proactively warn about this.
>
>
> That being said, I imagine it would be hard to do something truly general
> since every dependency will have different compatibility guarantees.
>
I think it should be considered a best practice to have matching
dependencies on the job submission and execution sides. We can:
1) consider sending a manifest of all locally installed dependencies to
the runner and verifying on the runner's side that critical dependencies
are compatible (a rough sketch follows below);
2) help make it easier to ensure the dependencies match:
  - leverage the container prebuilding workflow to construct the runner's
container on the SDK side, with knowledge of the locally installed
dependency versions;
  - document how to launch a pipeline from the SDK container, especially
for pipelines using a custom container. This would guarantee an exact match
of dependencies and can also prevent a Python minor version mismatch. Some
runners make this easier with features like Dataflow Flex Templates.
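
To make (1) concrete, here is a rough sketch (not an existing Beam API) of
what the submission-side manifest and the runner-side check could look like:

    # Sketch only; not an existing Beam API, and the list of "critical"
    # packages is illustrative.
    import importlib.metadata
    import logging

    CRITICAL_DEPS = ["dill", "cloudpickle", "protobuf"]

    def submission_manifest():
        """Snapshot the versions installed at job submission time."""
        manifest = {}
        for name in CRITICAL_DEPS:
            try:
                manifest[name] = importlib.metadata.version(name)
            except importlib.metadata.PackageNotFoundError:
                manifest[name] = None  # not installed locally
        return manifest

    def check_manifest_on_runner(manifest):
        """Warn if a critical dependency differs from the submission side."""
        for name, submitted in manifest.items():
            try:
                installed = importlib.metadata.version(name)
            except importlib.metadata.PackageNotFoundError:
                installed = None
            if installed != submitted:
                logging.warning(
                    "Dependency mismatch for %s: submitted with %s, "
                    "runner has %s", name, submitted, installed)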



> [1] https://github.com/apache/beam/blob/master/sdks/python/setup.py
> [2]
> https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt
>
> On Fri, Apr 30, 2021 at 9:34 AM Valentyn Tymofieiev <[email protected]>
> wrote:
>
>> Hi Stephan,
>>
>> Thanks for reaching out. We first considered switching to cloudpickle
>> when adding Python 3 support [1], and there is a tracking issue [2]. We
>> were able to fix or work around the missing Py3 features in dill, although
>> some are still not working for us [3].
>> I agree that Beam can and should support cloudpickle as a pickler.
>> Practically, we can make cloudpickle the default pickler starting from a
>> particular Python version. For example, we are planning to add Python 3.9
>> support, and we could try to make cloudpickle the default pickler for that
>> version to avoid breaking users while ironing out rough edges.
>>
>> My main concern is client-server version range compatibility of the
>> pickler. When the SDK creates the job representation, it serializes the
>> objects using the pickler installed on the user's machine. When the
>> objects are deserialized on the runner's side, the pickler installed on
>> the runner is used, for example a dill version installed in the Docker
>> container provided by Beam or Dataflow. We have been burned in the past by
>> having an open version bound for the pickler in Beam's requirements: the
>> client side would pick the newest version, but the runner container would
>> have a somewhat older one, either because the container did not have the
>> new version, or because some pipeline dependency wanted to downgrade dill.
>> The older version of the pickler did not correctly deserialize the new
>> pickles. I suspect cloudpickle may have the same problem. A solution was
>> to have a very tight version range for the pickler in the SDK's
>> requirements [4]. Given that dill is not a popular dependency, the tight
>> range did not create much friction for Beam users. I think with
>> cloudpickle we will not be able to have a tight range. We could solve this
>> problem by passing the version of the pickler used at job submission, and
>> having a check on the runner to make sure that the client version is not
>> newer than the runner's version. Additionally, we should make sure
>> cloudpickle is backwards compatible (a newer version can deserialize
>> objects created by an older version).
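
(For illustration, such a runner-side check might look roughly like the
sketch below; it assumes the client's pickler version is shipped with the
job, which is not something Beam does today.)

    # Sketch only; assumes the client's dill version travels with the job,
    # e.g. in pipeline options, which is not an existing Beam mechanism.
    import dill
    from packaging import version

    def verify_pickler_version(client_version):
        runner_version = dill.__version__
        if version.parse(client_version) > version.parse(runner_version):
            raise RuntimeError(
                "Job was submitted with dill %s but the runner has dill %s; "
                "newer pickles may not deserialize correctly."
                % (client_version, runner_version))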
>>
>> [1]
>> https://lists.apache.org/thread.html/d431664a3fc1039faa01c10e2075659288aec5961c7b4b59d9f7b889%40%3Cdev.beam.apache.org%3E
>> [2] https://issues.apache.org/jira/browse/BEAM-8123
>> [3]
>> https://github.com/uqfoundation/dill/issues/300#issuecomment-525409202
>> [4]
>> https://github.com/apache/beam/blob/master/sdks/python/setup.py#L138-L143
>>
>> On Thu, Apr 29, 2021 at 8:04 PM Stephan Hoyer <[email protected]> wrote:
>>
>>> cloudpickle [1] and dill [2] are two Python packages that implement
>>> extensions of Python's pickle protocol for arbitrary objects. Beam
>>> currently uses dill, but I'm wondering if we could consider additionally
>>> or alternatively using cloudpickle instead.
>>>
>>> Overall, cloudpickle seems to be a more popular choice for extended
>>> pickle support in distributed computing in Python, e.g., it's used by
>>> Spark, Dask and joblib.
>>>
>>> One of the major differences between cloudpickle and dill is how they
>>> handle pickling global variables (such as Python modules) that are referred
>>> to by a function:
>>> - Dill doesn't serialize globals. If you want to save globals, you need
>>> to call dill.dump_session(). This is what the "save_main_session" flag does
>>> in Beam.
>>> - Cloudpickle takes a different approach. It introspects which global
>>> variables are used by a function, and creates a closure around the
>>> serialized function that only contains these variables.
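
(A small example of the closure behaviour, assuming the function is defined
in __main__, e.g. in a notebook:)

    import cloudpickle

    SCALE = 3  # module-level global referenced by the function below

    def triple(x):
        return x * SCALE

    # For functions defined in __main__, cloudpickle pickles the function by
    # value and bundles the globals it actually uses (here, SCALE), so the
    # payload can be unpickled in a worker that never imported this module.
    payload = cloudpickle.dumps(triple)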
>>>
>>> The cloudpickle approach results in larger serialized functions, but
>>> it's also much more robust, because the required globals are included by
>>> default. In contrast, with dill, one either needs to save *all* globals
>>> or none. This is a repeated pain point for Beam Python users [3]:
>>> - Saving all globals can be overly aggressive, particularly in notebooks
>>> where users may have incidentally created large objects.
>>> - Alternatively, users can avoid using global variables entirely, but
>>> this makes defining ad-hoc pipelines very awkward. Mapped-over functions
>>> need to be imported from other modules, or need to have their imports
>>> defined inside the function itself.
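
(For reference, the dill-based workaround today is to set the flag on the
pipeline options:)

    # Pickle the whole __main__ session so that module-level imports and
    # globals are available on the workers.
    from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

    options = PipelineOptions()
    options.view_as(SetupOptions).save_main_session = True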
>>>
>>> I'd love to see an option to use cloudpickle in Beam instead of dill,
>>> and to consider switching over entirely. Cloudpickle would allow Beam users
>>> to write readable code in the way they expect, without needing to worry
>>> about the confusing and potentially problematic "save_main_session" flag.
>>>
>>> Any thoughts?
>>>
>>> Cheers,
>>> Stephan
>>>
>>> [1] https://github.com/cloudpipe/cloudpickle
>>> [2] https://github.com/uqfoundation/dill
>>> [3]
>>> https://cloud.google.com/dataflow/docs/resources/faq#how_do_i_handle_nameerrors
>>>
>>>
