On Fri, Apr 30, 2021 at 9:53 AM Brian Hulette <[email protected]> wrote:
> > I think with cloudpickle we will not be able to have a tight range.
>
> If cloudpickle is backwards compatible, we should be able to just keep an upper bound in setup.py [1] synced up with a pinned version in base_image_requirements.txt [2], right?

With an upper bound only, the dependency resolver could still downgrade the pickler on the runner's side; ideally we should be detecting that. Also, if we ever depend on newer functionality, we would add a lower bound as well, which (for that particular Beam release) makes it a tight bound, so it is potentially a friction point.

> > We could solve this problem by passing the version of the pickler used at job submission
>
> A bit of a digression, but it may be worth considering something more general here, for a couple of reasons:
> - I've had a similar concern for the Beam DataFrame API. Our goal is for it to match the behavior of the pandas version used at construction time, but we could get into some surprising edge cases if the version of pandas used to compute partial results in the SDK harness is different.
> - Occasionally we have Dataflow customers report NameErrors/AttributeErrors that can be attributed to a dependency mismatch. It would be nice to proactively warn about this.
>
> That being said, I imagine it would be hard to do something truly general since every dependency will have different compatibility guarantees.

I think it should be considered a best practice to have matching dependencies on the job submission and execution sides. We can:
1) consider sending a manifest of all locally installed dependencies to the runner and verifying on the runner's side that critical dependencies are compatible;
2) help make it easier to ensure the dependencies match:
- leverage the container prebuilding workflow to construct the runner's container on the SDK side, with knowledge of the locally installed dependency versions;
- document how to launch a pipeline from the SDK container, especially for pipelines using a custom container. This would guarantee an exact match of dependencies, and it can also prevent a Python minor version mismatch. Some runners can make this easier with features like Dataflow Flex Templates.
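As a rough illustration of the manifest idea in (1) above, here is a minimal sketch of what a submission-side manifest and a runner-side check could look like. The function names and the list of packages below are hypothetical, not an existing Beam API.

import logging
from importlib.metadata import version, PackageNotFoundError

# Hypothetical set of dependencies whose versions should match across the
# submission and execution environments.
CRITICAL_PACKAGES = ["dill", "cloudpickle", "protobuf", "numpy"]

def collect_submission_manifest(packages=CRITICAL_PACKAGES):
    # Runs on the user's machine at job submission; the resulting dict would
    # be shipped to the runner alongside the job.
    manifest = {}
    for name in packages:
        try:
            manifest[name] = version(name)
        except PackageNotFoundError:
            manifest[name] = None  # not installed locally
    return manifest

def verify_manifest_on_runner(manifest):
    # Runs in the SDK harness; warns (or could fail) on a mismatch.
    for name, submitted in manifest.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if submitted != installed:
            logging.warning(
                "Dependency mismatch for %s: submission side had %s, "
                "runner has %s.", name, submitted, installed)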
> [1] https://github.com/apache/beam/blob/master/sdks/python/setup.py
> [2] https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt
>
> On Fri, Apr 30, 2021 at 9:34 AM Valentyn Tymofieiev <[email protected]> wrote:
>
>> Hi Stephan,
>>
>> Thanks for reaching out. We first considered switching to cloudpickle when adding Python 3 support [1], and there is a tracking issue [2]. We were able to fix or work around the missing Py3 features in dill, although some are still not working for us [3]. I agree that Beam can and should support cloudpickle as a pickler. Practically, we can make cloudpickle the default pickler starting from a particular Python version; for example, we are planning to add Python 3.9 support, and we can try to make cloudpickle the default pickler for that version to avoid breaking users while ironing out rough edges.
>>
>> My main concern is client-server version compatibility of the pickler. When the SDK creates the job representation, it serializes the objects using the pickler used on the user's machine. When the SDK deserializes the objects on the runner side, it uses the pickler installed on the runner, for example the dill version installed in the docker container provided by Beam or Dataflow. We have been burned in the past by having an open version bound for the pickler in Beam's requirements: the client side would pick the newest version, but the runner container would have a somewhat older version, either because the container did not have the new version, or because some pipeline dependency wanted to downgrade dill. The older version of the pickler did not correctly deserialize the new pickles. I suspect cloudpickle may have the same problem. A solution was to have a very tight version range for the pickler in the SDK's requirements [4]. Given that dill is not a popular dependency, the tight range did not create much friction for Beam users. I think with cloudpickle we will not be able to have a tight range. We could solve this problem by passing the version of the pickler used at job submission, and having a check on the runner to make sure that the client version is not newer than the runner's version. Additionally, we should make sure cloudpickle is backwards compatible (a newer version can deserialize objects created by an older version).
>>
>> [1] https://lists.apache.org/thread.html/d431664a3fc1039faa01c10e2075659288aec5961c7b4b59d9f7b889%40%3Cdev.beam.apache.org%3E
>> [2] https://issues.apache.org/jira/browse/BEAM-8123
>> [3] https://github.com/uqfoundation/dill/issues/300#issuecomment-525409202
>> [4] https://github.com/apache/beam/blob/master/sdks/python/setup.py#L138-L143
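As a sketch of the version check described in the quoted message above (record the pickler version at submission, verify it on the runner), assuming the third-party 'packaging' library for version comparisons and a hypothetical helper name; the same pattern would apply to cloudpickle.

import dill
from packaging.version import Version  # third-party 'packaging' library

def check_pickler_version(submission_version: str) -> None:
    # submission_version would be recorded on the user's machine at job
    # submission time (e.g. dill.__version__) and shipped with the job.
    runner_version = dill.__version__
    if Version(submission_version) > Version(runner_version):
        raise RuntimeError(
            "Pipeline was submitted with dill %s, but the runner has dill %s; "
            "pickles written by a newer pickler may not deserialize correctly."
            % (submission_version, runner_version))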
>>
>> On Thu, Apr 29, 2021 at 8:04 PM Stephan Hoyer <[email protected]> wrote:
>>
>>> cloudpickle [1] and dill [2] are two Python packages that implement extensions of Python's pickle protocol for arbitrary objects. Beam currently uses dill, but I'm wondering if we could consider additionally or alternatively using cloudpickle instead.
>>>
>>> Overall, cloudpickle seems to be a more popular choice for extended pickle support in distributed computing in Python; e.g., it's used by Spark, Dask and joblib.
>>>
>>> One of the major differences between cloudpickle and dill is how they handle pickling global variables (such as Python modules) that are referred to by a function:
>>> - Dill doesn't serialize globals. If you want to save globals, you need to call dill.dump_session(). This is what the "save_main_session" flag does in Beam.
>>> - Cloudpickle takes a different approach. It introspects which global variables are used by a function, and creates a closure around the serialized function that contains only these variables.
>>>
>>> The cloudpickle approach results in larger serialized functions, but it's also much more robust, because the required globals are included by default. In contrast, with dill, one either needs to save *all* globals or none. This is a repeated pain point for Beam Python users [3]:
>>> - Saving all globals can be overly aggressive, particularly in notebooks where users may have incidentally created large objects.
>>> - Alternatively, users can avoid using global variables entirely, but this makes defining ad-hoc pipelines very awkward. Mapped-over functions need to be imported from other modules, or need to have their imports defined inside the function itself.
>>>
>>> I'd love to see an option to use cloudpickle in Beam instead of dill, and to consider switching over entirely. Cloudpickle would allow Beam users to write readable code in the way they expect, without needing to worry about the confusing and potentially problematic "save_main_session" flag.
>>>
>>> Any thoughts?
>>>
>>> Cheers,
>>> Stephan
>>>
>>> [1] https://github.com/cloudpipe/cloudpickle
>>> [2] https://github.com/uqfoundation/dill
>>> [3] https://cloud.google.com/dataflow/docs/resources/faq#how_do_i_handle_nameerrors
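To illustrate the globals-handling difference Stephan describes above, here is a small sketch. It assumes the function is defined in an interactive __main__ session and is later unpickled in a separate worker process; numpy is used only as a stand-in for any module a user function relies on.

import pickle
import cloudpickle
import numpy as np  # stands in for any module referenced by a user function

SCALE = 10  # a global defined in the user's __main__ session

def scale(x):
    return np.float64(x) * SCALE

# cloudpickle introspects scale() and includes the globals it actually uses
# (a reference to the numpy module and the value of SCALE) in the payload,
# so the function can be restored with the standard pickle module in another
# process without saving the whole main session:
payload = cloudpickle.dumps(scale)
restored = pickle.loads(payload)
assert restored(2) == 20

# With dill's default behavior, the restored function instead looks up 'np'
# and 'SCALE' in the unpickling process's __main__; in a worker where they
# are not defined, calling it raises NameError unless the whole session was
# saved via dill.dump_session(), which is what save_main_session does ([3]).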
