I agree with Robert on this one. With the exception of DillCoder, it might be reasonable to conditionally support both. (On a related note, I only see one use of DillCoder; do we really need that coder?)
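
To make "conditionally support both" concrete, a minimal sketch of the kind of facade I have in mind (the module-level switch is purely hypothetical - in practice it would presumably be driven by a pipeline option, and this is not actual Beam code):

    import cloudpickle
    import dill

    _PICKLER = 'dill'  # hypothetical switch; set to 'cloudpickle' to opt in

    def dumps(obj):
        # Serialize with whichever pickler is configured.
        if _PICKLER == 'cloudpickle':
            return cloudpickle.dumps(obj)
        return dill.dumps(obj)

    def loads(data):
        # Deserialize with the matching pickler so behavior stays symmetric
        # for objects that rely on pickler-specific reducers.
        if _PICKLER == 'cloudpickle':
            return cloudpickle.loads(data)
        return dill.loads(data)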
On Mon, May 3, 2021 at 5:01 AM Kenneth Knowles <[email protected]> wrote:

> My 2 contradictory cents from even further back in the peanut gallery:
>
> - Pickle/dill/cloudpickle/etc are most suitable for transmission, not storage, so changing is a lesser breaking change. But there still might be streaming pipelines that are using it and cannot be updated.

I believe the problem arises from transmission between a client and a worker when they are not using compatible libraries. There is much history here (https://github.com/uqfoundation/dill/issues/341), but I do not think dill is necessarily at fault. I think we will need to ensure that compatible libraries are used in both environments.

> - A serialization library with an unstable/breaking serialization format is not production-ready. If open version ranges are unsafe, that is an indication that it is not ready for use.

It was a bug, and it happened twice, according to the maintainer (see the issue above). I am not sure it would be much better with a different library.

> - We should use whatever the rest of the world uses. That is more important than either of the above two points.

Both are similarly popular. Looks like dill is a bit (50%) more popular (https://libraries.io/pypi/dill vs https://libraries.io/pypi/cloudpickle) and has a more recent release.

> Kenn
>
> On Sat, May 1, 2021 at 5:15 AM Jarek Potiuk <[email protected]> wrote:
>
>> Just my 2 cents from the user's perspective.
>>
>> In Airflow, the narrow version limits of `dill` caused some problems with dependencies. We had to add some exceptions in our process for that:
>> https://github.com/apache/airflow/blob/master/Dockerfile#L246
>> https://github.com/apache/airflow/blob/master/Dockerfile.ci#L271
>> - so the problem is largely solved for now, but if dill were used by any other library it could be a problem. I imagine cloudpickle is more frequently used than dill, so it might become a problem if those dependencies are narrowly defined.
>>
>> Currently, cloudpickle for Airflow is already pulled in by Dask's "distributed" library (but they have just >= limits there):
>>
>> distributed==2.19.0
>>   - click [required: >=6.6, installed: 7.1.2]
>>   - cloudpickle [required: >=1.3.0, installed: 1.4.1]
>>   - dask [required: >=2.9.0, installed: 2021.4.1]
>>     - cloudpickle [required: >=1.1.1, installed: 1.4.1]
>>     - fsspec [required: >=0.6.0, installed: 2021.4.0]
>>
>> However, I have a better idea - why don't you simply vendor in either `dill` or `cloudpickle` (I am not sure which one is best)?
>>
>> Since you are not planning to upgrade it often (that's the whole point of narrow versioning), you can have the best of both worlds - a stable version used on both client and server, AND you would not be limiting others.
>>
>> J.
>>
>> On Fri, Apr 30, 2021 at 9:42 PM Stephan Hoyer <[email protected]> wrote:
>>
>>> Glad to hear this is something you're open to and in fact have already considered :)
>>>
>>> I may give implementing this a try, though I'm not familiar with how configuration options are managed in Beam, so that may be easier for a core developer to deal with.
>>>
>>> On Fri, Apr 30, 2021 at 10:58 AM Robert Bradshaw <[email protected]> wrote:
>>>
>>>> As I've mentioned before, I would be in favor of moving to cloudpickle, first as an option and, if that works out well, as the default. In particular, pickling functions from the main session in a hermetic (if sometimes slightly bulkier) way, as opposed to the main-session pickling gymnastics, is far preferable (especially for interactive use).
>>>>
>>>> Versioning is an issue in general, and a tradeoff between the overhead of re-building the worker every time (either custom containers or at runtime) vs. risking different versions, and we could possibly do better on both fronts more generally (as well as making this tradeoff clear). Fair point that cloudpickle is less likely to just work with pinning. On the other hand, cloudpickle looks fairly mature/stable at this point, so hopefully it wouldn't be too hard to keep our containers close to head. If there is an error, we could consider catching it and raising a more explicit message about the versions things were pickled vs. unpickled with.
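>>>>
>>>> Roughly, that might look like this (a sketch only - it assumes the submission-side pickler version was somehow recorded and shipped to the worker, which we don't do today):
>>>>
>>>>     import cloudpickle
>>>>
>>>>     def loads_or_explain(data, client_version=None):
>>>>         # client_version would need to be captured at submission time
>>>>         # and shipped with the pipeline; here it is a plain argument.
>>>>         try:
>>>>             return cloudpickle.loads(data)
>>>>         except Exception as exc:
>>>>             raise RuntimeError(
>>>>                 'Unpickling failed: payload pickled with cloudpickle %s, '
>>>>                 'unpickled with %s. A pickler version mismatch between '
>>>>                 'submission and execution environments is a likely cause.'
>>>>                 % (client_version or 'unknown', cloudpickle.__version__)
>>>>             ) from exc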
>>>> I would welcome as a first step a PR that conditionally allows the use of CloudPickle in the place of Dill (with the exception of DillCoder - there should of course probably be a separate CloudPickleCoder).
>>>>
>>>> On Fri, Apr 30, 2021 at 10:17 AM Valentyn Tymofieiev <[email protected]> wrote:
>>>> >
>>>> > On Fri, Apr 30, 2021 at 9:53 AM Brian Hulette <[email protected]> wrote:
>>>> >>
>>>> >> > I think with cloudpickle we will not be able to have a tight range.
>>>> >>
>>>> >> If cloudpickle is backwards compatible, we should be able to just keep an upper bound in setup.py [1] synced up with a pinned version in base_image_requirements.txt [2], right?
>>>> >
>>>> > With an upper bound only, the dependency resolver could still downgrade the pickler on the runner's side; ideally we should be detecting that.
>>>> >
>>>> > Also, if we ever depend on newer functionality, we would add a lower bound as well, which (for that particular Beam release) makes it a tight bound, so potentially a friction point.
>>>> >
>>>> >> > We could solve this problem by passing the version of the pickler used at job submission
>>>> >>
>>>> >> A bit of a digression, but it may be worth considering something more general here, for a couple of reasons:
>>>> >> - I've had a similar concern for the Beam DataFrame API. Our goal is for it to match the behavior of the pandas version used at construction time, but we could get into some surprising edge cases if the version of pandas used to compute partial results in the SDK harness is different.
>>>> >> - Occasionally we have Dataflow customers report NameErrors/AttributeErrors that can be attributed to a dependency mismatch. It would be nice to proactively warn about this.
>>>> >>
>>>> >> That being said, I imagine it would be hard to do something truly general, since every dependency will have different compatibility guarantees.
>>>> >
>>>> > I think it should be considered a best practice to have matching dependencies on the job submission and execution sides. We can:
>>>> > 1) consider sending a manifest of all locally installed dependencies to the runner and verify on the runner's side that critical dependencies are compatible (a rough sketch follows below);
>>>> > 2) help make it easier to ensure the dependencies match:
>>>> >    - leverage the container prebuilding workflow to construct the runner's container on the SDK side, with knowledge of the locally-installed dependency versions;
>>>> >    - document how to launch a pipeline from the SDK container, especially for pipelines using a custom container. This would guarantee an exact match of dependencies, and can also prevent Python minor version mismatches. Some runners can make this easier with features like Dataflow Flex Templates.
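>>>> >
>>>> > For (1), roughly (a sketch only - the set of "critical" packages and the exact-equality policy are placeholders, and importlib.metadata needs Python 3.8+):
>>>> >
>>>> >     import importlib.metadata
>>>> >     import logging
>>>> >
>>>> >     CRITICAL_PACKAGES = ['dill', 'cloudpickle', 'pandas']  # placeholder list
>>>> >
>>>> >     def submission_manifest():
>>>> >         # Collected on the user's machine at job submission time.
>>>> >         manifest = {}
>>>> >         for name in CRITICAL_PACKAGES:
>>>> >             try:
>>>> >                 manifest[name] = importlib.metadata.version(name)
>>>> >             except importlib.metadata.PackageNotFoundError:
>>>> >                 pass  # not installed locally, nothing to compare
>>>> >         return manifest
>>>> >
>>>> >     def verify_manifest(manifest):
>>>> >         # Runs on the runner. Exact equality is the strictest policy;
>>>> >         # a real check might allow compatible ranges instead.
>>>> >         for name, client_version in manifest.items():
>>>> >             try:
>>>> >                 runner_version = importlib.metadata.version(name)
>>>> >             except importlib.metadata.PackageNotFoundError:
>>>> >                 runner_version = None
>>>> >             if runner_version != client_version:
>>>> >                 logging.warning('%s: submission side has %s, runner has %s',
>>>> >                                 name, client_version, runner_version)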
>>>> >
>>>> >> [1] https://github.com/apache/beam/blob/master/sdks/python/setup.py
>>>> >> [2] https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt
>>>> >>
>>>> >> On Fri, Apr 30, 2021 at 9:34 AM Valentyn Tymofieiev <[email protected]> wrote:
>>>> >>>
>>>> >>> Hi Stephan,
>>>> >>>
>>>> >>> Thanks for reaching out. We first considered switching to cloudpickle when adding Python 3 support [1], and there is a tracking issue [2]. We were able to fix or work around the missing Py3 features in dill, although some are still not working for us [3].
>>>> >>>
>>>> >>> I agree that Beam can and should support cloudpickle as a pickler. Practically, we can make cloudpickle the default pickler starting from a particular Python version; for example, we are planning to add Python 3.9 support, and we can try to make cloudpickle the default pickler for that version to avoid breaking users while ironing out rough edges.
>>>> >>>
>>>> >>> My main concern is client-server version compatibility of the pickler. When the SDK creates the job representation, it serializes the objects using the pickler installed on the user's machine. When the SDK deserializes the objects on the runner side, it uses the pickler installed on the runner - for example, a dill version installed in the docker container provided by Beam or Dataflow. We have been burned in the past by having an open version bound for the pickler in Beam's requirements: the client side would pick the newest version, but the runner container would have a somewhat older version, either because the container did not have the new version, or because some pipeline dependency wanted to downgrade dill. The older version of the pickler did not correctly deserialize new pickles. I suspect cloudpickle may have the same problem. A solution was to have a very tight version range for the pickler in the SDK's requirements [4]. Given that dill is not a popular dependency, the tight range did not create much friction for Beam users. I think with cloudpickle we will not be able to have a tight range. We could solve this problem by passing the version of the pickler used at job submission, and having a check on the runner to make sure that the client version is not newer than the runner's version. Additionally, we should make sure cloudpickle is backwards compatible (a newer version can deserialize objects created by an older version).
>>>> >>>
>>>> >>> [1] https://lists.apache.org/thread.html/d431664a3fc1039faa01c10e2075659288aec5961c7b4b59d9f7b889%40%3Cdev.beam.apache.org%3E
>>>> >>> [2] https://issues.apache.org/jira/browse/BEAM-8123
>>>> >>> [3] https://github.com/uqfoundation/dill/issues/300#issuecomment-525409202
>>>> >>> [4] https://github.com/apache/beam/blob/master/sdks/python/setup.py#L138-L143
>>>> >>>
>>>> >>> On Thu, Apr 29, 2021 at 8:04 PM Stephan Hoyer <[email protected]> wrote:
>>>> >>>>
>>>> >>>> cloudpickle [1] and dill [2] are two Python packages that implement extensions of Python's pickle protocol for arbitrary objects. Beam currently uses dill, but I'm wondering if we could consider additionally or alternatively using cloudpickle.
>>>> >>>>
>>>> >>>> Overall, cloudpickle seems to be the more popular choice for extended pickle support in distributed computing in Python, e.g., it's used by Spark, Dask and joblib.
>>>> >>>>
>>>> >>>> One of the major differences between cloudpickle and dill is how they handle pickling global variables (such as Python modules) that are referred to by a function:
>>>> >>>> - Dill doesn't serialize globals. If you want to save globals, you need to call dill.dump_session(). This is what the "save_main_session" flag does in Beam.
>>>> >>>> - Cloudpickle takes a different approach. It introspects which global variables are used by a function, and creates a closure around the serialized function that contains only these variables.
>>>> >>>>
>>>> >>>> The cloudpickle approach results in larger serialized functions, but it's also much more robust, because the required globals are included by default. In contrast, with dill, one either needs to save all globals or none. This is a repeated pain point for Beam Python users [3]:
>>>> >>>> - Saving all globals can be overly aggressive, particularly in notebooks, where users may have incidentally created large objects.
>>>> >>>> - Alternatively, users can avoid using global variables entirely, but this makes defining ad-hoc pipelines very awkward. Mapped-over functions need to be imported from other modules, or need to have their imports defined inside the function itself.
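>>>> >>>>
>>>> >>>> To make the difference concrete, here is a toy example (unpickling in a fresh process is what a Beam worker effectively does; the names are mine, not Beam's):
>>>> >>>>
>>>> >>>>     import cloudpickle
>>>> >>>>
>>>> >>>>     OFFSET = 10  # a module-level global referenced by the function below
>>>> >>>>
>>>> >>>>     def add_offset(x):
>>>> >>>>         return x + OFFSET
>>>> >>>>
>>>> >>>>     # cloudpickle sees that add_offset refers to OFFSET and captures its
>>>> >>>>     # value alongside the function, so plain pickle.loads(blob) works in
>>>> >>>>     # a fresh process that never defined OFFSET.
>>>> >>>>     blob = cloudpickle.dumps(add_offset)
>>>> >>>>
>>>> >>>>     # dill.dumps(add_offset) by default does not capture OFFSET: calling
>>>> >>>>     # the unpickled function in another process raises NameError unless
>>>> >>>>     # the session was saved with dill.dump_session() (save_main_session).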
>>>> >>>>
>>>> >>>> I'd love to see an option to use cloudpickle in Beam instead of dill, and to consider switching over entirely. Cloudpickle would allow Beam users to write readable code in the way they expect, without needing to worry about the confusing and potentially problematic "save_main_session" flag.
>>>> >>>>
>>>> >>>> Any thoughts?
>>>> >>>>
>>>> >>>> Cheers,
>>>> >>>> Stephan
>>>> >>>>
>>>> >>>> [1] https://github.com/cloudpipe/cloudpickle
>>>> >>>> [2] https://github.com/uqfoundation/dill
>>>> >>>> [3] https://cloud.google.com/dataflow/docs/resources/faq#how_do_i_handle_nameerrors
>>
>> --
>> +48 660 796 129
