Hey all,

I wrote up a quick update on the status of replacing dill:
https://docs.google.com/document/d/1XypNkB0ujc-U2hy9PuJNYj6asY3tZyGoKvfFDbRux_Y/edit?usp=sharing
There is one remaining blocker (dill is used to deterministically encode special types by default) that I discuss further in
https://docs.google.com/document/d/1nJLdUMdgLM4MfUOQBDZQfPZWl4LE2wze1Acq6Dn2Ts4/edit?usp=sharing

Please take a look at the second doc especially; I'd appreciate any feedback to get this moving forward.

Best,
Claude

On Wed, Apr 30, 2025 at 12:54 PM Valentyn Tymofieiev via dev <dev@beam.apache.org> wrote:

> Ah yes, and no more saving the main session :)
>
> > FWIW - I noticed that the DataFlow Options documentation[1] for setting
> > the pickling library and the Beam documentation
>
> Thanks for bringing it up. The doc is outdated; the issue was fixed in
> https://github.com/apache/beam/issues/21615 .
>
> On Wed, Apr 30, 2025 at 6:28 AM Joey Tran <joey.t...@schrodinger.com> wrote:
>
>> Wow this is fantastic! I tested it out and it worked great for my runner.
>> I am also excited for this change now and will eagerly set `cloudpickle` as
>> the default pickler for our code.
>>
>> FWIW - I noticed that the DataFlow Options documentation[1] for setting
>> the pickling library and the Beam documentation [2] for setting the
>> pickling library differ slightly. The DataFlow documentation mentions
>> having to set `pickler.set_library(pickler.USE_CLOUDPICKLE)` while the Beam
>> documentation doesn't say anything about that. It turned out to be unnecessary
>> for my runner - not sure if it's just a DataFlow runner specific
>> requirement, but just wanted to point it out.
>>
>> [1] https://cloud.google.com/dataflow/docs/reference/pipeline-options#pythonyaml
>> [2] https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#pickling-and-managing-the-main-session
>>
>> On Wed, Apr 30, 2025 at 2:38 AM Robert Bradshaw <rober...@waymo.com> wrote:
>>
>>> On Tue, Apr 29, 2025 at 7:51 PM Joey Tran <joey.t...@schrodinger.com> wrote:
>>> >
>>> > Does cloudpickle make --save_main_session unnecessary? As in, will
>>> > more transforms defined in __main__ "just work"?
>>>
>>> Yes. Or at least it "just works" much more often. (There may still be
>>> corner cases, but I haven't run into them...)
>>>
>>> I, for one, am excited to see this change. Thanks, Claude, for taking
>>> the lead on this.
>>>
>>> > If so, I can see why that's worthwhile. I've had a _ton_ of issues
>>> > with this, especially with new users of beam at my company. Explaining main
>>> > session and why random things throw unpickling errors or why their
>>> > transform is throwing Name errors has been a very painful experience,
>>> > especially since it usually happens during users' first experiences.
>>> >
>>> > On Tue, Apr 29, 2025, 6:14 PM Valentyn Tymofieiev via dev <dev@beam.apache.org> wrote:
>>> >>
>>> >> There are several reasons:
>>> >> - wide adoption in the data processing community, see initial discussion: [1]
>>> >> - expectations on cloudpickle having a larger number of maintainers
>>> >>   and contributors.
>>> >> - new releases of dill had breaking changes[2], which made adoption
>>> >>   of a new version challenging.
>>> >> - cloudpickle is easier to vendor - it is a single file and, unlike
>>> >>   dill, does not create side-effects in the global namespace, which might
>>> >>   conflict with any unvendored version. Vendoring eliminates a
>>> >>   common failure mode where the pickler library is different at submission and
>>> >>   runtime.
>>> >> - previously, some bugs and feature requests Beam requested in dill
>>> >>   took a long time to be implemented and released.
>>> >>
>>> >> [1] https://lists.apache.org/thread/dvxvclhok0fx48955x6szvw4kotxh87n
>>> >> [2] https://github.com/apache/beam/issues/22893#issuecomment-1502354194
>>> >>
>>> >> On Mon, Apr 28, 2025 at 4:00 PM Joey Tran <joey.t...@schrodinger.com> wrote:
>>> >>>
>>> >>> Naive question, but why is beam upgrading to cloudpickle?
>>> >>>
>>> >>> I saw this doc:
>>> >>> https://docs.google.com/document/d/1G5Q0ckX5sKQRQD1yEkLCPQL7N6B-AL9Cb1p0zlOOfQU/edit?tab=t.0
>>> >>>
>>> >>> Is the main reason because cloudpickle is more actively maintained?
>>> >>>
>>> >>> On Mon, Apr 28, 2025 at 6:51 PM Claudius van der Merwe <claud...@vdmza.com> wrote:
>>> >>>>
>>> >>>> Hi Beam Devs,
>>> >>>>
>>> >>>> I am making progress on making cloudpickle the default pickling
>>> >>>> library and removing the strict dependency on dill as outlined in
>>> >>>> https://s.apache.org/beam-cloudpickle-next-steps.
>>> >>>>
>>> >>>> The current plan is to:
>>> >>>>
>>> >>>> 1. Make cloudpickle the default library in the Beam 2.65.0 release (see
>>> >>>>    https://github.com/apache/beam/pull/34695). Users will be able to
>>> >>>>    specify pickle_library='dill' without any additional requirements. There
>>> >>>>    will still be a hard dependency on dill (blocked by #2) but it is a step in
>>> >>>>    the right direction.
>>> >>>>
>>> >>>> 2. Remove the strict dependency on dill in the Beam 2.66.0 release.
>>> >>>>    Dill is directly used for the coder's encoding of types in FastPrimitivesCoderImpl
>>> >>>>    [1][2]. I prefer to submit a fix for this after the branch cut so we have
>>> >>>>    more time to identify any issues.
>>> >>>>
>>> >>>> Cloudpickle has some fundamentally different pickling behavior from
>>> >>>> dill that is likely to break:
>>> >>>>
>>> >>>> - Unit tests that rely on globals
>>> >>>>   This can be fixed by using apache_beam.utils.shared [3]
>>> >>>>
>>> >>>> - Closures and dynamic classes that reference unpicklable globals
>>> >>>>   This can be fixed by defining functions in the top level, and using
>>> >>>>   functools.partial to bind parameters if necessary
>>> >>>>
>>> >>>> [1] https://github.com/apache/beam/blob/b9fa49a9827dd28349e382f479ebd1a8bbe27d07/sdks/python/apache_beam/coders/coder_impl.py#L529
>>> >>>> [2] https://github.com/apache/beam/blob/b9fa49a9827dd28349e382f479ebd1a8bbe27d07/sdks/python/apache_beam/coders/coder_impl.py#L595
>>> >>>> [3] https://github.com/apache/beam/blob/b9fa49a9827dd28349e382f479ebd1a8bbe27d07/sdks/python/apache_beam/internal/cloudpickle_pickler_test.py#L54
>>> >>>>
>>> >>>> I'd appreciate any feedback or concerns.
>>> >>>>
>>> >>>> Best,
>>> >>>>
>>> >>>> Claude
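
For anyone skimming the thread, the workaround mentioned in the original message (define functions at the top level and bind parameters with functools.partial instead of using a closure) might look roughly like the sketch below. The names `scale` and `make_scaler` are made up for illustration and are not from Beam, and the standard-library `pickle` module stands in for Beam's pickler only so the snippet runs without extra dependencies; behavior with cloudpickle or dill inside a real pipeline may differ.

```python
import functools
import pickle


def scale(value, factor):
    """Hypothetical top-level function; picklers can refer to it by module and name."""
    return value * factor


def make_scaler(factor):
    """Bind `factor` with functools.partial rather than a nested closure.

    The resulting object carries only a reference to `scale` plus the bound
    argument, so nothing from an enclosing scope is captured implicitly.
    """
    return functools.partial(scale, factor=factor)


if __name__ == "__main__":
    triple = make_scaler(3)
    # Round-trip through the stdlib pickler as a stand-in (an assumption of
    # this sketch, not what Beam does internally).
    restored = pickle.loads(pickle.dumps(triple))
    print(restored(14))  # 42
```

In a pipeline, the partial would presumably be passed wherever a callable is accepted, for example to a Map transform, in place of a lambda or nested function that closes over surrounding state.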