Hey all,

I wrote up a quick update on the status of replacing dill:
https://docs.google.com/document/d/1XypNkB0ujc-U2hy9PuJNYj6asY3tZyGoKvfFDbRux_Y/edit?usp=sharing

There is one remaining blocker (dill is used to deterministically encode
special types by default) that I discuss further in
https://docs.google.com/document/d/1nJLdUMdgLM4MfUOQBDZQfPZWl4LE2wze1Acq6Dn2Ts4/edit?usp=sharing

Please take a look, especially at the second doc. I'd appreciate any
feedback to get this moving forward.

Best,
Claude


On Wed, Apr 30, 2025 at 12:54 PM Valentyn Tymofieiev via dev <
dev@beam.apache.org> wrote:

> Ah yes, and no more saving the main session :)
>
> > FWIW - I noticed that the DataFlow Options documentation[1] for setting
> the pickling library and the Beam documentation
>
> Thanks for bringing it up. The doc is outdated, the issue was fixed in
> https://github.com/apache/beam/issues/21615 .
>
> On Wed, Apr 30, 2025 at 6:28 AM Joey Tran <joey.t...@schrodinger.com>
> wrote:
>
>> Wow this is fantastic! I tested it out and it worked great for my runner.
>> I am also excited for this change now and will eagerly set `cloudpickle` as
>> the default pickler for our code.
>>
>> FWIW - I noticed that the DataFlow Options documentation[1] for setting
>> the pickling library and the Beam documentation [2] for setting the
>> pickling library differ slightly. The DataFlow documentation mentions
>> having to set `pickler.set_library(pickler.USE_CLOUDPICKLE)` while the Beam
>> documentation doesn't say anything about that. It turned out unnecessary
>> for my runner - not sure if it's just a DataFlow runner specific
>> requirement, but just wanted to point it out.
>>
>> [1]
>> https://cloud.google.com/dataflow/docs/reference/pipeline-options#pythonyaml
>> [2]
>> https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#pickling-and-managing-the-main-session
>>
>> On Wed, Apr 30, 2025 at 2:38 AM Robert Bradshaw <rober...@waymo.com>
>> wrote:
>>
>>> On Tue, Apr 29, 2025 at 7:51 PM Joey Tran <joey.t...@schrodinger.com>
>>> wrote:
>>> >
>>> > Does cloudpickle make --save_main_session unnecessary? As in, will
>>> more transforms defined in __main__ "just work"?
>>>
>>> Yes. Or at least it "just works" much more often. (There may still be
>>> corner cases, but I haven't run into them...)
>>>
>>> I, for one, am excited to see this change. Thanks, Claude, for taking
>>> the lead on this.
>>>
>>> > If so, I can see why that's worthwhile. I've had a _ton_ of issues
>>> with this, especially with new users of beam at my company. Explaining main
>>> session and why random things throw unpickling errors or why their
>>> transform is throwing Name errors has been a very painful experience,
especially since it usually happens with users' first experiences.
>>> >
>>> > On Tue, Apr 29, 2025, 6:14 PM Valentyn Tymofieiev via dev <
>>> dev@beam.apache.org> wrote:
>>> >>
>>> >> There are several reasons:
>>> >>  - wide adoption in the data processing community, see initial
>>> discussion: [1]
>>> >>  - expectations on cloudpickle having a larger number of maintainers
>>> and contributors.
>>> >>  - new releases of dill had breaking changes[2], which made adoption
>>> of a new version challenging.
>>> >>  - cloudpickle is easier to vendor - it is a single file and unlike
>>> dill, does not create side-effects in the global namespace, which might
conflict with any unvendored version. Vendoring eliminates a
>>> common failure mode when the pickler library is different at submission and
>>> runtime.
>>> >>  - previously, some bugs and feature requests Beam requested in dill
>>> took a long time to be implemented and released.
>>> >>
>>> >> [1] https://lists.apache.org/thread/dvxvclhok0fx48955x6szvw4kotxh87n
>>> >> [2]
>>> https://github.com/apache/beam/issues/22893#issuecomment-1502354194
>>> >>
>>> >> On Mon, Apr 28, 2025 at 4:00 PM Joey Tran <joey.t...@schrodinger.com>
>>> wrote:
>>> >>>
>>> >>> Naive question, but why is beam upgrading to cloudpickle?
>>> >>>
>>> >>> I saw this doc:
>>> >>>
>>> https://docs.google.com/document/d/1G5Q0ckX5sKQRQD1yEkLCPQL7N6B-AL9Cb1p0zlOOfQU/edit?tab=t.0
>>> >>>
>>> >>> Is the main reason because cloudpickle is more actively maintained?
>>> >>>
>>> >>>
>>> >>> On Mon, Apr 28, 2025 at 6:51 PM Claudius van der Merwe <
>>> claud...@vdmza.com> wrote:
>>> >>>>
>>> >>>> Hi Beam Devs,
>>> >>>>
>>> >>>>
>>> >>>> I am making progress on making cloudpickle the default pickling
>>> library and removing the strict dependency on dill as outlined in
>>> https://s.apache.org/beam-cloudpickle-next-steps.
>>> >>>>
>>> >>>>
>>> >>>> The current plan is to:
>>> >>>>
>>> >>>>
>>> >>>> 1. Make cloudpickle the default library in Beam 2.65.0 release (see
>>> https://github.com/apache/beam/pull/34695). Users will be able to
>>> specify pickle_library='dill' without any additional requirements. There
>>> will still be a hard dependency on dill (blocked by #2) but it is a step in
>>> the right direction.
>>> >>>>
>>> >>>>
>>> >>>> 2. Remove the strict dependency on dill in Beam 2.66.0 release.
>>> Dill is directly used for coder's encoding types in FastPrimitivesCoderImpl
>>> [1][2]. I prefer to submit a fix for this after the branch cut so we have
>>> more time to identify any issues.
>>> >>>>
>>> >>>>
>>> >>>> Cloudpickle has some fundamentally different pickling behavior to
>>> dill that is likely to break:
>>> >>>>
>>> >>>> Unittests that rely on globals
>>> >>>>
>>> >>>> This can be fixed by using apache_beam.utils.shared [3]
>>> >>>>
>>> >>>> Closures and dynamic classes that reference unpicklable globals
>>> >>>>
>>> >>>> This can be fixed by defining functions in the top level, and using
>>> functools.partial to bind parameters if necessary
>>> >>>>
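
A minimal sketch of the functools.partial suggestion above, for illustration
only. The names `scale`, `scale_by_3`, and `round_trip` are hypothetical, and
the standard-library `pickle` is used as a stand-in so the snippet has no
dependencies; cloudpickle and Beam's own pickler wrappers may behave
differently in the details. The idea is that the function lives at the top
level and its parameter is bound explicitly, rather than being captured from
an enclosing scope or a global.

```python
import functools
import pickle


def scale(value, factor):
    """Top-level function: it has a stable module-level name."""
    return value * factor


# Bind the parameter explicitly instead of capturing it in a closure.
# The bound value travels with the partial object itself.
scale_by_3 = functools.partial(scale, factor=3)


def round_trip(obj):
    """Serialize and deserialize obj (stdlib pickle as a stand-in)."""
    return pickle.loads(pickle.dumps(obj))


if __name__ == "__main__":
    # The partial survives a pickle round trip because `scale` can be
    # looked up by name and `factor` is stored on the partial.
    restored = round_trip(scale_by_3)
    assert restored(5) == 15
```

In a pipeline, something like `scale_by_3` would be passed where a lambda or
nested function was used before, so nothing unpicklable gets pulled in through
the function's enclosing scope.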
>>> >>>>
>>> >>>> [1]
>>> https://github.com/apache/beam/blob/b9fa49a9827dd28349e382f479ebd1a8bbe27d07/sdks/python/apache_beam/coders/coder_impl.py#L529
>>> >>>>
>>> >>>> [2]
>>> https://github.com/apache/beam/blob/b9fa49a9827dd28349e382f479ebd1a8bbe27d07/sdks/python/apache_beam/coders/coder_impl.py#L595
>>> >>>>
>>> >>>> [3]
>>> https://github.com/apache/beam/blob/b9fa49a9827dd28349e382f479ebd1a8bbe27d07/sdks/python/apache_beam/internal/cloudpickle_pickler_test.py#L54
>>> >>>>
>>> >>>>
>>> >>>> I'd appreciate any feedback or concerns.
>>> >>>>
>>> >>>>
>>> >>>> Best,
>>> >>>>
>>> >>>> Claude
>>> >>>>
>>> >>>>
>>>
>>
