On Thu, Oct 26, 2023 at 3:59 AM Johanna Öjeling <joha...@ojeling.net> wrote:
>
> Hi,
>
> I like this idea of making it easier to push out improvements, and had a look 
> at the PR.
>
> One question to better understand how it works today:
>
> The upgrades that the runners do, such as those not visible to the user, can 
> they be initiated at any time or do they only happen in relation to that the 
> user updates the running pipeline e.g. with new user code?

The latter. We're talking about user-initiated changes to their pipeline here.

> And, assuming the former, some reflections that came to mind when reviewing 
> the changes:
>
> Will the update_compatibility_version option be effective both when creating 
> and updating a pipeline? It is grouped with the update options in the Python 
> SDK, but users may want to configure the compatibility already when launching 
> the pipeline.

It will be effective for both, though generally there's little
motivation to not always use the "latest" version when creating a new
pipeline.

> Would it be possible to revert setting a fixed prior version, i.e. 
> (re-)enable upgrades?

The contract would be IF you start with version X (which logically
defaults to the current SDK), THEN all updates also setting this to
version X (even on SDKs > X) should work.
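
To make the contract concrete, here is a minimal sketch of how a transform
might gate its behavior on this option. It is not the code from the PR: the
names CURRENT_SDK_VERSION, parse_version, effective_version and
choose_implementation are hypothetical, as is the assumption that some
transform changed incompatibly in 2.51.0.

```python
from typing import Optional, Tuple

# Hypothetical: the SDK version this sketch pretends to be running.
CURRENT_SDK_VERSION = "2.52.0"


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '2.50.0' into (2, 50, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def effective_version(update_compatibility_version: Optional[str]) -> Tuple[int, ...]:
    """An unset option (None) is treated as the current SDK version."""
    return parse_version(update_compatibility_version or CURRENT_SDK_VERSION)


def choose_implementation(update_compatibility_version: Optional[str]) -> str:
    """Pick the old or new expansion of a hypothetical transform.

    Assumes, for illustration only, that the transform changed
    incompatibly in 2.51.0.
    """
    if effective_version(update_compatibility_version) < parse_version("2.51.0"):
        return "legacy"
    return "improved"
```

A pipeline launched on 2.50.0 and later updated from a newer SDK with the
option pinned to "2.50.0" keeps getting the "legacy" expansion, while a
pipeline that leaves the option unset picks up "improved".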

> If yes: in practice, would this motivate another option, or passing a value 
> like "auto" or "latest" to update_compatibility_version?

Unset is interpreted as latest. Auto is hard, because it would involve
querying the runner before pipeline construction, and we may not even
know what the runner is at this point. (Eventually we could do things
like embed both alternatives into the graph and let the runner choose,
but this is more speculative and may not be as scalable.)

> The option is being introduced to the Java and Python SDKs. Should this also 
> be applicable to the Go SDK?

Yes, allowing setting this value should be done for Go (and
TypeScript, and future SDKs) too. As Robert Burke mentioned, we need
to respect the value in those SDKs that have expansion service
implementations first.

> On Thu, Oct 26, 2023 at 2:25 AM Robert Bradshaw via dev <dev@beam.apache.org> 
> wrote:
>>
>> Dataflow (among other runners) has the ability to "upgrade" running
>> pipelines with new code (e.g. capturing bug fixes, dependency updates,
>> and limited topology changes). Unfortunately some improvements (e.g.
>> new and improved ways of writing to BigQuery, optimized use of side
>> inputs, a change in algorithm, sometimes completely internally and not
>> visible to the user) are not sufficiently backwards compatible which
>> causes us, with the motivation to not break users, to either not make
>> these changes or guard them as a parallel opt-in mode which is a
>> significant drain on both developer productivity and causes new
>> pipelines to run in obsolete modes by default.
>>
>> I created https://github.com/apache/beam/pull/29140 which adds a new
>> pipeline option, update_compatibility_version, that allows the SDK to
>> move forward while letting users with pipelines launched previously to
>> manually request the "old" way of doing things to preserve update
>> compatibility. (We should still attempt backwards compatibility when
>> it makes sense, and the old way would remain in code until such a time
>> it's actually deprecated and removed, but this means we won't be
>> constrained by it, especially when it comes to default settings.)
>>
>> Any objections or other thoughts on this approach?
>>
>> - Robert
>>
>> P.S. Separately I think it'd be valuable to elevate the vague notion
>> of update compatibility to a first-class Beam concept and put it on
>> firm footing, but that's a larger conversation outside the thread of
>> this smaller (and I think still useful in such a future world) change.
