Re: Streaming update compatibility

2023-10-30 Thread Kenneth Knowles
+1 million to this.

I think this could be a real game-changer. I would even more forcefully say
that update compatibility has pushed our development style into "never make
significant changes" or "every significant change is wildly more complex
than it should be" territory. It forces our first draft to be our final
draft, much more so than abstraction-based backwards compatibility does,
because it requires freezing many implementation details as well.

And just to put more non-subjective data behind my +1, I have used this
approach many times in situations where a new version of a service rolled
out while still serving older clients (using URL as the flag). It is a
tried-and-true technique and connecting it to Beam is like an epiphany.
Hooray!

The easiest way to ensure clean code is to make older versions more like
straight-line code, flattening out cyclomatic complexity by forking
transforms at the top level. In other words, FooIO.read() immediately
delegates to FooIO_2_48.read(). You shouldn't be checking this flag in a
bunch of separate places inside an IO. In fact, I might say that should be
largely forbidden and it should only be used as a "routing" flag.
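
A minimal sketch of that routing pattern, with hypothetical names (FooIO,
the per-version helpers, and the way the flag is threaded through are all
illustrative assumptions, not actual Beam code):

    # Hypothetical sketch of a top-level "routing" flag check.
    def _version_tuple(v):
        return tuple(int(x) for x in v.split("."))

    class FooIO:
        @staticmethod
        def read(update_compatibility_version=None):
            # The only place the flag is consulted: route once at the top
            # level, and never branch on it again inside the implementations.
            v = update_compatibility_version
            if v is not None and _version_tuple(v) < _version_tuple("2.49.0"):
                return _read_2_48()   # frozen legacy path, update-compatible
            return _read_current()    # default: the latest behavior

    def _read_2_48():
        ...  # the 2.48.x implementation, preserved as straight-line code

    def _read_current():
        ...  # the current implementation, free to evolve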

Kenn

On Wed, Oct 25, 2023 at 8:25 PM Robert Bradshaw via dev 
wrote:

> Dataflow (among other runners) has the ability to "upgrade" running
> pipelines with new code (e.g. capturing bug fixes, dependency updates,
> and limited topology changes). Unfortunately some improvements (e.g.
> new and improved ways of writing to BigQuery, optimized use of side
> inputs, or a change in algorithm, sometimes completely internal and
> not visible to the user) are not sufficiently backwards compatible.
> This causes us, with the motivation of not breaking users, to either
> not make these changes or guard them behind a parallel opt-in mode,
> which is a significant drain on developer productivity and causes new
> pipelines to run in obsolete modes by default.
>
> I created https://github.com/apache/beam/pull/29140 which adds a new
> pipeline option, update_compatibility_version, that allows the SDK to
> move forward while letting users with previously launched pipelines
> manually request the "old" way of doing things to preserve update
> compatibility. (We should still attempt backwards compatibility when
> it makes sense, and the old way would remain in the code until it is
> actually deprecated and removed, but this means we won't be
> constrained by it, especially when it comes to default settings.)
>
> Any objections or other thoughts on this approach?
>
> - Robert
>
> P.S. Separately I think it'd be valuable to elevate the vague notion
> of update compatibility to a first-class Beam concept and put it on
> firm footing, but that's a larger conversation outside the thread of
> this smaller (and I think still useful in such a future world) change.
>


Re: Streaming update compatibility

2023-10-27 Thread Robert Burke
On Fri, Oct 27, 2023, 9:09 AM Robert Bradshaw via dev 
wrote:

> On Fri, Oct 27, 2023 at 7:50 AM Kellen Dye via dev 
> wrote:
> >
> > > Auto is hard, because it would involve
> > > querying the runner before pipeline construction, and we may not even
> > > know what the runner is at this point
> >
> > At the point where pipeline construction will start, you should have
> access to the pipeline arguments and be able to determine the runner. What
> seems to be missing is a place to query the runner pre-construction. If
> that query could return metadata about the currently running version of the
> job, then that could be incorporated into graph construction as necessary.
>
> While this is the common case, it is not true in general. For example,
> it's possible to cache the pipeline proto and submit it to a separate
> choice of runner later. We have Jobs API implementations that
> forward/proxy the job to other runners, and the Python interactive
> runner is another example where the runner is late-binding (e.g. one
> tries a sample locally and, if all looks good, executes remotely; in
> this case the graph that's submitted is often mutated before running).
>
> Also, in the spirit of the portability story, the pipeline definition
> itself should be runner-independent.
>
> That same hook could be a place to, for example, return the
> currently-running job graph for pre-submission compatibility checks.
>
> I suppose we could add something to the Jobs API to make "looking up a
> previous version of this pipeline" runner-agnostic, though that
> assumes it's available at construction time.


As I pointed out, we can access a given pipeline via the Job Management
API. It's already runner-agnostic for everything other than Dataflow.

Operationally, though, we'd need to provide the option to "dry run" an
update locally, or to validate update compatibility against a given
pipeline proto.

> And +1: as Kellen says, we
> should define (and be able to check) what pipeline compatibility means
> via a graph-to-graph comparison at the Beam level. I'll defer both
> of these as future work as part of the "make update a portable Beam
> concept" project.
>

Big +1 to that. It's hard to know what to check for without defining it.
This would also avoid needing to ask a given runner about dry-run updates.

It's on a longer-term plan, but I have been intending to add Pipeline
Update as a feature to Prism. As Prism becomes more fully featured, it
becomes a great test bed for developing the definitions.

>


Re: Streaming update compatibility

2023-10-27 Thread Robert Bradshaw via dev
On Fri, Oct 27, 2023 at 7:50 AM Kellen Dye via dev  wrote:
>
> > Auto is hard, because it would involve
> > querying the runner before pipeline construction, and we may not even
> > know what the runner is at this point
>
> At the point where pipeline construction will start, you should have access 
> to the pipeline arguments and be able to determine the runner. What seems to 
> be missing is a place to query the runner pre-construction. If that query 
> could return metadata about the currently running version of the job, then 
> that could be incorporated into graph construction as necessary.

While this is the common case, it is not true in general. For example,
it's possible to cache the pipeline proto and submit it to a separate
choice of runner later. We have Jobs API implementations that
forward/proxy the job to other runners, and the Python interactive
runner is another example where the runner is late-binding (e.g. one
tries a sample locally and, if all looks good, executes remotely; in
this case the graph that's submitted is often mutated before running).

Also, in the spirit of the portability story, the pipeline definition
itself should be runner-independent.

> That same hook could be a place to, for example, return the
> currently-running job graph for pre-submission compatibility checks.

I suppose we could add something to the Jobs API to make "looking up a
previous version of this pipeline" runner-agnostic, though that
assumes it's available at construction time. And +1: as Kellen says, we
should define (and be able to check) what pipeline compatibility means
via a graph-to-graph comparison at the Beam level. I'll defer both
of these as future work as part of the "make update a portable Beam
concept" project.


Re: Streaming update compatibility

2023-10-27 Thread Kellen Dye via dev
In Spotify's case we deploy streaming jobs via CI, and would ideally verify
compatibility as part of the build process, before submitting to Dataflow.
Perhaps this could be decoupled from the _running_ pipeline if we had a
cache of previous pipeline versions.

Currently the user experience is poor because the merge of a change can end
up distant in time from the job submission and its failure due to
incompatibility. There's the additional friction of a few layers of
infrastructure between the developer and the job failure logs, which means
we can't trivially deliver the failure reason to the developer.



On Fri, Oct 27, 2023 at 11:07 AM Robert Burke  wrote:

> You raise a very good point:
>
>
> https://github.com/apache/beam/blob/master/model/job-management/src/main/proto/org/apache/beam/model/job_management/v1/beam_job_api.proto#L54
>
> The job management API does allow for the pipeline proto to be returned,
> so one could get the live job's graph and the SDK could make reasonable
> decisions before sending to the runner.
>
> Dataflow does have a similar API that can be adapted.
>
> I am a touch concerned about spreading the update compatibility checks
> around between SDKs and runners, though. But in some cases it would be
> easier for the SDK, e.g. to ensure VersionA of a transform is used vs.
> VersionB, based on the existing transforms used in the job being updated.
>
>
> On Fri, Oct 27, 2023, 7:50 AM Kellen Dye via dev 
> wrote:
>
>> > Auto is hard, because it would involve
>> > querying the runner before pipeline construction, and we may not even
>> > know what the runner is at this point
>>
>> At the point where pipeline construction will start, you should have
>> access to the pipeline arguments and be able to determine the runner. What
>> seems to be missing is a place to query the runner pre-construction. If
>> that query could return metadata about the currently running version of the
>> job, then that could be incorporated into graph construction as necessary.
>>
>> That same hook could be a place to, for example, return the
>> currently-running job graph for pre-submission compatibility checks.
>>
>>
>>


Re: Streaming update compatibility

2023-10-27 Thread Robert Burke
You raise a very good point:


https://github.com/apache/beam/blob/master/model/job-management/src/main/proto/org/apache/beam/model/job_management/v1/beam_job_api.proto#L54

The job management API does allow for the pipeline proto to be returned,
so one could get the live job's graph and the SDK could make reasonable
decisions before sending to the runner.
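
For concreteness, here is a hedged sketch of fetching a running job's
pipeline proto over that API (the endpoint and job id are placeholders):

    # Sketch: fetch a running job's pipeline proto via the portable Job
    # Management API's GetPipeline call.
    import grpc
    from apache_beam.portability.api import beam_job_api_pb2
    from apache_beam.portability.api import beam_job_api_pb2_grpc

    channel = grpc.insecure_channel("localhost:8073")  # assumed job server
    stub = beam_job_api_pb2_grpc.JobServiceStub(channel)
    response = stub.GetPipeline(
        beam_job_api_pb2.GetJobPipelineRequest(job_id="my-streaming-job"))
    running_pipeline = response.pipeline  # beam_runner_api_pb2.Pipeline proto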

Dataflow does have a similar API that can be adapted.

I am a touch concerned about spreading the update compatibility checks
around between SDKs and runners, though. But in some cases it would be
easier for the SDK, e.g. to ensure VersionA of a transform is used vs.
VersionB, based on the existing transforms used in the job being updated.


On Fri, Oct 27, 2023, 7:50 AM Kellen Dye via dev 
wrote:

> > Auto is hard, because it would involve
> > querying the runner before pipeline construction, and we may not even
> > know what the runner is at this point
>
> At the point where pipeline construction will start, you should have
> access to the pipeline arguments and be able to determine the runner. What
> seems to be missing is a place to query the runner pre-construction. If
> that query could return metadata about the currently running version of the
> job, then that could be incorporated into graph construction as necessary.
>
> That same hook could be a place to, for example, return the
> currently-running job graph for pre-submission compatibility checks.
>
>
>


Re: Streaming update compatibility

2023-10-27 Thread Kellen Dye via dev
> Auto is hard, because it would involve
> querying the runner before pipeline construction, and we may not even
> know what the runner is at this point

At the point where pipeline construction will start, you should have access
to the pipeline arguments and be able to determine the runner. What seems
to be missing is a place to query the runner pre-construction. If that
query could return metadata about the currently running version of the job,
then that could be incorporated into graph construction as necessary.

That same hook could be a place to, for example, return the
currently-running job graph for pre-submission compatibility checks.


Re: Streaming update compatibility

2023-10-26 Thread Johanna Öjeling via dev
Alright, that makes it clearer. Thank you for your answers!

On Thu, Oct 26, 2023, 20:36 Robert Bradshaw  wrote:

> On Thu, Oct 26, 2023 at 3:59 AM Johanna Öjeling 
> wrote:
> >
> > Hi,
> >
> > I like this idea of making it easier to push out improvements, and had a
> look at the PR.
> >
> > One question to better understand how it works today:
> >
> > The upgrades that the runners do, such as those not visible to the user:
> can they be initiated at any time, or do they only happen when the user
> updates the running pipeline, e.g. with new user code?
>
> Correct. We're talking about user-initiated changes to their pipeline here.
>
> > And, assuming the former, some reflections that came to mind when
> reviewing the changes:
> >
> > Will the update_compatibility_version option be effective both when
> creating and updating a pipeline? It is grouped with the update options in
> the Python SDK, but users may want to configure the compatibility already
> when launching the pipeline.
>
> It will be effective for both, though generally there's little
> motivation to not always use the "latest" version when creating a new
> pipeline.
>
> > Would it be possible to revert setting a fixed prior version, i.e.
> (re-)enable upgrades?
>
> The contract would be IF you start with version X (which logically
> defaults to the current SDK), THEN all updates also setting this to
> version X (even on SDKs > X) should work.
>
> > If yes: in practice, would this motivate another option, or passing a
> value like "auto" or "latest" to update_compatibility_version?
>
> Unset is interpreted as latest. Auto is hard, because it would involve
> querying the runner before pipeline construction, and we may not even
> know what the runner is at this point. (Eventually we could do things
> like embed both alternatives into the graph and let the runner choose,
> but this is more speculative and may not be as scalable.)
>
> > The option is being introduced to the Java and Python SDKs. Should this
> also be applicable to the Go SDK?
>
> Yes, allowing setting this value should be done for Go (and
> typescript, and future SDKs) too. As Robert Burke mentioned, we need
> to respect the value in those SDKs that have expansion service
> implementations first.
>
> > On Thu, Oct 26, 2023 at 2:25 AM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
> >>
> >> Dataflow (among other runners) has the ability to "upgrade" running
> >> pipelines with new code (e.g. capturing bug fixes, dependency updates,
> >> and limited topology changes). Unfortunately some improvements (e.g.
> >> new and improved ways of writing to BigQuery, optimized use of side
> >> inputs, or a change in algorithm, sometimes completely internal and
> >> not visible to the user) are not sufficiently backwards compatible.
> >> This causes us, with the motivation of not breaking users, to either
> >> not make these changes or guard them behind a parallel opt-in mode,
> >> which is a significant drain on developer productivity and causes new
> >> pipelines to run in obsolete modes by default.
> >>
> >> I created https://github.com/apache/beam/pull/29140 which adds a new
> >> pipeline option, update_compatibility_version, that allows the SDK to
> >> move forward while letting users with previously launched pipelines
> >> manually request the "old" way of doing things to preserve update
> >> compatibility. (We should still attempt backwards compatibility when
> >> it makes sense, and the old way would remain in the code until it is
> >> actually deprecated and removed, but this means we won't be
> >> constrained by it, especially when it comes to default settings.)
> >>
> >> Any objections or other thoughts on this approach?
> >>
> >> - Robert
> >>
> >> P.S. Separately I think it'd be valuable to elevate the vague notion
> >> of update compatibility to a first-class Beam concept and put it on
> >> firm footing, but that's a larger conversation outside the thread of
> >> this smaller (and I think still useful in such a future world) change.
>


Re: Streaming update compatibility

2023-10-26 Thread Robert Bradshaw via dev
On Thu, Oct 26, 2023 at 3:59 AM Johanna Öjeling  wrote:
>
> Hi,
>
> I like this idea of making it easier to push out improvements, and had a look 
> at the PR.
>
> One question to better understand how it works today:
>
> The upgrades that the runners do, such as those not visible to the user: can
> they be initiated at any time, or do they only happen when the user updates
> the running pipeline, e.g. with new user code?

Correct. We're talking about user-initiated changes to their pipeline here.

> And, assuming the former, some reflections that came to mind when reviewing 
> the changes:
>
> Will the update_compatibility_version option be effective both when creating 
> and updating a pipeline? It is grouped with the update options in the Python 
> SDK, but users may want to configure the compatibility already when launching 
> the pipeline.

It will be effective for both, though generally there's little
motivation to not always use the "latest" version when creating a new
pipeline.

> Would it be possible to revert setting a fixed prior version, i.e. 
> (re-)enable upgrades?

The contract would be IF you start with version X (which logically
defaults to the current SDK), THEN all updates also setting this to
version X (even on SDKs > X) should work.

> If yes: in practice, would this motivate another option, or passing a value 
> like "auto" or "latest" to update_compatibility_version?

Unset is interpreted as latest. Auto is hard, because it would involve
querying the runner before pipeline construction, and we may not even
know what the runner is at this point. (Eventually we could do things
like embed both alternatives into the graph and let the runner choose,
but this is more speculative and may not be as scalable.)

> The option is being introduced to the Java and Python SDKs. Should this also 
> be applicable to the Go SDK?

Yes, allowing setting this value should be done for Go (and
typescript, and future SDKs) too. As Robert Burke mentioned, we need
to respect the value in those SDKs that have expansion service
implementations first.

> On Thu, Oct 26, 2023 at 2:25 AM Robert Bradshaw via dev  
> wrote:
>>
>> Dataflow (among other runners) has the ability to "upgrade" running
>> pipelines with new code (e.g. capturing bug fixes, dependency updates,
>> and limited topology changes). Unfortunately some improvements (e.g.
>> new and improved ways of writing to BigQuery, optimized use of side
>> inputs, or a change in algorithm, sometimes completely internal and
>> not visible to the user) are not sufficiently backwards compatible.
>> This causes us, with the motivation of not breaking users, to either
>> not make these changes or guard them behind a parallel opt-in mode,
>> which is a significant drain on developer productivity and causes new
>> pipelines to run in obsolete modes by default.
>>
>> I created https://github.com/apache/beam/pull/29140 which adds a new
>> pipeline option, update_compatibility_version, that allows the SDK to
>> move forward while letting users with previously launched pipelines
>> manually request the "old" way of doing things to preserve update
>> compatibility. (We should still attempt backwards compatibility when
>> it makes sense, and the old way would remain in the code until it is
>> actually deprecated and removed, but this means we won't be
>> constrained by it, especially when it comes to default settings.)
>>
>> Any objections or other thoughts on this approach?
>>
>> - Robert
>>
>> P.S. Separately I think it'd be valuable to elevate the vague notion
>> of update compatibility to a first-class Beam concept and put it on
>> firm footing, but that's a larger conversation outside the thread of
>> this smaller (and I think still useful in such a future world) change.


Re: Streaming update compatibility

2023-10-26 Thread Robert Burke
Regarding 3: I suspect Go wasn't changed because the PR centers on
implementations of the expansion service server, not client callers. The Go
SDK doesn't yet have an expansion service.

On Thu, Oct 26, 2023, 3:59 AM Johanna Öjeling via dev 
wrote:

> Hi,
>
> I like this idea of making it easier to push out improvements, and had a
> look at the PR.
>
> One question to better understand how it works today:
>
>    1. The upgrades that the runners do, such as those not visible to the
>    user: can they be initiated at any time, or do they only happen when the
>    user updates the running pipeline, e.g. with new user code?
>
> And, assuming the former, some reflections that came to mind when
> reviewing the changes:
>
>1. Will the update_compatibility_version option be effective both when
>creating and updating a pipeline? It is grouped with the update options in
>the Python SDK, but users may want to configure the compatibility already
>when launching the pipeline.
>2. Would it be possible to revert setting a fixed prior version, i.e.
>(re-)enable upgrades?
>   1. If yes: in practice, would this motivate another option, or
>   passing a value like "auto" or "latest" to update_compatibility_version?
>3. The option is being introduced to the Java and Python SDKs. Should
>this also be applicable to the Go SDK?
>
> Thanks,
> Johanna
>
> On Thu, Oct 26, 2023 at 2:25 AM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
>
>> Dataflow (among other runners) has the ability to "upgrade" running
>> pipelines with new code (e.g. capturing bug fixes, dependency updates,
>> and limited topology changes). Unfortunately some improvements (e.g.
>> new and improved ways of writing to BigQuery, optimized use of side
>> inputs, or a change in algorithm, sometimes completely internal and
>> not visible to the user) are not sufficiently backwards compatible.
>> This causes us, with the motivation of not breaking users, to either
>> not make these changes or guard them behind a parallel opt-in mode,
>> which is a significant drain on developer productivity and causes new
>> pipelines to run in obsolete modes by default.
>>
>> I created https://github.com/apache/beam/pull/29140 which adds a new
>> pipeline option, update_compatibility_version, that allows the SDK to
>> move forward while letting users with previously launched pipelines
>> manually request the "old" way of doing things to preserve update
>> compatibility. (We should still attempt backwards compatibility when
>> it makes sense, and the old way would remain in the code until it is
>> actually deprecated and removed, but this means we won't be
>> constrained by it, especially when it comes to default settings.)
>>
>> Any objections or other thoughts on this approach?
>>
>> - Robert
>>
>> P.S. Separately I think it'd be valuable to elevate the vague notion
>> of update compatibility to a first-class Beam concept and put it on
>> firm footing, but that's a larger conversation outside the thread of
>> this smaller (and I think still useful in such a future world) change.
>>
>


Re: Streaming update compatibility

2023-10-26 Thread Johanna Öjeling via dev
Hi,

I like this idea of making it easier to push out improvements, and had a
look at the PR.

One question to better understand how it works today:

   1. The upgrades that the runners do, such as those not visible to the
   user: can they be initiated at any time, or do they only happen when the
   user updates the running pipeline, e.g. with new user code?

And, assuming the former, some reflections that came to mind when reviewing
the changes:

   1. Will the update_compatibility_version option be effective both when
   creating and updating a pipeline? It is grouped with the update options in
   the Python SDK, but users may want to configure the compatibility already
   when launching the pipeline.
   2. Would it be possible to revert setting a fixed prior version, i.e.
   (re-)enable upgrades?
  1. If yes: in practice, would this motivate another option, or
  passing a value like "auto" or "latest" to update_compatibility_version?
   3. The option is being introduced to the Java and Python SDKs. Should
   this also be applicable to the Go SDK?

Thanks,
Johanna

On Thu, Oct 26, 2023 at 2:25 AM Robert Bradshaw via dev 
wrote:

> Dataflow (among other runners) has the ability to "upgrade" running
> pipelines with new code (e.g. capturing bug fixes, dependency updates,
> and limited topology changes). Unfortunately some improvements (e.g.
> new and improved ways of writing to BigQuery, optimized use of side
> inputs, or a change in algorithm, sometimes completely internal and
> not visible to the user) are not sufficiently backwards compatible.
> This causes us, with the motivation of not breaking users, to either
> not make these changes or guard them behind a parallel opt-in mode,
> which is a significant drain on developer productivity and causes new
> pipelines to run in obsolete modes by default.
>
> I created https://github.com/apache/beam/pull/29140 which adds a new
> pipeline option, update_compatibility_version, that allows the SDK to
> move forward while letting users with previously launched pipelines
> manually request the "old" way of doing things to preserve update
> compatibility. (We should still attempt backwards compatibility when
> it makes sense, and the old way would remain in the code until it is
> actually deprecated and removed, but this means we won't be
> constrained by it, especially when it comes to default settings.)
>
> Any objections or other thoughts on this approach?
>
> - Robert
>
> P.S. Separately I think it'd be valuable to elevate the vague notion
> of update compatibility to a first-class Beam concept and put it on
> firm footing, but that's a larger conversation outside the thread of
> this smaller (and I think still useful in such a future world) change.
>


Streaming update compatibility

2023-10-25 Thread Robert Bradshaw via dev
Dataflow (among other runners) has the ability to "upgrade" running
pipelines with new code (e.g. capturing bug fixes, dependency updates,
and limited topology changes). Unfortunately some improvements (e.g.
new and improved ways of writing to BigQuery, optimized use of side
inputs, or a change in algorithm, sometimes completely internal and
not visible to the user) are not sufficiently backwards compatible.
This causes us, with the motivation of not breaking users, to either
not make these changes or guard them behind a parallel opt-in mode,
which is a significant drain on developer productivity and causes new
pipelines to run in obsolete modes by default.

I created https://github.com/apache/beam/pull/29140 which adds a new
pipeline option, update_compatibility_version, that allows the SDK to
move forward while letting users with previously launched pipelines
manually request the "old" way of doing things to preserve update
compatibility. (We should still attempt backwards compatibility when
it makes sense, and the old way would remain in the code until it is
actually deprecated and removed, but this means we won't be
constrained by it, especially when it comes to default settings.)
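
For example, a minimal sketch of pinning this at submission time (the
runner, the flags other than the new option, and the version are
placeholders, not recommendations):

    # Sketch: request the 2.50.0-era behavior when updating a running job.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        "--runner=DataflowRunner",
        "--streaming",
        "--update",                               # updating an existing job,
        "--update_compatibility_version=2.50.0",  # keeping 2.50.0 behavior
    ])
    with beam.Pipeline(options=options) as p:
        _ = p | beam.Impulse()  # stand-in for the real pipeline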

Any objections or other thoughts on this approach?

- Robert

P.S. Separately I think it'd be valuable to elevate the vague notion
of update compatibility to a first-class Beam concept and put it on
firm footing, but that's a larger conversation outside the thread of
this smaller (and I think still useful in such a future world) change.