Re: Streaming update compatibility
+1 million to this. I think this could be a real game-changer. I would even more forcefully say that update compatibility has pushed our development style into either "never make significant changes" or "every significant change is wildly more complex than it should be". It forces our first draft to be our final draft, much more so than abstraction-based backwards compatibility, because it requires freezing many implementation details as well. And just to put more non-subjective data behind my +1: I have used this approach many times in situations where a new version of a service rolled out while still serving older clients (using the URL as the flag). It is a tried-and-true technique, and connecting it to Beam is like an epiphany. Hooray! The easiest way to ensure clean code is to make older versions more like straight-line code, flattening out cyclomatic complexity by forking transforms at the top level. In other words, FooIO.read() immediately delegates to FooIO_2_48.read(). You shouldn't be checking this flag at a bunch of separate places inside an IO. In fact I might say that should be largely forbidden and it should only be used as a "routing" flag. Kenn On Wed, Oct 25, 2023 at 8:25 PM Robert Bradshaw via dev wrote: > Dataflow (among other runners) has the ability to "upgrade" running > pipelines with new code (e.g. capturing bug fixes, dependency updates, > and limited topology changes). Unfortunately some improvements (e.g. > new and improved ways of writing to BigQuery, optimized use of side > inputs, a change in algorithm, sometimes completely internally and not > visible to the user) are not sufficiently backwards compatible which > causes us, with the motivation to not break users, to either not make > these changes or guard them as a parallel opt-in mode which is a > significant drain on both developer productivity and causes new > pipelines to run in obsolete modes by default. 
> > I created https://github.com/apache/beam/pull/29140 which adds a new > pipeline option, update_compatibility_version, that allows the SDK to > move forward while letting users with pipelines launched previously to > manually request the "old" way of doing things to preserve update > compatibility. (We should still attempt backwards compatibility when > it makes sense, and the old way would remain in code until such a time > it's actually deprecated and removed, but this means we won't be > constrained by it, especially when it comes to default settings.) > > Any objections or other thoughts on this approach? > > - Robert > > P.S. Separately I think it'd be valuable to elevate the vague notion > of update compatibility to a first-class Beam concept and put it on > firm footing, but that's a larger conversation outside the thread of > this smaller (and I think still useful in such a future world) change. >
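Kenn's "routing flag" pattern can be sketched in a few lines of plain Python. FooIO and FooIO_2_48 are the hypothetical names from his message, not real Beam classes, and the version-comparison logic is an illustrative assumption:

```python
# Sketch of the top-level "routing flag" pattern: consult the
# compatibility flag exactly once, then delegate to a frozen,
# straight-line implementation. All names here are hypothetical.

def _version_tuple(v):
    # "2.48.0" -> (2, 48, 0), so comparisons are numeric, not lexical.
    return tuple(int(part) for part in v.split("."))

class FooIO_2_48:
    """Frozen as of 2.48; never edited after the fork."""
    @staticmethod
    def read():
        return "expansion with 2.48 semantics"

class FooIO:
    @staticmethod
    def read(update_compatibility_version=None):
        # The ONLY place the flag is checked: pure routing, with no
        # version checks scattered through the body of the transform.
        if (update_compatibility_version is not None
                and _version_tuple(update_compatibility_version)[:2] <= (2, 48)):
            return FooIO_2_48.read()
        return "current expansion"
```

The point of the shape, as Kenn says, is that each forked version stays straight-line code: once FooIO_2_48 is split off, it is never touched again, and cyclomatic complexity lives only in the one top-level dispatch.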
Re: Streaming update compatibility
On Fri, Oct 27, 2023, 9:09 AM Robert Bradshaw via dev wrote: > On Fri, Oct 27, 2023 at 7:50 AM Kellen Dye via dev > wrote: > > > > > Auto is hard, because it would involve > > > querying the runner before pipeline construction, and we may not even > > > know what the runner is at this point > > > > At the point where pipeline construction will start, you should have > access to the pipeline arguments and be able to determine the runner. What > seems to be missing is a place to query the runner pre-construction. If > that query could return metadata about the currently running version of the > job, then that could be incorporated into graph construction as necessary. > > While this is the common case, it is not true in general. For example > it's possible to cache the pipeline proto and submit it to a separate > choice of runner later. We have Jobs API implementations that > forward/proxy the job to other runners, and the Python interactive > runner is another example where the runner is late-binding (e.g. one > tries a sample locally, and if all looks good can execute remotely, > and also in this case the graph that's submitted is often mutated > before running). > > Also, in the spirit of the portability story, the pipeline definition > itself should be runner-independent. > > > That same hook could be a place to for example return the > currently-running job graph for pre-submission compatibility checks. > > I suppose we could add something to the Jobs API to make "looking up a > previous version of this pipeline" runner-agnostic, though that > assumes it's available at construction time. As I pointed out, we can access a given pipeline via the job management API, and that API is already runner-agnostic everywhere except Dataflow. Operationally, though, we'd need to provide the option to "dry run" an update locally, or to validate update compatibility against a given pipeline proto. 
And +1 as Kellen says we > should define (and be able to check) what pipeline compatibility means > in a via graph-to-graph comparison at the Beam level. I'll defer both > of these as future work as part of the "make update a portable Beam > concept" project. > Big +1 to that. It's hard to know what to check for without defining it, and a Beam-level definition would avoid needing to ask a given runner for dry-run updates. It's on a longer-term plan, but I intend to add Pipeline Update as a feature to Prism. As Prism becomes more fully featured, it becomes a great test bed for developing the definitions. >
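As a strawman for what such a Beam-level, graph-to-graph check could look like, here is a minimal sketch. The rules below (transform names must persist, coders must not change) are illustrative assumptions, not Beam's actual update-compatibility rules, and the dict shapes are stand-ins for the real pipeline proto:

```python
# Hypothetical graph-to-graph compatibility check. Each graph is
# reduced to a dict of stable transform name -> coder id; the rules
# enforced here are assumptions for illustration only.

def incompatibilities(old_graph, new_graph):
    """Return human-readable reasons the update would be rejected."""
    problems = []
    for name, coder in old_graph.items():
        if name not in new_graph:
            problems.append("transform removed: " + name)
        elif new_graph[name] != coder:
            problems.append("coder changed for %s: %s -> %s"
                            % (name, coder, new_graph[name]))
    return problems  # empty list means "compatible" under these rules
```

A checker with this shape is what would make a local "dry run" possible: it needs only the two pipeline protos, not a live runner.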
Re: Streaming update compatibility
On Fri, Oct 27, 2023 at 7:50 AM Kellen Dye via dev wrote: > > > Auto is hard, because it would involve > > querying the runner before pipeline construction, and we may not even > > know what the runner is at this point > > At the point where pipeline construction will start, you should have access > to the pipeline arguments and be able to determine the runner. What seems to > be missing is a place to query the runner pre-construction. If that query > could return metadata about the currently running version of the job, then > that could be incorporated into graph construction as necessary. While this is the common case, it is not true in general. For example it's possible to cache the pipeline proto and submit it to a separate choice of runner later. We have Jobs API implementations that forward/proxy the job to other runners, and the Python interactive runner is another example where the runner is late-binding (e.g. one tries a sample locally, and if all looks good can execute remotely, and also in this case the graph that's submitted is often mutated before running). Also, in the spirit of the portability story, the pipeline definition itself should be runner-independent. > That same hook could be a place to for example return the currently-running > job graph for pre-submission compatibility checks. I suppose we could add something to the Jobs API to make "looking up a previous version of this pipeline" runner-agnostic, though that assumes it's available at construction time. And +1: as Kellen says, we should define (and be able to check) what pipeline compatibility means via a graph-to-graph comparison at the Beam level. I'll defer both of these as future work as part of the "make update a portable Beam concept" project.
Re: Streaming update compatibility
In Spotify's case we deploy streaming jobs via CI and would ideally verify compatibility as part of the build process before submitting to dataflow. Perhaps decoupled from the _running_ pipeline if we had a cache of previous pipeline versions. Currently the user experience is poor because any merge of a change ends up being potentially distant in time from the job submission and failure due to incompatibility. There's the additional friction of a few layers of infrastructure between the developer and the job failure logs which means we can't trivially deliver the failure reason to the developer. On Fri, Oct 27, 2023 at 11:07 AM Robert Burke wrote: > You raise a very good point: > > > https://github.com/apache/beam/blob/master/model/job-management/src/main/proto/org/apache/beam/model/job_management/v1/beam_job_api.proto#L54 > > The job management API does allow for the pipeline proto to be returned. > So one could get the live job, so the SDK could make reasonable decisions > before sending to the runner. > > Dataflow does have a similar API that xan be adapted. > > I am a touch concerned about spreading the update compatibility checks > around between SDKs and Runners though. But in some cases it would be > easier for the SDK, eg to ensure VersionA of a transform is used vs > VersionB, based on the existing transforma used in the job being updated. > > > On Fri, Oct 27, 2023, 7:50 AM Kellen Dye via dev > wrote: > >> > Auto is hard, because it would involve >> > querying the runner before pipeline construction, and we may not even >> > know what the runner is at this point >> >> At the point where pipeline construction will start, you should have >> access to the pipeline arguments and be able to determine the runner. What >> seems to be missing is a place to query the runner pre-construction. If >> that query could return metadata about the currently running version of the >> job, then that could be incorporated into graph construction as necessary. 
>> >> That same hook could be a place to for example return the >> currently-running job graph for pre-submission compatibility checks. >> >> >>
Re: Streaming update compatibility
You raise a very good point: https://github.com/apache/beam/blob/master/model/job-management/src/main/proto/org/apache/beam/model/job_management/v1/beam_job_api.proto#L54 The job management API does allow for the pipeline proto to be returned. One could get the live job, so the SDK could make reasonable decisions before sending to the runner. Dataflow does have a similar API that can be adapted. I am a touch concerned about spreading the update compatibility checks around between SDKs and Runners though. But in some cases it would be easier for the SDK, e.g. to ensure VersionA of a transform is used vs VersionB, based on the existing transforms used in the job being updated. On Fri, Oct 27, 2023, 7:50 AM Kellen Dye via dev wrote: > > Auto is hard, because it would involve > > querying the runner before pipeline construction, and we may not even > > know what the runner is at this point > > At the point where pipeline construction will start, you should have > access to the pipeline arguments and be able to determine the runner. What > seems to be missing is a place to query the runner pre-construction. If > that query could return metadata about the currently running version of the > job, then that could be incorporated into graph construction as necessary. > > That same hook could be a place to for example return the > currently-running job graph for pre-submission compatibility checks. > > >
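The SDK-side decision described here, preferring whatever transform version is already running in the job being updated, might look roughly like this. The URN strings and the newest-first ordering convention are illustrative assumptions, not real Beam identifiers:

```python
# Sketch: after fetching the running job's pipeline proto (e.g. via
# the job management API's pipeline lookup), the SDK picks between
# VersionA and VersionB of a transform to match the live job.

def choose_transform_version(existing_urns, candidate_urns):
    """existing_urns: URNs present in the running job's graph.
    candidate_urns: versions this SDK can emit, newest first."""
    for urn in candidate_urns:
        if urn in existing_urns:
            return urn  # match the live job for update safety
    return candidate_urns[0]  # fresh launch: take the newest
```

This keeps the policy in one place, which speaks to the concern about spreading compatibility checks between SDKs and runners: the runner only has to serve the proto, and the SDK makes the choice.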
Re: Streaming update compatibility
> Auto is hard, because it would involve > querying the runner before pipeline construction, and we may not even > know what the runner is at this point At the point where pipeline construction will start, you should have access to the pipeline arguments and be able to determine the runner. What seems to be missing is a place to query the runner pre-construction. If that query could return metadata about the currently running version of the job, then that could be incorporated into graph construction as necessary. That same hook could be a place to, for example, return the currently-running job graph for pre-submission compatibility checks.
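One hypothetical shape for such a pre-construction hook is sketched below. Nothing here is a real Beam API: the option name comes from the PR under discussion, and query_running_job is a stand-in for whatever runner-specific lookup would back it:

```python
# Sketch of a pre-construction hook: honor an explicit pin first,
# else ask the runner (if reachable) what the running job was built
# with, else fall back to the latest defaults.

def resolve_compatibility_version(options, query_running_job):
    """options: dict of pipeline options.
    query_running_job: callable returning metadata for the currently
    running job, or None if there is none (or the runner is unknown)."""
    explicit = options.get("update_compatibility_version")
    if explicit:
        return explicit  # the user pinned a version; respect it
    running = query_running_job()
    if running is not None:
        # Match the live job so the updated graph stays compatible.
        return running["sdk_version"]
    return None  # fresh launch: construct with the latest defaults
```

As the reply below notes, the hard part is that the runner is not always known (or even decided) at construction time, so query_running_job cannot be assumed to exist in general.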
Re: Streaming update compatibility
Alright, then it is clearer. Thank you for your answers! On Thu, Oct 26, 2023, 20:36 Robert Bradshaw wrote: > On Thu, Oct 26, 2023 at 3:59 AM Johanna Öjeling > wrote: > > > > Hi, > > > > I like this idea of making it easier to push out improvements, and had a > look at the PR. > > > > One question to better understand how it works today: > > > > The upgrades that the runners do, such as those not visible to the user, > can they be initiated at any time or do they only happen in relation to > that the user updates the running pipeline e.g. with new user code? > > Correct. We're talking about user-initiated changes to their pipeline here. > > > And, assuming the former, some reflections that came to mind when > reviewing the changes: > > > > Will the update_compatibility_version option be effective both when > creating and updating a pipeline? It is grouped with the update options in > the Python SDK, but users may want to configure the compatibility already > when launching the pipeline. > > It will be effective for both, though generally there's little > motivation to not always use the "latest" version when creating a new > pipeline. > > > Would it be possible to revert setting a fixed prior version, i.e. > (re-)enable upgrades? > > The contract would be IF you start with version X (which logically > defaults to the current SDK), THEN all updates also setting this to > version X (even on SDKs > X) should work. > > > If yes: in practice, would this motivate another option, or passing a > value like "auto" or "latest" to update_compatibility_version? > > Unset is interpreted as latest. Auto is hard, because it would involve > querying the runner before pipeline construction, and we may not even > know what the runner is at this point. (Eventually we could do things > like embed both alternative into the graph and let the runner choose, > but this is more speculative and may not be as scalable.) > > > The option is being introduced to the Java and Python SDKs. 
Should this > also be applicable to the Go SDK? > > Yes, allowing setting this value should be done for Go (and > typescript, and future SDKs) too. As Robert Burke mentioned, we need > to respect the value in those SDKs that have expansion service > implementations first. > > > On Thu, Oct 26, 2023 at 2:25 AM Robert Bradshaw via dev < > dev@beam.apache.org> wrote: > >> > >> Dataflow (among other runners) has the ability to "upgrade" running > >> pipelines with new code (e.g. capturing bug fixes, dependency updates, > >> and limited topology changes). Unfortunately some improvements (e.g. > >> new and improved ways of writing to BigQuery, optimized use of side > >> inputs, a change in algorithm, sometimes completely internally and not > >> visible to the user) are not sufficiently backwards compatible which > >> causes us, with the motivation to not break users, to either not make > >> these changes or guard them as a parallel opt-in mode which is a > >> significant drain on both developer productivity and causes new > >> pipelines to run in obsolete modes by default. > >> > >> I created https://github.com/apache/beam/pull/29140 which adds a new > >> pipeline option, update_compatibility_version, that allows the SDK to > >> move forward while letting users with pipelines launched previously to > >> manually request the "old" way of doing things to preserve update > >> compatibility. (We should still attempt backwards compatibility when > >> it makes sense, and the old way would remain in code until such a time > >> it's actually deprecated and removed, but this means we won't be > >> constrained by it, especially when it comes to default settings.) > >> > >> Any objections or other thoughts on this approach? > >> > >> - Robert > >> > >> P.S. 
Separately I think it'd be valuable to elevate the vague notion > >> of update compatibility to a first-class Beam concept and put it on > >> firm footing, but that's a larger conversation outside the thread of > >> this smaller (and I think still useful in such a future world) change. >
Re: Streaming update compatibility
On Thu, Oct 26, 2023 at 3:59 AM Johanna Öjeling wrote: > > Hi, > > I like this idea of making it easier to push out improvements, and had a look > at the PR. > > One question to better understand how it works today: > > The upgrades that the runners do, such as those not visible to the user, can > they be initiated at any time or do they only happen in relation to that the > user updates the running pipeline e.g. with new user code? Correct. We're talking about user-initiated changes to their pipeline here. > And, assuming the former, some reflections that came to mind when reviewing > the changes: > > Will the update_compatibility_version option be effective both when creating > and updating a pipeline? It is grouped with the update options in the Python > SDK, but users may want to configure the compatibility already when launching > the pipeline. It will be effective for both, though generally there's little motivation to not always use the "latest" version when creating a new pipeline. > Would it be possible to revert setting a fixed prior version, i.e. > (re-)enable upgrades? The contract would be IF you start with version X (which logically defaults to the current SDK), THEN all updates also setting this to version X (even on SDKs > X) should work. > If yes: in practice, would this motivate another option, or passing a value > like "auto" or "latest" to update_compatibility_version? Unset is interpreted as latest. Auto is hard, because it would involve querying the runner before pipeline construction, and we may not even know what the runner is at this point. (Eventually we could do things like embed both alternatives into the graph and let the runner choose, but this is more speculative and may not be as scalable.) > The option is being introduced to the Java and Python SDKs. Should this also > be applicable to the Go SDK? Yes, allowing setting this value should be done for Go (and typescript, and future SDKs) too. 
As Robert Burke mentioned, we need to respect the value in those SDKs that have expansion service implementations first. > On Thu, Oct 26, 2023 at 2:25 AM Robert Bradshaw via dev > wrote: >> >> Dataflow (among other runners) has the ability to "upgrade" running >> pipelines with new code (e.g. capturing bug fixes, dependency updates, >> and limited topology changes). Unfortunately some improvements (e.g. >> new and improved ways of writing to BigQuery, optimized use of side >> inputs, a change in algorithm, sometimes completely internally and not >> visible to the user) are not sufficiently backwards compatible which >> causes us, with the motivation to not break users, to either not make >> these changes or guard them as a parallel opt-in mode which is a >> significant drain on both developer productivity and causes new >> pipelines to run in obsolete modes by default. >> >> I created https://github.com/apache/beam/pull/29140 which adds a new >> pipeline option, update_compatibility_version, that allows the SDK to >> move forward while letting users with pipelines launched previously to >> manually request the "old" way of doing things to preserve update >> compatibility. (We should still attempt backwards compatibility when >> it makes sense, and the old way would remain in code until such a time >> it's actually deprecated and removed, but this means we won't be >> constrained by it, especially when it comes to default settings.) >> >> Any objections or other thoughts on this approach? >> >> - Robert >> >> P.S. Separately I think it'd be valuable to elevate the vague notion >> of update compatibility to a first-class Beam concept and put it on >> firm footing, but that's a larger conversation outside the thread of >> this smaller (and I think still useful in such a future world) change.
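The IF/THEN contract stated above can be written down as a tiny predicate. The function and field names here are illustrative, not part of any Beam API:

```python
# Sketch of the stated contract: a pipeline's effective compatibility
# version is its pinned version, defaulting to the SDK that launched
# it; an update is expected to work when the effective versions match,
# even if the updating SDK is newer.

def effective_version(sdk_version, pinned=None):
    return pinned if pinned is not None else sdk_version

def update_should_work(launch_sdk, launch_pin, update_sdk, update_pin):
    return (effective_version(launch_sdk, launch_pin)
            == effective_version(update_sdk, update_pin))
```

This also makes the "unset is interpreted as latest" answer concrete: an unset pin binds to the launching SDK's version, so a later SDK must pin explicitly to stay inside the contract.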
Re: Streaming update compatibility
Regarding 3. I suspect Go wasn't changed because the PR is centering around implementations of the Expansion Service server, not client callers. The Go SDK doesn't yet have an expansion service. On Thu, Oct 26, 2023, 3:59 AM Johanna Öjeling via dev wrote: > Hi, > > I like this idea of making it easier to push out improvements, and had a > look at the PR. > > One question to better understand how it works today: > >1. The upgrades that the runners do, such as those not visible to the >user, can they be initiated at any time or do they only happen in relation >to that the user updates the running pipeline e.g. with new user code? > > And, assuming the former, some reflections that came to mind when > reviewing the changes: > >1. Will the update_compatibility_version option be effective both when >creating and updating a pipeline? It is grouped with the update options in >the Python SDK, but users may want to configure the compatibility already >when launching the pipeline. >2. Would it be possible to revert setting a fixed prior version, i.e. >(re-)enable upgrades? > 1. If yes: in practice, would this motivate another option, or > passing a value like "auto" or "latest" to update_compatibility_version? >3. The option is being introduced to the Java and Python SDKs. Should >this also be applicable to the Go SDK? > > Thanks, > Johanna > > On Thu, Oct 26, 2023 at 2:25 AM Robert Bradshaw via dev < > dev@beam.apache.org> wrote: > >> Dataflow (among other runners) has the ability to "upgrade" running >> pipelines with new code (e.g. capturing bug fixes, dependency updates, >> and limited topology changes). Unfortunately some improvements (e.g. 
>> new and improved ways of writing to BigQuery, optimized use of side >> inputs, a change in algorithm, sometimes completely internally and not >> visible to the user) are not sufficiently backwards compatible which >> causes us, with the motivation to not break users, to either not make >> these changes or guard them as a parallel opt-in mode which is a >> significant drain on both developer productivity and causes new >> pipelines to run in obsolete modes by default. >> >> I created https://github.com/apache/beam/pull/29140 which adds a new >> pipeline option, update_compatibility_version, that allows the SDK to >> move forward while letting users with pipelines launched previously to >> manually request the "old" way of doing things to preserve update >> compatibility. (We should still attempt backwards compatibility when >> it makes sense, and the old way would remain in code until such a time >> it's actually deprecated and removed, but this means we won't be >> constrained by it, especially when it comes to default settings.) >> >> Any objections or other thoughts on this approach? >> >> - Robert >> >> P.S. Separately I think it'd be valuable to elevate the vague notion >> of update compatibility to a first-class Beam concept and put it on >> firm footing, but that's a larger conversation outside the thread of >> this smaller (and I think still useful in such a future world) change. >> >
Re: Streaming update compatibility
Hi,

I like this idea of making it easier to push out improvements, and had a look at the PR.

One question to better understand how it works today:

1. The upgrades that the runners do, such as those not visible to the user: can they be initiated at any time, or do they only happen in relation to the user updating the running pipeline, e.g. with new user code?

And, assuming the former, some reflections that came to mind when reviewing the changes:

1. Will the update_compatibility_version option be effective both when creating and updating a pipeline? It is grouped with the update options in the Python SDK, but users may want to configure the compatibility already when launching the pipeline.
2. Would it be possible to revert setting a fixed prior version, i.e. (re-)enable upgrades?
   1. If yes: in practice, would this motivate another option, or passing a value like "auto" or "latest" to update_compatibility_version?
3. The option is being introduced to the Java and Python SDKs. Should this also be applicable to the Go SDK?

Thanks,
Johanna

On Thu, Oct 26, 2023 at 2:25 AM Robert Bradshaw via dev wrote: > Dataflow (among other runners) has the ability to "upgrade" running > pipelines with new code (e.g. capturing bug fixes, dependency updates, > and limited topology changes). Unfortunately some improvements (e.g. 
> > I created https://github.com/apache/beam/pull/29140 which adds a new > pipeline option, update_compatibility_version, that allows the SDK to > move forward while letting users with pipelines launched previously to > manually request the "old" way of doing things to preserve update > compatibility. (We should still attempt backwards compatibility when > it makes sense, and the old way would remain in code until such a time > it's actually deprecated and removed, but this means we won't be > constrained by it, especially when it comes to default settings.) > > Any objections or other thoughts on this approach? > > - Robert > > P.S. Separately I think it'd be valuable to elevate the vague notion > of update compatibility to a first-class Beam concept and put it on > firm footing, but that's a larger conversation outside the thread of > this smaller (and I think still useful in such a future world) change. >
Streaming update compatibility
Dataflow (among other runners) has the ability to "upgrade" running pipelines with new code (e.g. capturing bug fixes, dependency updates, and limited topology changes). Unfortunately some improvements (e.g. new and improved ways of writing to BigQuery, optimized use of side inputs, a change in algorithm, sometimes completely internally and not visible to the user) are not sufficiently backwards compatible which causes us, with the motivation to not break users, to either not make these changes or guard them as a parallel opt-in mode which is a significant drain on both developer productivity and causes new pipelines to run in obsolete modes by default. I created https://github.com/apache/beam/pull/29140 which adds a new pipeline option, update_compatibility_version, that allows the SDK to move forward while letting users with pipelines launched previously to manually request the "old" way of doing things to preserve update compatibility. (We should still attempt backwards compatibility when it makes sense, and the old way would remain in code until such a time it's actually deprecated and removed, but this means we won't be constrained by it, especially when it comes to default settings.) Any objections or other thoughts on this approach? - Robert P.S. Separately I think it'd be valuable to elevate the vague notion of update compatibility to a first-class Beam concept and put it on firm footing, but that's a larger conversation outside the thread of this smaller (and I think still useful in such a future world) change.
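For concreteness, a user relaunching a job with an update might pin the previous behavior roughly like this. Only the --update_compatibility_version option comes from the PR; the pipeline file, runner flags, and value format are illustrative assumptions:

```shell
# Hypothetical invocation: update a running job while pinning update
# compatibility to the version the pipeline was originally launched
# with, so the SDK emits the "old" graph shape it understands.
python my_pipeline.py \
  --runner=DataflowRunner \
  --update \
  --update_compatibility_version=2.50.0
```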