WOW! Big news.

I'm supportive of leaving experimental status after Go Modules are
completed and the LICENSE issue is resolved. I don't think that lacking
streaming support is a blocker. I also checked whether
metrics.beam.apache.org has metrics for tracking code health via post-commit
results over time. It does, and the passing test rate is high (Huzzah!).
The one thing that surprised me from
your summary is that when Go introduces generics it won't result in any
backwards incompatible changes in Apache Beam. That's great news, but does
it mean there will be a need to support both non-generic and generic APIs
moving forward? It seems like generics will be introduced in the Go 1.17
release (optimistically) in August this year.
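
To make the question about supporting both API styles concrete, here is a
minimal sketch of how a non-generic and a generic pair type could live side by
side in one package. All names here (UntypedKV, KV, ToTyped, ToUntyped) are
hypothetical illustrations and are not taken from the Beam Go SDK; the sketch
assumes a Go toolchain that supports type parameters.

```go
package main

import "fmt"

// UntypedKV is a hypothetical pre-generics pair: both halves are interface{},
// so callers recover the concrete types with type assertions.
type UntypedKV struct {
	Key   interface{}
	Value interface{}
}

// KV is a hypothetical generic pair. It needs a Go toolchain with
// type-parameter support.
type KV[K, V any] struct {
	Key   K
	Value V
}

// ToTyped converts an UntypedKV to a KV[K, V]. ok is false if either half
// does not hold the requested type.
func ToTyped[K, V any](u UntypedKV) (kv KV[K, V], ok bool) {
	k, kok := u.Key.(K)
	v, vok := u.Value.(V)
	if !kok || !vok {
		return KV[K, V]{}, false
	}
	return KV[K, V]{Key: k, Value: v}, true
}

// ToUntyped erases the type parameters so that generic values can still be
// handed to older, non-generic code.
func ToUntyped[K, V any](kv KV[K, V]) UntypedKV {
	return UntypedKV{Key: kv.Key, Value: kv.Value}
}

func main() {
	old := UntypedKV{Key: "beam", Value: 3}
	if typed, ok := ToTyped[string, int](old); ok {
		fmt.Println(typed.Key, typed.Value+1)
	}
	fmt.Println(ToUntyped(KV[string, int]{Key: "go", Value: 1}))
}
```

In a layout like this the generic type is purely additive, so existing callers
of the untyped form keep compiling. Whether Beam would take this route is
exactly what I'm asking.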



On Thu, Jun 10, 2021 at 5:04 PM Robert Burke <[email protected]> wrote:

> Hello Beam Community!
>
> I propose we stop calling the Apache Beam Go SDK experimental.
>
> This thread is to discuss it as a community, and any conditions that
> remain that would prevent the exit.
>
> *tl;dr;*
> *Ask Questions for answers and links! I have both.*
> This entails including it officially in the release process, removing the
> various "experimental" text throughout the repo, and otherwise treating it
> like Python and Java. There are also some Go-specific tasks around
> dependency versioning.
>
> The Go SDK implements the Beam model efficiently for most batch tasks,
> including basic windowing.
> Apache Beam Go jobs can execute, and are tested on all Portable runners.
> The core APIs are not going to change in incompatible ways going forward.
> Scalable transforms can be written through SplittableDoFns or via Cross
> Language transforms.
>
> The SDK isn't 100% feature complete, but keeping it experimental doesn't
> help with that any further.
> Communities grow through contributions and use, and experimental markers
> dissuade users.
> There's plenty to do in order to expand what can be done with the SDK.
> (Contributions welcome)
>
> *Why Exit Experimental now?*
>
> Typically when we call an SDK or API Experimental, it's because there's a
> risk that API or behaviors may change significantly.
> This, in turn, leads to additional work for users of the SDK on every
> release, which leads to sticking to older versions or forking
> to preserve behavior. Version updates should be looked forward to, and
> viewed as having little risk. Further, while there's been
> previous discussion about what the "low bar" is for a new SDK, it hasn't
> been summarily applied to the Go SDK. I feel this has
> hurt development and contribution of new SDK languages (inherent
> difficulty of SDK development notwithstanding).
>
> When the SDK was designed, it wasn't entirely clear what the Beam Model
> should look like in an opinionated language like Go.
> The initial take (see https://s.apache.org/beam-go-sdk-design-rfc [0])
> goes into detail on what it means for a language without
> generics, overloading, or inheritance to implement the Beam model. One
> could largely throw away static types (like Python),
> but this approach rings hollow for Go. It would not do if the approach
> couldn't grow and scale to the Beam Model. It's also hard
> to tell if an API is any good before there are users.
>
> Further, in the early days of Portability, there wasn't a way to write
> scalable DoFns, dynamically or otherwise. It's an incredible
> bottleneck to need to do all initial fanout of work on a single machine,
> write everything to a Reshuffle, just in order to scale up.
> Without being able to scale, Beam is little more than overhead.
>
> At this point, both of these needs are met within the Go SDK for open
> source.
>
> *Background*
>
> The Go SDK has been a part of the beam repo for a few years now, since it
> was accidentally merged into master.
> Since then it's been called experimental, and not officially part of the
> releases.
>
> Of the SDKs, it was always designed around Beam Portability first. It
> never had any "Legacy" (SDK x Runner specific) workers.
> It's always used the Beam Pipeline protos and FnAPI to execute jobs, first
> with some very experimental code on Dataflow, but now
> on all portable supported runners, like Flink, Spark, the Python Portable
> runner, and Dataflow.
>
> *API Stability*
>
> The Go SDK hasn't meaningfully changed its user API for DoFn and pipeline
> construction since it was first merged in, and there are no
> changes to that on the horizon that can't be made in a backwards
> compatible manner. Largely these are related to New Features, or
> usability improvements enabled by the advent of Go Generics (think of
> "real" KV, emitter, and iterator types).
>
> It's an open secret that the Go SDK has largely been developed for use
> within Google. Its use there is called FlumeGo, representing
> the Apache Beam Go SDK, running on top of Flume, Google's batch pipeline
> processing engine. Thus most of the focus has been on improving
> batch execution. FlumeGo sees ample use today, and there hasn't been a
> call for fundamental changes to the API for ergonomic or
> usability concerns.
>
> *Scalability*
>
> Google could get away without the Go SDK having an SDK side scalability
> solution as a result of its integration with Flume.
> However, those days are now past.
>
> The Go SDK now supports SplittableDoFns along with Dynamic Splitting,
> which supports writing scalable batch transforms natively
> in the Go SDK.
> The SDK also supports Cross Language Transforms, with Beam Schema
> encodings. With it, production hardened transforms
> from Java and Python are a wrapper away.
>
> Presently, Daniel Oliveira (who implemented the SDF side work and
> completed the Xlang work) is adding a wrapper for the
> Java Kafka IO using Cross Language Transforms, which has often been
> requested. This will also enable use of the Beam SQL
> transforms that Java enables.
>
> *Features*
>
> The Go SDK implements the Beam core. The Go SDK implements standard
> coders, allows for user DoFns, and CombineFns and access
> to core transforms like Flatten, GroupByKey, and features like Side
> Inputs, Windowing, and User Metrics.
> Basic windowing will be fully supported for batch even through lifted
> combines in the 2.32.0 release.
>
> All of the above enables Beam Go to be versatile for batch execution on
> portable runners, and for simple streaming pipelines.
>
> *Repo Testing*
>
> On precommit the Go SDK runs all its unit tests. On top of that, it runs
> all its integration tests against the Python Portable runner,
> making it quick and robust to detect breaking changes without overspending
> community resources. Those same tests are also
> run against Dataflow, Flink, and Spark.
>
> The tests are executable against all runners via the appropriate Go
> commands (if you've stood up your own job management server),
> or Gradle commands (which will spin up runner instances for you).
> Documentation for executing tests and adding new ones
> is on the wiki. [2] They are accessible to Go developers as they're
> implemented with the standard Go testing tools.
>
> *Shortcomings*
> That said, there's still much to do. Let me briefly tell you what doesn't
> work, and it's up to you to weigh whether they block
> being out of experimental.
>
> At present, only textio has been implemented as a Splittable DoFn.
> Once the Kafka wrapper is merged in, it will serve as the first example
> for future contributions of
> new transform wrappers for the Go SDK.
> Transforms and IOs are lacking, but at this point users are empowered to
> write their own DoFns or wrap existing transforms for Cross Language use.
>
> In the core SDK, more streaming focused features have yet to be
> implemented, but they're largely additions to what exists already
> rather than total rebuilds. Much of the work is defining how a user
> specifies their desires, and turning those into the appropriate
> FnAPI requests at execution time. Back in October I wrote at length on the
> wiki [1] what's missing for additional streaming features.
>
> While we have bolstered our testing recently, there's likely still more we
> could test to improve our confidence in the SDK,
> in particular regarding the included transforms libraries and examples.
>
> *Moving Forward*
>
> My immediate plan is to work on incorporating the Go SDK fully into the
> Beam Programming Guide. I've audited the guide [3], and
> am beginning to add missing content and fill in the Go specific gaps.
> This will be tied to improving the Go Doc with more Go
> specific user documentation that isn't appropriate for the BPG,
> and to resolving the LICENSE issue around the public display of that GoDoc.
>
> If this proposal is accepted by a binding vote, I will incorporate the SDK
> into the release process, and remove the "experimental"
> language around the SDK. This largely entails updating the release scripts
> to also build and publish the Go SDK Docker containers.
> As for releasing the code, we're technically already doing so whenever we
> tag a release branch [4].
>
> The clearest signal to the Go community however will be migrating the SDK
> to use Go Modules for dependency version control,
> which Daniel is planning on working on after his Kafka task. This will put
> our repo infrastructure, SDK contributors, and users
> on the same footing when it comes to dependency management. It will remove
> the "+incompatible" tags one sees on the
> pkg.go.dev list at [4].
>
> I'm very happy to answer any questions you might have about the SDK, and
> provide additional links as needed. I intentionally avoided
> a link barrage in this email, as they can distract from the point: The SDK
> is ready for folks to use it, and we need to tell them that they can
> rather than that they shouldn't.
>
> Robert Burke
> De facto Beam Go TL
>
> [0] https://s.apache.org/beam-go-sdk-design-rfc
> [1]
> https://cwiki.apache.org/confluence/display/BEAM/Supporting+Streaming+in+the+Go+SDK
> [2] https://cwiki.apache.org/confluence/display/BEAM/Go+Tips
> [3]
> https://docs.google.com/spreadsheets/d/1DrBFjxPBmMMmPfeFr6jr_JndxGOes8qDqKZ2Uxwvvds/edit?resourcekey=0-tVFwcLrQ2v2jpZkHk6QOpQ#gid=2072310090
> (SDK Audit sheet)
> [4]
> https://pkg.go.dev/github.com/apache/beam/sdks/go/pkg/beam?tab=versions
>
