Re: [Go SDK] Direct Runner Replacement: Prism

Robert Burke Thu, 16 Feb 2023 15:02:28 -0800

Next up:

https://github.com/apache/beam/pull/25478 - Large PR for initial handling of 
worker FnAPI surfaces. 
https://github.com/apache/beam/pull/25518 - Tiny PR for handling basic 
windowing strategies.
https://github.com/apache/beam/pull/25520 - Medium PR for adding the graph 
preprocessor scaffolding



On 2023/02/15 05:41:51 Robert Burke via dev wrote:
> Here are the next two chunks!
> 
> https://github.com/apache/beam/pull/25476 - Coder / element / bytes
> handling internally for prism.
> https://github.com/apache/beam/pull/25478 - Worker fnAPI handling.
> 
> Took a bit to get a baseline of unit testing in for these, since they were
> covered by whole pipeline runs.
> Coders in particular, since they currently live in the package with the
> pipeline tests, so it was harder to ensure
> coverage in a vacuum.
> 
> But they did force a bit of documentation improvements, and a neglected
> inefficiency I had in the original coder structure.
> 
> So small pain now, but will make sure future development is a bit easier,
> as convenient as "just write a pipeline" is for testing.
> Sometimes you just want to ensure the protocol works.
> 
> On Thu, Feb 9, 2023 at 2:50 PM Kenneth Knowles <[email protected]> wrote:
> 
> > Just a +100 to the idea of this runner. Having an easy-to-read,
> > portable-execution, batch & streaming, parallel, local runner, that
> > exercises plenty of advanced model features... solid gold!
> >
> > On Thu, Feb 9, 2023 at 12:01 PM Robert Burke via dev <[email protected]>
> > wrote:
> >
> >> Here are the first of the smaller PRs:
> >>
> >> https://github.com/apache/beam/pull/25404 -> Adds READMEs and updates
> >> go.mod so later changes don't collide there.
> >> https://github.com/apache/beam/pull/25405 -> Adds internal/urns package
> >> for extracting URNs from the protos.
> >> https://github.com/apache/beam/pull/25406 -> Adds internal/config
> >> package for parsing and accessing the configuration of variants and
> >> handlers in the runner.
> >>
> >> These are independant changes, and small enough for quicker review. The
> >> remaining larger packages can be submitted more piecemeal once these are 
> >> in.
> >>
> >>
> >>
> >> On Wed, Feb 8, 2023 at 3:23 PM Robert Burke <[email protected]> wrote:
> >>
> >>> Hello Beam!
> >>>
> >>> == tl;dr; ==
> >>>
> >>> I wrote a local, portable Beam runner in Go to replace the Go direct
> >>> runner.  I'd like to contribute it to the Beam Repo. The Big PR with
> >>> everything is here: https://github.com/apache/beam/pull/25391
> >>>
> >>> I'll be sending smaller PRs out for review to get it into the repo. Take
> >>> a look at the big one, don't mind the mess, but do ask questions, or offer
> >>> constructive suggestions to make it clearer. There are ample TODOs that
> >>> could be added. This thread will be kept up to date with the progress.
> >>>
> >>> Highlights:
> >>> Avoids false positive issues the Go Direct runner has, especially around
> >>> serialization issues.
> >>> Single transform at a time execution.
> >>> Watermark propagation through Graph for GBKs and Side Input windowing.
> >>> Will be capable of testing the whole Go SDK, in time.
> >>> Will be capable of being a stand alone single binary runner, in time.
> >>> ++Many opportunities for contribution after getting into the repo!++
> >>>
> >>> Lowlights:
> >>> Only for Go SDK, for now.
> >>> ~~Many unimplemented features~~
> >>>
> >>> Where to start reading?
> >>>
> >>> Vision README:
> >>> https://github.com/apache/beam/blob/9044f2d4ae151f4222a2f3e0a3264c1198040181/sdks/go/pkg/beam/runners/prism/README.md
> >>>
> >>>
> >>> Code Structure README:
> >>> https://github.com/apache/beam/blob/9044f2d4ae151f4222a2f3e0a3264c1198040181/sdks/go/pkg/beam/runners/prism/internal/README.md
> >>>
> >>>
> >>> executePipeline entrypoint:
> >>> https://github.com/apache/beam/blob/9044f2d4ae151f4222a2f3e0a3264c1198040181/sdks/go/pkg/beam/runners/prism/internal/execute.go#L41
> >>>
> >>>
> >>>
> >>> == The long version ==
> >>>
> >>> Since last year, I was puttering away at making a Portable Beam Runner
> >>> authored in Go. Partly because I wanted to learn the "runner" half of 
> >>> beam,
> >>> and partly because the Go Direct Runner (and most other direct runners),
> >>> are not good at testing.
> >>>
> >>> I managed to get it roughly ready for basic batch execution by end of
> >>> February 2022 , and then 2022 got away from me. And I couldn't pick it up
> >>> until the end of the year.
> >>>
> >>> I gave a talk about this at Beam Summit 2022
> >>> https://2022.beamsummit.org/sessions/portable-go-beam-runner/ that
> >>> covers my motivation for it. Loosely, Beam has a Testing Problem. There 
> >>> are
> >>> large parts of Beam execution that matter for real world performance and
> >>> correctness, but the facilities to test these don't exist.  For example,
> >>> take Combiner Lifting, if a combiner is unlifted, but implements
> >>> AddInput... then Merge is never called, leaving it untested. And the user
> >>> has no control over this, or may not even be aware of it. How a DoFn is
> >>> executed matters for coverage, and user confidence.  In particular for
> >>> Streaming jobs, users will tend to try things out on their Prod runner, 
> >>> but
> >>> that doesn't help if one is testing on local Flink, but executing on 
> >>> Google
> >>> Cloud Dataflow, which behave very differently.
> >>>
> >>> Regardless of whether you agree with that thesis...  I wanted to fill
> >>> that gap. I wanted a runner that could be configured to test those
> >>> situations, and in particular, make it easier to develop SDKs and all the
> >>> features of Beam that don't get their own blog posts.
> >>>
> >>> Especially for the Go SDK. Java, being the oldest, has arguably the only
> >>> "correct" beam runner, in the form of the Java Direct Runner. But one 
> >>> can't
> >>> execute Go pipelines on that. Python has a portable execution of its
> >>> runner, but the current state of python is Parallelism hostile at best. It
> >>> supports a great many things, like Cross Language, but can't support
> >>> streaming execution (ProcessContinations etc) at present. Also, being a
> >>> large Python program, it's harder to follow.  The Java Direct runner, 
> >>> while
> >>> being slightly easier to follow, doesn't have a clear execution flow.
> >>> Neither of them are particularly easy for Non Language Experts to stand up
> >>> and use, especially outside of the Beam repo.
> >>>
> >>> The Go SDK's Direct Runner has many flaws, most of which are due to
> >>> Direct execution, rather than Portable Execution.  Implementing features
> >>> largely meant hacking certain things in, so they would be able to be
> >>> executed. This also made supporting and testing Cross Language Transforms,
> >>> State and Timers in Go pipelines a non-starter for users. And that's just
> >>> the tip.
> >>>
> >>> So I wanted something better. I mentioned it a few times to others, but
> >>> I kept hearing the same refrain: "I want something that does that". Or at
> >>> least they wanted something simpler to understand to hack against
> >>> themselves.
> >>>
> >>> I added more tests, and implemented more features, filed a tracking
> >>> issue (https://github.com/apache/beam/issues/24789), re-wrote the whole
> >>> engine from scratch to actually support watermarks, and terminating 
> >>> Process
> >>> Continuations. In the process I found and fixed some bugs in the Go SDK
> >>> too. Many conversations were happening so I could understand how things 
> >>> are
> >>> supposed to work.
> >>>
> >>> At some point, chatting with Jack McCluskey (@jrmccluskey) we came up
> >>> with the name Prism (among many many other contenders). I wanted the 
> >>> runner
> >>> to be able to split up and examine the different components of Beam and
> >>> combine them in different ways, and Prisms do that for Beams of light. The
> >>> name stuck as the best one. Formally, it will be the Apache Beam Prism
> >>> Runner. It can always be argued about
> >>>
> >>> And then I had to stop and clean it up. If I kept going. eventually it
> >>> becomes too big to review. I've spent the last month, reorganizing the 
> >>> code
> >>> ~6600+ lines of code and comments. I hope it's clear enough for others to
> >>> follow at this point, and the initial review PRs will help keep things
> >>> small.
> >>>
> >>> My expectations for now are to send out this email and have people take
> >>> a first look at the BIG has everything PR:
> >>> https://github.com/apache/beam/pull/25391. That branch will be
> >>> canonical for other changes until everything is in the beam repo. It will
> >>> be kept up to date with changes in the smaller PRs.
> >>>
> >>> If you'd like a place to start, I recommend the main README.md which has
> >>> the rationale and goals.
> >>>
> >>> https://github.com/apache/beam/blob/9044f2d4ae151f4222a2f3e0a3264c1198040181/sdks/go/pkg/beam/runners/prism/README.md
> >>>
> >>>
> >>> Follow that with the structure README.md in the internal directory.
> >>>
> >>> https://github.com/apache/beam/blob/9044f2d4ae151f4222a2f3e0a3264c1198040181/sdks/go/pkg/beam/runners/prism/internal/README.md
> >>>
> >>> Finally, see how the (post job submission) part of how a job executes,
> >>> starting with the executePipeline function:
> >>> https://github.com/apache/beam/blob/9044f2d4ae151f4222a2f3e0a3264c1198040181/sdks/go/pkg/beam/runners/prism/internal/execute.go#L41
> >>>
> >>>
> >>> In the meantime, as the code is going out for review, I'll be improving
> >>> Unit Test coverage of the sub packages in the structure. When it was still
> >>> a mono-package though, ~85% coverage was achieved via the test pipelines.
> >>>
> >>> In the medium term, I'd like to get it working standalone, so any user
> >>> with a Go install can get a working job runner with a quick `go install
> >>> github.com/apache/beam/sdks/go/cmd/prism; prism`, and have it work
> >>> indefinitely with tiny scale streaming pipelines.
> >>>
> >>> Longer term, I'd like to get the Java Validates Runner tests executing
> >>> on it, which will properly validate correctness of details I'm not aware
> >>> of, and are not covered in the Go SDK integration tests.
> >>>
> >>> As stated, the primary purpose is to simplify testing of Beam pipelines
> >>> (especially for the Go SDK) and SDK development. As long as it can be used
> >>> for that, it can also be expanded to do more in time. It's not expected to
> >>> become "the best" runner, let alone a distributed runner.
> >>>
> >>> I look forward to your thoughts!
> >>>
> >>> Robert Burke
> >>> Beam Go Busy Body
> >>>
> >>>
>

Re: [Go SDK] Direct Runner Replacement: Prism

Reply via email to