Re: [Go SDK] Direct Runner Replacement: Prism

Robert Burke Mon, 20 Feb 2023 15:31:12 -0800

Johanna managed to take a look at the penultimate PR, so that's now in, and
finally, here's the last one.


https://github.com/apache/beam/pull/25568 - Add in the execution stack
connecting preprocessing and execution. Includes all pipeline execution
unit tests.

On Mon, 20 Feb 2023 at 10:26, Robert Burke <rob...@frantil.com> wrote:

> We're in the home stretch! Only 2 PRs to go, and then Prism will be
> available for use from the Beam Repo, hopefully with the 2.46 cut. Also, it
> would be available for others to improve and meaningfully extend.
>
> https://github.com/apache/beam/pull/25565 - Adds in the primary element
> manager. This is where watermark handling and bundling decisions are made
> for stages.
>
> Once that PR is in, there's the execution handling which puts everything
> together, and the entry point, and the suite of pipeline execution tests.
>
> Robert Burke
> Beam Go Busybody
>
> On Sun, 19 Feb 2023 at 18:25, Robert Burke <rob...@frantil.com> wrote:
>
>> I had to scratch the itch this weekend, and now have Prism able to pay
>> attention to Estimated Output Watermarks, and send elements downstream if
>> the transform is not blocked
>> by watermarks somewhere. Woohoo! Is it fully correct?  Probably not, but
>> it's a start.
>>
>> Transforms are still executed one at a time, but this is currently a
>> matter of implementation, rather than inability. Changing that
>> implementation will occur with progress tracking and split request
>> handling, which will be required by the separation harness tests.
>>
>> All data is presently cached and stored indefinitely in memory which
>> prevents indefinite runs of a fully unbounded pipeline. Primary inputs
>> consume & garbage collect data as the pipeline advances, but since side
>> inputs are re-used they use their own thing, and the system isn't aware
>> when it can be safely garbage collected.
>>
>> Big Everything PR is at https://github.com/apache/beam/pull/25391
>>
>> Next up PRs:
>>
>> https://github.com/apache/beam/pull/25556 - Remaining initial job
>> services.
>> https://github.com/apache/beam/pull/25557 - Test DoFns for later use in
>> pipelines.
>> https://github.com/apache/beam/pull/25558 - Handling graph
>> transformations for Combine & SDF composites, executing Flattens and GBKs
>>
>> On Thu, 16 Feb 2023 at 15:02, Robert Burke <lostl...@apache.org> wrote:
>>
>>> Next up:
>>>
>>> https://github.com/apache/beam/pull/25478 - Large PR for initial
>>> handling of worker FnAPI surfaces.
>>> https://github.com/apache/beam/pull/25518 - Tiny PR for handling basic
>>> windowing strategies.
>>> https://github.com/apache/beam/pull/25520 - Medium PR for adding the
>>> graph preprocessor scaffolding
>>>
>>>
>>> On 2023/02/15 05:41:51 Robert Burke via dev wrote:
>>> > Here are the next two chunks!
>>> >
>>> > https://github.com/apache/beam/pull/25476 - Coder / element / bytes
>>> > handling internally for prism.
>>> > https://github.com/apache/beam/pull/25478 - Worker fnAPI handling.
>>> >
>>> > Took a bit to get a baseline of unit testing in for these, since they
>>> were
>>> > covered by whole pipeline runs.
>>> > Coders in particular, since they currently live in the package with the
>>> > pipeline tests, so it was harder to ensure
>>> > coverage in a vacuum.
>>> >
>>> > But they did force a bit of documentation improvements, and a neglected
>>> > inefficiency I had in the original coder structure.
>>> >
>>> > So small pain now, but will make sure future development is a bit
>>> easier,
>>> > as convenient as "just write a pipeline" is for testing.
>>> > Sometimes you just want to ensure the protocol works.
>>> >
>>> > On Thu, Feb 9, 2023 at 2:50 PM Kenneth Knowles <k...@apache.org>
>>> wrote:
>>> >
>>> > > Just a +100 to the idea of this runner. Having an easy-to-read,
>>> > > portable-execution, batch & streaming, parallel, local runner, that
>>> > > exercises plenty of advanced model features... solid gold!
>>> > >
>>> > > On Thu, Feb 9, 2023 at 12:01 PM Robert Burke via dev <
>>> dev@beam.apache.org>
>>> > > wrote:
>>> > >
>>> > >> Here are the first of the smaller PRs:
>>> > >>
>>> > >> https://github.com/apache/beam/pull/25404 -> Adds READMEs and
>>> updates
>>> > >> go.mod so later changes don't collide there.
>>> > >> https://github.com/apache/beam/pull/25405 -> Adds internal/urns
>>> package
>>> > >> for extracting URNs from the protos.
>>> > >> https://github.com/apache/beam/pull/25406 -> Adds internal/config
>>> > >> package for parsing and accessing the configuration of variants and
>>> > >> handlers in the runner.
>>> > >>
>>> > >> These are independant changes, and small enough for quicker review.
>>> The
>>> > >> remaining larger packages can be submitted more piecemeal once
>>> these are in.
>>> > >>
>>> > >>
>>> > >>
>>> > >> On Wed, Feb 8, 2023 at 3:23 PM Robert Burke <lostl...@apache.org>
>>> wrote:
>>> > >>
>>> > >>> Hello Beam!
>>> > >>>
>>> > >>> == tl;dr; ==
>>> > >>>
>>> > >>> I wrote a local, portable Beam runner in Go to replace the Go
>>> direct
>>> > >>> runner.  I'd like to contribute it to the Beam Repo. The Big PR
>>> with
>>> > >>> everything is here: https://github.com/apache/beam/pull/25391
>>> > >>>
>>> > >>> I'll be sending smaller PRs out for review to get it into the
>>> repo. Take
>>> > >>> a look at the big one, don't mind the mess, but do ask questions,
>>> or offer
>>> > >>> constructive suggestions to make it clearer. There are ample TODOs
>>> that
>>> > >>> could be added. This thread will be kept up to date with the
>>> progress.
>>> > >>>
>>> > >>> Highlights:
>>> > >>> Avoids false positive issues the Go Direct runner has, especially
>>> around
>>> > >>> serialization issues.
>>> > >>> Single transform at a time execution.
>>> > >>> Watermark propagation through Graph for GBKs and Side Input
>>> windowing.
>>> > >>> Will be capable of testing the whole Go SDK, in time.
>>> > >>> Will be capable of being a stand alone single binary runner, in
>>> time.
>>> > >>> ++Many opportunities for contribution after getting into the
>>> repo!++
>>> > >>>
>>> > >>> Lowlights:
>>> > >>> Only for Go SDK, for now.
>>> > >>> ~~Many unimplemented features~~
>>> > >>>
>>> > >>> Where to start reading?
>>> > >>>
>>> > >>> Vision README:
>>> > >>>
>>> https://github.com/apache/beam/blob/9044f2d4ae151f4222a2f3e0a3264c1198040181/sdks/go/pkg/beam/runners/prism/README.md
>>> > >>>
>>> > >>>
>>> > >>> Code Structure README:
>>> > >>>
>>> https://github.com/apache/beam/blob/9044f2d4ae151f4222a2f3e0a3264c1198040181/sdks/go/pkg/beam/runners/prism/internal/README.md
>>> > >>>
>>> > >>>
>>> > >>> executePipeline entrypoint:
>>> > >>>
>>> https://github.com/apache/beam/blob/9044f2d4ae151f4222a2f3e0a3264c1198040181/sdks/go/pkg/beam/runners/prism/internal/execute.go#L41
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>> == The long version ==
>>> > >>>
>>> > >>> Since last year, I was puttering away at making a Portable Beam
>>> Runner
>>> > >>> authored in Go. Partly because I wanted to learn the "runner" half
>>> of beam,
>>> > >>> and partly because the Go Direct Runner (and most other direct
>>> runners),
>>> > >>> are not good at testing.
>>> > >>>
>>> > >>> I managed to get it roughly ready for basic batch execution by end
>>> of
>>> > >>> February 2022 , and then 2022 got away from me. And I couldn't
>>> pick it up
>>> > >>> until the end of the year.
>>> > >>>
>>> > >>> I gave a talk about this at Beam Summit 2022
>>> > >>> https://2022.beamsummit.org/sessions/portable-go-beam-runner/ that
>>> > >>> covers my motivation for it. Loosely, Beam has a Testing Problem.
>>> There are
>>> > >>> large parts of Beam execution that matter for real world
>>> performance and
>>> > >>> correctness, but the facilities to test these don't exist.  For
>>> example,
>>> > >>> take Combiner Lifting, if a combiner is unlifted, but implements
>>> > >>> AddInput... then Merge is never called, leaving it untested. And
>>> the user
>>> > >>> has no control over this, or may not even be aware of it. How a
>>> DoFn is
>>> > >>> executed matters for coverage, and user confidence.  In particular
>>> for
>>> > >>> Streaming jobs, users will tend to try things out on their Prod
>>> runner, but
>>> > >>> that doesn't help if one is testing on local Flink, but executing
>>> on Google
>>> > >>> Cloud Dataflow, which behave very differently.
>>> > >>>
>>> > >>> Regardless of whether you agree with that thesis...  I wanted to
>>> fill
>>> > >>> that gap. I wanted a runner that could be configured to test those
>>> > >>> situations, and in particular, make it easier to develop SDKs and
>>> all the
>>> > >>> features of Beam that don't get their own blog posts.
>>> > >>>
>>> > >>> Especially for the Go SDK. Java, being the oldest, has arguably
>>> the only
>>> > >>> "correct" beam runner, in the form of the Java Direct Runner. But
>>> one can't
>>> > >>> execute Go pipelines on that. Python has a portable execution of
>>> its
>>> > >>> runner, but the current state of python is Parallelism hostile at
>>> best. It
>>> > >>> supports a great many things, like Cross Language, but can't
>>> support
>>> > >>> streaming execution (ProcessContinations etc) at present. Also,
>>> being a
>>> > >>> large Python program, it's harder to follow.  The Java Direct
>>> runner, while
>>> > >>> being slightly easier to follow, doesn't have a clear execution
>>> flow.
>>> > >>> Neither of them are particularly easy for Non Language Experts to
>>> stand up
>>> > >>> and use, especially outside of the Beam repo.
>>> > >>>
>>> > >>> The Go SDK's Direct Runner has many flaws, most of which are due to
>>> > >>> Direct execution, rather than Portable Execution.  Implementing
>>> features
>>> > >>> largely meant hacking certain things in, so they would be able to
>>> be
>>> > >>> executed. This also made supporting and testing Cross Language
>>> Transforms,
>>> > >>> State and Timers in Go pipelines a non-starter for users. And
>>> that's just
>>> > >>> the tip.
>>> > >>>
>>> > >>> So I wanted something better. I mentioned it a few times to
>>> others, but
>>> > >>> I kept hearing the same refrain: "I want something that does
>>> that". Or at
>>> > >>> least they wanted something simpler to understand to hack against
>>> > >>> themselves.
>>> > >>>
>>> > >>> I added more tests, and implemented more features, filed a tracking
>>> > >>> issue (https://github.com/apache/beam/issues/24789), re-wrote the
>>> whole
>>> > >>> engine from scratch to actually support watermarks, and
>>> terminating Process
>>> > >>> Continuations. In the process I found and fixed some bugs in the
>>> Go SDK
>>> > >>> too. Many conversations were happening so I could understand how
>>> things are
>>> > >>> supposed to work.
>>> > >>>
>>> > >>> At some point, chatting with Jack McCluskey (@jrmccluskey) we came
>>> up
>>> > >>> with the name Prism (among many many other contenders). I wanted
>>> the runner
>>> > >>> to be able to split up and examine the different components of
>>> Beam and
>>> > >>> combine them in different ways, and Prisms do that for Beams of
>>> light. The
>>> > >>> name stuck as the best one. Formally, it will be the Apache Beam
>>> Prism
>>> > >>> Runner. It can always be argued about
>>> > >>>
>>> > >>> And then I had to stop and clean it up. If I kept going.
>>> eventually it
>>> > >>> becomes too big to review. I've spent the last month, reorganizing
>>> the code
>>> > >>> ~6600+ lines of code and comments. I hope it's clear enough for
>>> others to
>>> > >>> follow at this point, and the initial review PRs will help keep
>>> things
>>> > >>> small.
>>> > >>>
>>> > >>> My expectations for now are to send out this email and have people
>>> take
>>> > >>> a first look at the BIG has everything PR:
>>> > >>> https://github.com/apache/beam/pull/25391. That branch will be
>>> > >>> canonical for other changes until everything is in the beam repo.
>>> It will
>>> > >>> be kept up to date with changes in the smaller PRs.
>>> > >>>
>>> > >>> If you'd like a place to start, I recommend the main README.md
>>> which has
>>> > >>> the rationale and goals.
>>> > >>>
>>> > >>>
>>> https://github.com/apache/beam/blob/9044f2d4ae151f4222a2f3e0a3264c1198040181/sdks/go/pkg/beam/runners/prism/README.md
>>> > >>>
>>> > >>>
>>> > >>> Follow that with the structure README.md in the internal directory.
>>> > >>>
>>> > >>>
>>> https://github.com/apache/beam/blob/9044f2d4ae151f4222a2f3e0a3264c1198040181/sdks/go/pkg/beam/runners/prism/internal/README.md
>>> > >>>
>>> > >>> Finally, see how the (post job submission) part of how a job
>>> executes,
>>> > >>> starting with the executePipeline function:
>>> > >>>
>>> https://github.com/apache/beam/blob/9044f2d4ae151f4222a2f3e0a3264c1198040181/sdks/go/pkg/beam/runners/prism/internal/execute.go#L41
>>> > >>>
>>> > >>>
>>> > >>> In the meantime, as the code is going out for review, I'll be
>>> improving
>>> > >>> Unit Test coverage of the sub packages in the structure. When it
>>> was still
>>> > >>> a mono-package though, ~85% coverage was achieved via the test
>>> pipelines.
>>> > >>>
>>> > >>> In the medium term, I'd like to get it working standalone, so any
>>> user
>>> > >>> with a Go install can get a working job runner with a quick `go
>>> install
>>> > >>> github.com/apache/beam/sdks/go/cmd/prism; prism`, and have it work
>>> > >>> indefinitely with tiny scale streaming pipelines.
>>> > >>>
>>> > >>> Longer term, I'd like to get the Java Validates Runner tests
>>> executing
>>> > >>> on it, which will properly validate correctness of details I'm not
>>> aware
>>> > >>> of, and are not covered in the Go SDK integration tests.
>>> > >>>
>>> > >>> As stated, the primary purpose is to simplify testing of Beam
>>> pipelines
>>> > >>> (especially for the Go SDK) and SDK development. As long as it can
>>> be used
>>> > >>> for that, it can also be expanded to do more in time. It's not
>>> expected to
>>> > >>> become "the best" runner, let alone a distributed runner.
>>> > >>>
>>> > >>> I look forward to your thoughts!
>>> > >>>
>>> > >>> Robert Burke
>>> > >>> Beam Go Busy Body
>>> > >>>
>>> > >>>
>>> >
>>>
>>

Re: [Go SDK] Direct Runner Replacement: Prism

Reply via email to