Re: Go SDK status

Robert Burke Mon, 15 Apr 2019 16:19:43 -0700

Hi Thomas! I'm so glad you asked!

The status of the Go SDK is complicated, so this email can't be brief.
There are several dimensions to consider: the SDK as a Go Open Source Project,
its User Libraries and Experience, and its Beam Features.

I'm going to be updating the roadmap later this month when I have a spare
moment.

*tl;dr;*
I would *love* help in improving the Go SDK, especially around interactions
with Java/Python/Flink. Java and I do not have a good working relationship
for operational purposes, and the last time I used Python, I had to
re-image my machine. There's lots to do, but shouting out tasks to the void
is rarely as productive as it is cathartic. If there's an offer to help,
and a preference for/experience with something to work on, I'm willing to
find something useful to get started on for you.

(Note: The following are simply my opinion as someone who works with the
project weekly as a Go programmer, and should not be treated as demands or
gospel. I just don't have anyone to talk about Go SDK issues with, and my
previous discussions have largely seemed to fall on uninterested ears.)

*The SDK can be considered Alpha when all of the following are true:*
* The SDK is tested by the Beam project on a ULR and on Flink as well as
Dataflow.
* The IOs have received some love to ensure they can scale (either through
SDF or reshuffles), and be portable to different environments (eg. using
the Go Cloud Development Kit (CDK) libraries).
* Cross-Language IO support would also be acceptable.
* The SDK is using Go Modules for dependency management, marking it as
version 0.Minor (where Minor should probably track the mainline Beam minor
version for now).

*We can move to calling it Beta when all of the following are true:*
* All implemented Beam features are meaningfully tested on the portable
runners (eg. a proper "Validates Runner" suite exists in Go)
* The SDK is properly documented on the Beam site, and in its Go Docs.

After this, I'll be more comfortable recommending it as something folks can
use for production.
That said, there are happy paths that are useable today in batch situations.

*Intro*
The Go SDK is a purely Beam Portable SDK. If it runs on a distributed
system at all, it's being run portably. Currently it's regularly tested on
Google Cloud Dataflow (though Dataflow doesn't officially support the SDK
at this time), and on its own single bundle Direct Runner (intended for
unit testing purposes). In addition, it's being tested at scale within
Google, on an internal runner, where it presently satisfies our performance
benchmarks, and correctness tests.

I've been working on cases to make the SDK suitable for data processing
within Google. This unfortunately makes my contributions more towards
general SDK usability, documentation, and performance, rather than "making
it usable outside Google". Note this also precludes necessary work to
resolve issues with running Go SDK pipelines on Google Cloud Dataflow. I
believe that the SDK must become a good member of both the Go ecosystem and
the Beam ecosystem.

Improved Go Docs are on their way, and Daniel Oliviera has been helping me
make the "getting started" experience better by improving pipeline
construction time error messages.

Finally, many of the following issues have JIRAs already, though some don't.
It would take me time I don't have to audit and line everything up for this
email, so please look before you file JIRAs for things mentioned below, should
the urge strike you.

*As a Go Open Source Project*
As an open source project written in Go, the SDK is lagging on adopting Go
Modules for Dependency Management and Versioning.

Using Go Modules would ensure that what the Beam project infrastructure is
testing is what users are getting. I'm very happy to elaborate on this, and
wrote a bit on the topic two months ago [1]. But I loathe sending out plans
for things that I don't have time to work on, so it's only coming to light
now.

The short points are:
* Go is opinionated about versioning since Go 1.11, when Modules were
introduced. They allow for reproducible builds with versioned deps,
supported by the Go language tools.
* Packages at v1 and greater are beholden to not make breaking changes. We're
not there with the SDK yet (certainly not a 2.11 product), so IMO the SDK
should be considered v0.X.
* I don't think it's reasonable to move SDK languages in lockstep with the
project. Eg. The Go language is considering adopting Generics, which may
necessitate a Major Version Change to the SDK user surface as it's modified
to support them. It's not reasonable to move all of beam to a new version
due to a single language surface.
* This isn't an issue since it reads: the Go SDK version X, runs against
portable beam runners at version Y.
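To make the v0.X point concrete, here is a minimal sketch in plain Go of the rule that the "no breaking changes" expectation only kicks in at major version 1. The helper names `majorVersion` and `promisesCompatibility` are hypothetical and exist only for this illustration; they are not part of the Go toolchain or the Beam SDK.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// majorVersion extracts the major component from a module version string
// such as "v0.12.0" or "v2.11.0". Hypothetical helper for illustration.
func majorVersion(v string) (int, error) {
	if !strings.HasPrefix(v, "v") {
		return 0, fmt.Errorf("version %q must start with 'v'", v)
	}
	parts := strings.SplitN(strings.TrimPrefix(v, "v"), ".", 2)
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, fmt.Errorf("version %q has a non-numeric major part: %v", v, err)
	}
	return major, nil
}

// promisesCompatibility reports whether a version falls under the
// "no breaking changes" expectation, which applies from v1 onward.
func promisesCompatibility(v string) (bool, error) {
	major, err := majorVersion(v)
	if err != nil {
		return false, err
	}
	return major >= 1, nil
}

func main() {
	for _, v := range []string{"v0.12.0", "v1.0.0", "v2.11.0"} {
		ok, err := promisesCompatibility(v)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s promises compatibility: %t\n", v, ok)
	}
}
```

Under this reading, tagging the SDK as v2.11 would claim a stability promise it can't keep yet, while v0.X makes no such claim.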

See a recent email discussion thread [2] for other factors relating to
Gradle.

*User Libraries (IOs, Transforms)*
There's a lack of testing around the IOs and Transforms in the SDK. In some
cases, not even unit tests. Very little time has been spent by anyone to
bring these to production quality.

*The best route to production IOs right now would be to work on Cross
Language IO support with the Go SDK. I imagine it would be similar to what
Python is doing.*

The Bounded IOs that exist are largely "toys" not written for serious
production use. For Bounded cases, this is largely due to the lack of SDF
or using reshuffle judiciously, or leveraging other known patterns to
scalably read data. You'll note they aren't meaningfully tested anywhere
either.
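As an illustration of one such pattern, a bounded source can be cut into independent ranges up front, so that a reshuffle can spread the ranges across workers instead of one worker reading everything. The sketch below is plain Go, not the Beam API and not how the existing IOs work; `byteRange` and `splitRange` are hypothetical names.

```go
package main

import "fmt"

// byteRange is a half-open interval [Start, End) of offsets in a source.
type byteRange struct {
	Start, End int64
}

// splitRange cuts [0, size) into consecutive chunks of at most chunk bytes.
// Each chunk can then be read independently of the others.
func splitRange(size, chunk int64) []byteRange {
	if size <= 0 || chunk <= 0 {
		return nil
	}
	var out []byteRange
	for start := int64(0); start < size; start += chunk {
		end := start + chunk
		if end > size {
			end = size // the last chunk may be shorter
		}
		out = append(out, byteRange{Start: start, End: end})
	}
	return out
}

func main() {
	for _, r := range splitRange(250, 100) {
		fmt.Printf("[%d, %d)\n", r.Start, r.End)
	}
}
```

For a 250-byte source and 100-byte chunks this yields [0, 100), [100, 200) and [200, 250). An SDF would go further by letting the runner split ranges dynamically while they are being read.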

For Unbounded IOs, there's only 1 presently, and that's the Google Cloud
PubSub IO. It's not portable. It can't be portable until we've implemented
State+Timers, or SDFs. At present, it only works on Dataflow, and does so
with runner substitution. As such, it uses the same pubsub connector that
Streaming Dataflow jobs use. Interestingly, this means it can scale
properly, and is technically the only one that can scale properly.
Unfortunately, it only works on Dataflow.

My work on using the Beam Go SDK inside Google uses a variant of Cross
Language IO. This is one reason why I haven't spent any time on the IOs,
because they aren't necessary inside Google, and there's not been a usecase
I could contrive to spend the time to fix them up so far.

*General SDK Code Quality*
In my opinion the SDK is presently reasonable on general code quality. Most
critical aspects have tests, and from Google internal testing on complex
and large amounts of data, the SDK is performant, once a bit of code
generation is done to avoid reflection on the hot path.

Various combinations of features should be vetted together better. Eg.
Using composites wrapping various other beam primitives. This was an issue
resolved recently for CoGBKs.

*Beam Features*
The SDK is largely usable for Batch Pipelines. I know this since that's
what I'm ensuring is the case for a Google internal runner. I know the
following "classes of feature" work for the batch use cases, to varying
levels of documentation and testing.
* DoFns
* CombineFns
* Combiner Lifting
* CoGroupByKey (Joins)
* Side Inputs
* User Defined Coders
* Global Windows
* User Metrics (though they need to move to the new beam Metrics protos)
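Of the items above, Combiner Lifting is probably the least self-explanatory: the combine's "add input" step is applied to each bundle before the shuffle, and only the partial accumulators are merged afterwards. A minimal sketch of the idea in plain Go, computing a mean; this is not the SDK's API or implementation, and all names here are hypothetical.

```go
package main

import "fmt"

// meanAccum is a partial aggregate: enough state to merge with others.
type meanAccum struct {
	Sum   float64
	Count int64
}

// addInput folds one element into an accumulator (pre-shuffle step).
func addInput(a meanAccum, v float64) meanAccum {
	return meanAccum{Sum: a.Sum + v, Count: a.Count + 1}
}

// mergeAccumulators combines partial aggregates from different bundles
// (post-shuffle step).
func mergeAccumulators(accs ...meanAccum) meanAccum {
	var out meanAccum
	for _, a := range accs {
		out.Sum += a.Sum
		out.Count += a.Count
	}
	return out
}

// extractOutput turns the final accumulator into the result.
func extractOutput(a meanAccum) float64 {
	if a.Count == 0 {
		return 0
	}
	return a.Sum / float64(a.Count)
}

// liftedMean pre-combines each bundle locally, then merges the partials.
func liftedMean(bundles [][]float64) float64 {
	partials := make([]meanAccum, 0, len(bundles))
	for _, b := range bundles {
		var acc meanAccum
		for _, v := range b {
			acc = addInput(acc, v)
		}
		partials = append(partials, acc)
	}
	return extractOutput(mergeAccumulators(partials...))
}

func main() {
	bundles := [][]float64{{1, 2, 3}, {4, 5}, {6}}
	fmt.Println(liftedMean(bundles)) // prints 3.5
}
```

Only three small accumulators cross the shuffle here instead of six elements, which is where the savings come from on real data.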

Streaming is another story. The following aren't implemented:
* State + Timers + Triggers
  * Necessary for portable pubsub IOs for example.
* SDFs aren't implemented yet
  * Necessary for
* Windows
  * Session Windows
  * Custom WindowFns

I haven't run anything in streaming mode, so there are likely other
features and considerations I'm missing.

The following are implemented but not meaningfully tested:
* Windows
  * Fixed Windowing
  * Sliding Windows
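For anyone who wants to write tests for these, the arithmetic behind both window types is small. The sketch below is plain Go and is not the SDK's implementation; the `window` type and both function names are hypothetical, and timestamps are assumed to be non-negative milliseconds since the epoch.

```go
package main

import "fmt"

// window is a half-open interval [Start, End) in milliseconds.
type window struct {
	Start, End int64
}

// fixedWindow returns the single fixed window of the given size that
// contains ts. Assumes ts >= 0 and size > 0.
func fixedWindow(ts, size int64) window {
	start := ts - ts%size
	return window{Start: start, End: start + size}
}

// slidingWindows returns every window of the given size, starting at
// multiples of period, that contains ts. Assumes ts >= 0, size > 0 and
// period > 0. Windows are returned latest start first.
func slidingWindows(ts, size, period int64) []window {
	var out []window
	lastStart := ts - ts%period
	for start := lastStart; start > ts-size; start -= period {
		out = append(out, window{Start: start, End: start + size})
	}
	return out
}

func main() {
	fmt.Println(fixedWindow(25, 10))
	for _, w := range slidingWindows(25, 30, 10) {
		fmt.Println(w)
	}
}
```

With a size of 30 and a period of 10, the element at timestamp 25 lands in three windows: [20, 50), [10, 40) and [0, 30). A test suite could assert exactly this kind of membership against the SDK's real windowing.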

Others, like Large Iterables support or Schemas, are not yet implemented
either. There are likely more, but I'd need to list everything from the
compatibility matrix.

*What I'm spending my time on*
Documenting, and debugging Google internal user issues. The following
artifacts will be produced externally in the next few months:
* Improved user documentation/programming on the Go SDK (targeted to folks
who know Go, but not Beam, or any distributed programming).
* An SDK contribution guide to be put on the Wiki, focusing on "Life of a
Pipeline" from the user controller to the worker perspective, and on where
each of those parts maps to the SDK code that deals with it.
This should enable others to contribute Beam features to the SDK.
* The Versioning Issue mentioned above, it's finicky.
* Large (State Backed) Iterable Support

*What I'd love help with*
1. Getting the existing suite of SDK integration tests running against a
ULR or Flink (there are JIRAs for these).
2. Improving existing IOs, adding tests for existing features over adding
new ones.
a) Migrate the existing IOs to use the Go CDK where possible (needs to
wait for the Versioning/GoModules/Gradle issue to be resolved though).

Your friendly neighbourhood Distributed Gopher Wrangler,
Robert Burke (@lostluck)

[1]
https://docs.google.com/document/d/1nB5qCarN0jmo40zH1J0icZa6Wyb0v4u08AdG4WDTTEY/edit

[2]
https://lists.apache.org/thread.html/8952f546b449ce8682db221e7688db546e25145c31cd835ed88ad172@%3Cdev.beam.apache.org%3E

On Sat, 13 Apr 2019 at 11:30, Thomas Weise <t...@apache.org> wrote:

> How "experimental" is the Go SDK? What are the major work items to reach
> MVP? How close are we to be able to run let's say wordcount on the portable
> Flink runner?
>
> How current is the roadmap [1]? JIRA [2] could suggest that there is a lot
> of work left to do?
>
> Thanks,
> Thomas
>
> [1] https://beam.apache.org/roadmap/go-sdk/
> [2]
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20and%20component%20%3D%20sdk-go%20and%20resolution%20%3D%20Unresolved%20
>
>
