Hi Robert,
Thanks a bunch for providing this comprehensive update. This is exactly
the kind of perspective I was looking for, even though overall it means
that, for potential users, the Go SDK is at an even earlier stage than I
had hoped.
For more context, my interest was primarily in the streaming side. Of
the missing features you listed, State + Timers + Triggers would
probably be the highest priority. Unfortunately I won't be able to
contribute to the Go SDK anytime soon, so this is mostly FYI in case
anyone else does.
On improving the IOs, I think it would make a lot of sense to focus on
the cross-language route. There has been some work lately to make
existing Beam Java IOs available on the Flink runner (Max would be able
to share more details on that).
Thanks!
Thomas
On Wed, Apr 17, 2019 at 9:56 PM Robert Burke wrote:
Oh dang. Thanks for mentioning that! Here's an open copy of the
versioning thoughts doc, though there shouldn't be any surprises
from the points I mentioned above.
https://docs.google.com/document/d/1ZjP30zNLWTu_WzkWbgY8F_ZXlA_OWAobAD9PuohJxPg/edit#heading=h.drpipq762xi7
On Wed, 17 Apr 2019 at 21:20, Nathan Fisher wrote:
Hi Robert,
Great summary on the current state of play. FYI the referenced G
doc doesn't appear to be visible to people outside the org by default.
Great to hear the Go SDK is still getting love. I last looked at
it in September-October of last year.
Cheers,
Nathan
On Wed, 17 Apr 2019 at 20:27, Lukasz Cwik wrote:
Thanks for the in-depth summary.
On Mon, Apr 15, 2019 at 4:19 PM Robert Burke wrote:
Hi Thomas! I'm so glad you asked!
The status of the Go SDK is complicated, so this email
can't be brief. There are several dimensions to
consider: as a Go Open Source Project, User Libraries
and Experience, and Beam Features.
I'm going to be updating the roadmap later this month
when I have a spare moment.
*tl;dr;*
I would *love* help in improving the Go SDK, especially
around interactions with Java/Python/Flink. Java and I
do not have a good working relationship for operational
purposes, and the last time I used Python, I had to
re-image my machine. There's lots to do, but shouting
out tasks to the void is rarely as productive as it is
cathartic. If there's an offer to help, and a preference
for/experience with something to work on, I'm willing
to find something useful to get started on for you.
(Note: The following are simply my opinion as someone
who works with the project weekly as a Go programmer,
and should not be treated as demands or gospel. I just
don't have anyone to talk about Go SDK issues with, and
my previous discussions have largely seemed to fall on
uninterested ears.)
*The SDK can be considered Alpha when all of the
following are true:*
* The SDK is tested by the Beam project on a ULR and on
Flink as well as Dataflow.
* The IOs have received some love to ensure they can
scale (either through SDF or reshuffles), and be
portable to different environments (eg. using the Go
Cloud Development Kit (CDK) libraries).
* Cross-Language IO support would also be acceptable.
* The SDK is using Go Modules for dependency management,
marking it as version 0.Minor (where Minor should
probably track the mainline Beam minor version for now).
*We can move to calling it Beta when all of the
following are true:*
* All implemented Beam features are meaningfully
tested on the portable runners (eg. a proper "Validates
Runner" suite exists in Go)
* The SDK is properly documented on the Beam site, and
in its Go Docs.
After this, I'll be more comfortable recommending it as
something folks can use for production.
That said, there are happy paths that are useable today
in batch situations.
*Intro*
The Go SDK is a purely Beam Portable SDK. If it runs on
a distributed system at all, it's being run portably.
Currently it's regularly tested on Google Cloud Dataflow
(though Dataflow doesn't officially support the SDK at
this time), and on its own single-bundle Direct Runner
(intended for unit testing purposes). In addition, it's
being tested at scale within Google, on an internal
runner, where it presently satisfies our performance
benchmarks, and correctness tests.
I've been working on cases to make the SDK suitable for
data processing within Google. This unfortunately makes
my contributions more towards general SDK usability,
documentation, and performance, rather than "making it
usable outside Google". Note this also precludes
necessary work to resolve issues with running Go SDK
pipelines on Google Cloud Dataflow. I believe that the
SDK must become a good member of both the Go ecosystem and
the Beam ecosystem.
Improved Go Docs are on their way, and Daniel Oliviera
has been helping me make the "getting started"
experience better by improving pipeline construction
time error messages.
Finally, many of the following issues have JIRAs already;
some don't. It would take time I don't have to audit
and line everything up for this email, so please look
before you file JIRAs for things mentioned below, should
the urge strike you.
*As a Go Open Source Project*
As an open source project written in Go, the SDK is
lagging on adopting Go Modules for Dependency Management
and Versioning.
Using Go Modules would ensure that what the Beam
project infrastructure is testing is what users are
getting. I'm very happy to elaborate on this, and have
a doc I wrote on the topic two months ago [1].
But I loathe sending out plans for things that I don't
have time to work on, so it's only coming to light now.
The short points are:
* Go is opinionated about versioning since Go 1.11, when
Modules were introduced. They allow for reproducible
builds with versioned deps, supported by the Go language
tools.
* Packages at v1 and greater are beholden to not make breaking
changes. We're not there with the SDK yet (certainly
not a 2.11 product), so IMO the SDK should be considered
v0.X
* I don't think it's reasonable to move SDK languages in
lockstep with the project. Eg. The Go language is
considering adopting Generics, which may necessitate a
Major Version Change to the SDK user surface as it's
modified to support them. It's not reasonable to move
all of Beam to a new version due to a single language
surface.
* This isn't an issue, since it reads as: the Go SDK at
version X runs against portable Beam runners at version Y.
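To make the v0.X idea concrete, here is a minimal sketch in plain Go of the scheme suggested earlier in this email (major pinned to 0, minor tracking the mainline Beam minor version). The function name `goSDKVersion` is hypothetical, and carrying the patch number over unchanged is an assumption for illustration, not something the proposal specifies.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// goSDKVersion maps a Beam release such as "2.11.0" to a hypothetical
// Go module version "v0.11.0": the major version is pinned to 0 and the
// minor tracks Beam's. Keeping the patch number is an assumption.
func goSDKVersion(beamRelease string) (string, error) {
	parts := strings.Split(beamRelease, ".")
	if len(parts) != 3 {
		return "", fmt.Errorf("want MAJOR.MINOR.PATCH, got %q", beamRelease)
	}
	nums := make([]int, 3)
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return "", fmt.Errorf("bad version component %q in %q", p, beamRelease)
		}
		nums[i] = n
	}
	return fmt.Sprintf("v0.%d.%d", nums[1], nums[2]), nil
}

func main() {
	v, err := goSDKVersion("2.11.0")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("Go SDK", v, "runs against portable Beam runners at 2.11.0")
}
```

Under this sketch the SDK's own version stays below v1, so breaking changes to the user surface remain allowed, while the minor number still tells users which Beam release it was tested against.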
See a recent email discussion thread [2] for other
factors relating to Gradle.
*User Libraries (IOs, Transforms)*
There's a lack of testing around the IOs and Transforms
in the SDK. In some cases, not even unit tests. Very
little time has been spent by anyone to bring these to
production quality.
/The best route to production IOs right now would be to
work on Cross Language IO support with the Go SDK. I
imagine it would be similar to what Python is doing./
The Bounded IOs that exist are largely "toys", not
written for serious production use. For Bounded cases,
this is largely because they don't use SDF, use
reshuffle judiciously, or leverage other known
patterns to scalably read data. You'll note they aren't
meaningfully tested anywhere either.
For Unbounded IOs, there's only 1 presently, and that's
the Google Cloud PubSub IO. It's not portable. It can't
be portable until we've implemented State+Timers, or
SDFs. At present, it only works on Dataflow, and does so
with runner substitution. As such, it uses the same
pubsub connector that Streaming Dataflow jobs use.
Interestingly, this means it can scale properly, and is
technically the only one that can scale properly.
Unfortunately, it only works on Dataflow.
My work on using the Beam Go SDK inside Google uses a
variant of Cross Language IO. This is one reason why I
haven't spent any time on the IOs: they aren't
necessary inside Google, and there's not been a use case
I could contrive to spend the time to fix them up so far.
*General SDK Code Quality*
In my opinion the SDK is presently reasonable on general
code quality. Most critical aspects have tests, and from
Google internal testing on complex and large amounts of
data, the SDK is performant, once a few bits of code
generation are done to avoid reflection on the hot path.
Various combinations of features should be vetted
together better. Eg. Using composites wrapping various
other beam primitives. This was an issue resolved
recently for CoGBKs.
*Beam Features*
The SDK is largely usable for Batch Pipelines. I know
this since that's what I'm ensuring is the case for a
Google internal runner. I know the following "classes of
feature" work for the batch use cases, to varying levels
of documentation and testing.
* DoFns
* CombineFns
* Combiner Lifting
* CoGroupByKey (Joins)
* Side Inputs
* User Defined Coders
* Global Windows
* User Metrics (though they need to move to the new beam
Metrics protos)
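For readers unfamiliar with the term, the idea behind Combiner Lifting from the list above can be illustrated in plain Go. This is only a sketch of the concept, not the Beam Go SDK API: the type `kv` and the functions `partialSums` and `mergeSums` are hypothetical, and "bundles" are modelled as plain slices.

```go
package main

import "fmt"

// kv is a hypothetical key/value element; it is not a Beam type.
type kv struct {
	Key   string
	Value int
}

// partialSums pre-combines one bundle locally, before any shuffle,
// so only one accumulator per key leaves the worker.
func partialSums(bundle []kv) map[string]int {
	acc := make(map[string]int)
	for _, e := range bundle {
		acc[e.Key] += e.Value
	}
	return acc
}

// mergeSums merges the per-bundle accumulators after grouping by key.
func mergeSums(partials ...map[string]int) map[string]int {
	out := make(map[string]int)
	for _, p := range partials {
		for k, v := range p {
			out[k] += v
		}
	}
	return out
}

func main() {
	b1 := []kv{{"a", 1}, {"b", 2}, {"a", 3}}
	b2 := []kv{{"a", 10}, {"c", 5}}
	total := mergeSums(partialSums(b1), partialSums(b2))
	fmt.Println(total["a"], total["b"], total["c"]) // prints "14 2 5"
}
```

The point of the split is that the first bundle ships two accumulators instead of three elements; on real data with many repeated keys the reduction in shuffled data is much larger.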
Streaming is another story. The following aren't implemented:
* State + Timers + Triggers
  * Necessary for portable pubsub IOs for example.
* SDFs aren't implemented yet
  * Necessary for
* Windows
  * Session Windows
  * Custom WindowFns
I haven't run anything in streaming mode, so there are
likely other features and considerations I'm missing.
The following are implemented but not meaningfully tested:
* Windows
  * Fixed Windowing
  * Sliding Windows
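As a rough illustration of what fixed and sliding windowing mean for a single element, here is a plain-Go sketch of window assignment. It is not the Beam Go SDK API: the functions `fixedStart` and `slidingStarts` are hypothetical, timestamps are plain integer seconds, and windows are assumed to be half-open intervals aligned to multiples of the size or period.

```go
package main

import "fmt"

// mod returns a non-negative remainder, so negative timestamps also work.
func mod(a, b int64) int64 {
	m := a % b
	if m < 0 {
		m += b
	}
	return m
}

// fixedStart returns the start of the one fixed window
// [start, start+size) that contains ts.
func fixedStart(ts, size int64) int64 {
	return ts - mod(ts, size)
}

// slidingStarts returns the start of every sliding window
// [s, s+size) containing ts, where a new window begins at each
// multiple of period. Starts are listed from latest to earliest.
func slidingStarts(ts, size, period int64) []int64 {
	var starts []int64
	latest := ts - mod(ts, period)
	for s := latest; s > ts-size; s -= period {
		starts = append(starts, s)
	}
	return starts
}

func main() {
	fmt.Println(fixedStart(65, 60))        // prints "60"
	fmt.Println(slidingStarts(65, 60, 30)) // prints "[60 30]"
}
```

An element at second 65 falls into exactly one 60-second fixed window, but into two 60-second windows that slide every 30 seconds, which is why sliding windows multiply the data a runner has to group.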
Others, like Large Iterables support or Schemas, are not
yet implemented either. There are likely more, but I'd
need to list everything from the compatibility matrix.
*What I'm spending my time on*
Documenting, and debugging Google-internal user issues.
The following artifacts will be produced externally in
the next few months:
* Improved user documentation/programming on the Go SDK
(targeted to folks who know Go, but not Beam, or any
distributed programming).
* An SDK contribution guide to be put on the Wiki,
focusing on the "Life of a Pipeline" from the user
controller to the worker perspective, and mapping each of
those parts to where the SDK deals with them.
This should enable others to contribute Beam
features to the SDK.
* The Versioning Issue mentioned above, it's finicky.
* Large (State Backed) Iterable Support
*What I'd love help with*
1. Getting the existing suite of SDK integration tests
running against a ULR or Flink (there are JIRAs for these).
2. Improving existing IOs, adding tests for existing
features over adding new ones.
a) Migrate the existing IOs to use the Go CDK where
possible (needs to wait for the
Versioning/GoModules/Gradle issue to be resolved though).
Your friendly neighbourhood Distributed Gopher Wrangler,
Robert Burke (@lostluck)
[1]
https://docs.google.com/document/d/1nB5qCarN0jmo40zH1J0icZa6Wyb0v4u08AdG4WDTTEY/edit
[2]
https://lists.apache.org/thread.html/8952f546b449ce8682db221e7688db546e25145c31cd835ed88ad172@%3Cdev.beam.apache.org%3E
On Sat, 13 Apr 2019 at 11:30, Thomas Weise wrote:
How "experimental" is the Go SDK? What are the major
work items to reach MVP? How close are we to being able
to run, let's say, wordcount on the portable Flink runner?
How current is the roadmap [1]? JIRA [2] seems to
suggest that there is a lot of work left to do.
Thanks,
Thomas
[1] https://beam.apache.org/roadmap/go-sdk/
[2]
https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20and%20component%20%3D%20sdk-go%20and%20resolution%20%3D%20Unresolved%20
--
Nathan Fisher
w: http://junctionbox.ca/