Hi Thomas! I'm so glad you asked!

The status of the Go SDK is complicated, so this email can't be brief. There are several dimensions to consider: the SDK as a Go Open Source Project, its User Libraries and Experience, and its Beam Features.
I'm going to be updating the roadmap later this month when I have a spare moment.

*tl;dr:* I would *love* help in improving the Go SDK, especially around interactions with Java/Python/Flink. Java and I do not have a good working relationship for operational purposes, and the last time I used Python, I had to re-image my machine. There's lots to do, but shouting out tasks to the void is rarely as productive as it is cathartic. If there's an offer to help, and a preference for or experience with something to work on, I'm willing to find something useful for you to get started on.

(Note: The following is simply my opinion as someone who works with the project weekly as a Go programmer, and should not be treated as demands or gospel. I just don't have anyone to talk about Go SDK issues with, and my previous discussions have largely seemed to fall on uninterested ears.)

*The SDK can be considered Alpha when all of the following are true:*

* The SDK is tested by the Beam project on a ULR and on Flink as well as Dataflow.
* The IOs have received some love to ensure they can scale (either through SDF or reshuffles), and be portable to different environments (e.g. using the Go Cloud Development Kit (CDK) libraries).
  * Cross-Language IO support would also be acceptable.
* The SDK is using Go Modules for dependency management, marking it as version 0.Minor (where Minor should probably track the mainline Beam minor version for now).

*We can move to calling it Beta when all of the following are true:*

* All implemented Beam features are meaningfully tested on the portable runners (e.g. a proper "Validates Runner" suite exists in Go).
* The SDK is properly documented on the Beam site, and in its Go Docs.

After this, I'll be more comfortable recommending it as something folks can use for production. That said, there are happy paths that are usable today in batch situations.

*Intro*

The Go SDK is a purely Beam Portable SDK.
If it runs on a distributed system at all, it's being run portably. Currently it's regularly tested on Google Cloud Dataflow (though Dataflow doesn't officially support the SDK at this time), and on its own single bundle Direct Runner (intended for unit testing purposes). In addition, it's being tested at scale within Google, on an internal runner, where it presently satisfies our performance benchmarks and correctness tests.

I've been working on cases to make the SDK suitable for data processing within Google. This unfortunately skews my contributions towards general SDK usability, documentation, and performance, rather than "making it usable outside Google". Note this also precludes necessary work to resolve issues with running Go SDK pipelines on Google Cloud Dataflow. I believe that the SDK must become a good member of both the Go ecosystem and the Beam ecosystem. Improved Go Docs are on their way, and Daniel Oliviera has been helping me make the "getting started" experience better by improving pipeline construction time error messages.

Finally, many of the following issues have JIRAs already; some don't. It would take me time I don't have to audit and line everything up for this email, so please look before you file JIRAs for things mentioned below, should the urge strike you.

*As a Go Open Source Project*

As an open source project written in Go, the SDK is lagging on adopting Go Modules for dependency management and versioning. Using Go Modules would ensure that what the Beam project infrastructure is testing is what users are getting. I'm very happy to elaborate on this, and have a bit I wrote on the topic two months ago [1]. But I loathe sending out plans for things that I don't have time to work on, so it's only coming to light now.

The short points are:

* Go has been opinionated about versioning since Go 1.11, when Modules were introduced. They allow for reproducible builds with versioned deps, supported by the Go language tools.
* Packages at version 1 and greater are beholden to not make breaking changes. We're not there with the SDK yet (certainly not a 2.11 product), so IMO the SDK should be considered v0.X.
* I don't think it's reasonable to move SDK languages in lockstep with the project. E.g. the Go language is considering adopting Generics, which may necessitate a Major Version Change to the SDK user surface as it's modified to support them. It's not reasonable to move all of Beam to a new version due to a single language surface.
  * This isn't an issue, since it reads: the Go SDK version X runs against portable Beam runners at version Y.

See a recent email discussion thread [2] for other factors relating to Gradle.

*User Libraries (IOs, Transforms)*

There's a lack of testing around the IOs and Transforms in the SDK. In some cases, there aren't even unit tests. Very little time has been spent by anyone to bring these to production quality.

*The best route to production IOs right now would be to work on Cross Language IO support with the Go SDK. I imagine it would be similar to what Python is doing.*

The Bounded IOs that exist are largely "toys" not written for serious production use. For Bounded cases, this is largely due to the lack of SDF, of judicious use of reshuffle, or of other known patterns for reading data scalably. You'll note they aren't meaningfully tested anywhere either.

For Unbounded IOs, there's only one presently, and that's the Google Cloud PubSub IO. It's not portable. It can't be portable until we've implemented State+Timers, or SDFs. At present, it only works on Dataflow, and does so with runner substitution. As such, it uses the same pubsub connector that Streaming Dataflow jobs use. Interestingly, this means it can scale properly, and is technically the only one that can scale properly. Unfortunately, it only works on Dataflow.

My work on using the Beam Go SDK inside Google uses a variant of Cross Language IO.
This is one reason why I haven't spent any time on the IOs: they aren't necessary inside Google, and so far there hasn't been a use case I could contrive to justify spending the time to fix them up.

*General SDK Code Quality*

In my opinion the SDK is presently reasonable on general code quality. Most critical aspects have tests, and from Google internal testing on complex and large amounts of data, the SDK is performant once a few bits of code generation are done to avoid reflection on the hot path.

Various combinations of features should be vetted together better, e.g. using composites wrapping various other Beam primitives. This was an issue resolved recently for CoGBKs.

*Beam Features*

The SDK is largely usable for Batch Pipelines. I know this since that's what I'm ensuring is the case for a Google internal runner. I know the following "classes of feature" work for the batch use cases, to varying levels of documentation and testing.

* DoFns
* CombineFns
* Combiner Lifting
* CoGroupByKey (Joins)
* Side Inputs
* User Defined Coders
* Global Windows
* User Metrics (though they need to move to the new Beam Metrics protos)

Streaming is another story. The following aren't implemented:

* State + Timers + Triggers
  * Necessary for portable pubsub IOs for example.
* SDFs aren't implemented yet
  * Necessary for
* Windows
  * Session Windows
  * Custom WindowFns

I haven't run anything in streaming mode, so there are likely other features and considerations I'm missing.

The following are implemented but not meaningfully tested:

* Windows
  * Fixed Windowing
  * Sliding Windows

Others, like Large Iterables Support or Schemas, are not yet implemented either. There are likely more, but I'd need to list everything from the compatibility matrix.

*What I'm spending my time on*

Documenting, and debugging Google internal user issues.
The following artifacts will be produced externally in the next few months:

* Improved user documentation/programming on the Go SDK (targeted to folks who know Go, but not Beam or any distributed programming).
* An SDK contribution guide to be put on the Wiki, focusing on the "Life of a Pipeline" from the user controller to the worker perspective, and mapping each of those parts to where the SDK deals with them. This should enable others to contribute Beam features to the SDK.
* The Versioning Issue mentioned above; it's finicky.
* Large (State Backed) Iterable Support

*What I'd love help with*

1. Getting the existing suite of SDK integration tests running against a ULR or Flink (there are JIRAs for these).
2. Improving existing IOs, adding tests for existing features over adding new ones.
   a) Migrate the existing IOs to use the Go CDK where possible (needs to wait for the Versioning/GoModules/Gradle issue to be resolved though).

Your friendly neighbourhood Distributed Gopher Wrangler,
Robert Burke (@lostluck)

[1] https://docs.google.com/document/d/1nB5qCarN0jmo40zH1J0icZa6Wyb0v4u08AdG4WDTTEY/edit
[2] https://lists.apache.org/thread.html/8952f546b449ce8682db221e7688db546e25145c31cd835ed88ad172@%3Cdev.beam.apache.org%3E

On Sat, 13 Apr 2019 at 11:30, Thomas Weise <[email protected]> wrote:

> How "experimental" is the Go SDK? What are the major work items to reach
> MVP? How close are we to be able to run let's say wordcount on the portable
> Flink runner?
>
> How current is the roadmap [1]? JIRA [2] could suggest that there is a lot
> of work left to do?
>
> Thanks,
> Thomas
>
> [1] https://beam.apache.org/roadmap/go-sdk/
> [2] https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20and%20component%20%3D%20sdk-go%20and%20resolution%20%3D%20Unresolved%20
