Hi Robert,

Great summary on the current state of play. FYI, the referenced Google Doc doesn't appear to be accessible to people outside the org by default.
Great to hear the Go SDK is still getting love. I last looked at it in September-October of last year.

Cheers,
Nathan

On Wed, 17 Apr 2019 at 20:27, Lukasz Cwik <[email protected]> wrote:

> Thanks for the in-depth summary.
>
> On Mon, Apr 15, 2019 at 4:19 PM Robert Burke <[email protected]> wrote:
>
>> Hi Thomas! I'm so glad you asked!
>>
>> The status of the Go SDK is complicated, so this email can't be brief. There are several dimensions to consider: as a Go Open Source Project, User Libraries and Experience, and Beam Features.
>>
>> I'm going to be updating the roadmap later this month when I have a spare moment.
>>
>> *tl;dr;*
>> I would *love* help in improving the Go SDK, especially around interactions with Java/Python/Flink. Java and I do not have a good working relationship for operational purposes, and the last time I used Python, I had to re-image my machine. There's lots to do, but shouting out tasks to the void is rarely as productive as it is cathartic. If there's an offer to help, and a preference for/experience with something to work on, I'm willing to find something useful to get started on for you.
>>
>> (Note: The following are simply my opinion as someone who works with the project weekly as a Go programmer, and should not be treated as demands or gospel. I just don't have anyone to talk about Go SDK issues with, and my previous discussions have largely seemed to fall on uninterested ears.)
>>
>> *The SDK can be considered Alpha when all of the following are true:*
>> * The SDK is tested by the Beam project on a ULR and on Flink as well as Dataflow.
>> * The IOs have received some love to ensure they can scale (either through SDF or reshuffles), and be portable to different environments (eg. using the Go Cloud Development Kit (CDK) libraries).
>>   * Cross-Language IO support would also be acceptable.
>> * The SDK is using Go Modules for dependency management, marking it as version 0.Minor (where Minor should probably track the mainline Beam minor version for now).
>>
>> *We can move to calling it Beta when all of the following are true:*
>> * All implemented Beam features are meaningfully tested on the portable runners (eg. a proper "Validates Runner" suite exists in Go).
>> * The SDK is properly documented on the Beam site, and in its Go Docs.
>>
>> After this, I'll be more comfortable recommending it as something folks can use for production.
>> That said, there are happy paths that are usable today in batch situations.
>>
>> *Intro*
>> The Go SDK is a purely Beam Portable SDK. If it runs on a distributed system at all, it's being run portably. Currently it's regularly tested on Google Cloud Dataflow (though Dataflow doesn't officially support the SDK at this time), and on its own single bundle Direct Runner (intended for unit testing purposes). In addition, it's being tested at scale within Google, on an internal runner, where it presently satisfies our performance benchmarks and correctness tests.
>>
>> I've been working on cases to make the SDK suitable for data processing within Google. This unfortunately skews my contributions towards general SDK usability, documentation, and performance, rather than "making it usable outside Google". Note this also precludes necessary work to resolve issues with running Go SDK pipelines on Google Cloud Dataflow. I believe that the SDK must become a good member of the Go ecosystem and the Beam ecosystem.
>>
>> Improved Go Docs are on their way, and Daniel Oliviera has been helping me make the "getting started" experience better by improving pipeline construction time error messages.
>>
>> Finally, many of the following issues have JIRAs already; some don't.
>> It would take me time I don't have to audit and line everything up for this email, so please look before you file JIRAs for things mentioned below, should the urge strike you.
>>
>>
>> *As a Go Open Source Project*
>> As an open source project written in Go, the SDK is lagging on adopting Go Modules for Dependency Management and Versioning.
>>
>> Using Go Modules would ensure that what the Beam project infrastructure is testing is what users are getting. I'm very happy to elaborate on this, and wrote a bit on the topic two months ago [1]. But I loathe sending out plans for things that I don't have time to work on, so it's only coming to light now.
>>
>> The short points are:
>> * Go is opinionated about versioning since Go 1.11, when Modules were introduced. They allow for reproducible builds with versioned deps, supported by the Go language tools.
>> * Packages at version 1 and greater are beholden to not make breaking changes. We're not there with the SDK yet (certainly not a 2.11 product), so IMO the SDK should be considered v0.X.
>> * I don't think it's reasonable to move SDK languages in lockstep with the project. Eg. the Go language is considering adopting Generics, which may necessitate a Major Version Change to the SDK user surface as it's modified to support them. It's not reasonable to move all of Beam to a new version due to a single language surface.
>>   * This isn't an issue since it reads: the Go SDK version X runs against portable Beam runners at version Y.
>>
>> See a recent email discussion thread [2] for other factors relating to Gradle.
>>
>> *User Libraries (IOs, Transforms)*
>> There's a lack of testing around the IOs and Transforms in the SDK. In some cases, there aren't even unit tests. Very little time has been spent by anyone to bring these to production quality.
>>
>> *The best route to production IOs right now would be to work on Cross Language IO support with the Go SDK.
>> I imagine it would be similar to what Python is doing.*
>>
>> The Bounded IOs that exist are largely "toys" not written for serious production use. For Bounded cases, this is largely due to the lack of SDF, judicious use of reshuffle, or other known patterns for reading data scalably. You'll note they aren't meaningfully tested anywhere either.
>>
>> For Unbounded IOs, there's only 1 presently, and that's the Google Cloud PubSub IO. It's not portable. It can't be portable until we've implemented State+Timers, or SDFs. At present, it only works on Dataflow, and does so with runner substitution. As such, it uses the same pubsub connector that Streaming Dataflow jobs use. Interestingly, this means it can scale properly, and is technically the only one that can scale properly. Unfortunately, it only works on Dataflow.
>>
>> My work on using the Beam Go SDK inside Google uses a variant of Cross Language IO. This is one reason why I haven't spent any time on the IOs: they aren't necessary inside Google, and so far there hasn't been a use case I could contrive to justify spending the time to fix them up.
>>
>> *General SDK Code Quality*
>> In my opinion the SDK is presently reasonable on general code quality. Most critical aspects have tests, and from Google internal testing on complex and large amounts of data, the SDK is performant, once a bit of code generation is done to avoid reflection on the hot path.
>>
>> Various combinations of features should be vetted together better. Eg. using composites that wrap various other Beam primitives. This was an issue resolved recently for CoGBKs.
>>
>> *Beam Features*
>> The SDK is largely usable for Batch Pipelines. I know this since that's what I'm ensuring is the case for a Google internal runner. I know the following "classes of feature" work for the batch use cases, to varying levels of documentation and testing.
>> * DoFns
>> * CombineFns
>> * Combiner Lifting
>> * CoGroupByKey (Joins)
>> * Side Inputs
>> * User Defined Coders
>> * Global Windows
>> * User Metrics (though they need to move to the new Beam Metrics protos)
>>
>> Streaming is another story. The following aren't implemented:
>> * State + Timers + Triggers
>>   * Necessary for portable pubsub IOs for example.
>> * SDFs aren't implemented yet
>>   * Necessary for
>> * Windows
>>   * Session Windows
>>   * Custom WindowFns
>>
>> I haven't run anything in streaming mode, so there are likely other features and considerations I'm missing.
>>
>> The following are implemented but not meaningfully tested:
>> * Windows
>>   * Fixed Windowing
>>   * Sliding Windows
>>
>> Others, like Large Iterables Support or Schemas, are not yet implemented either. There are likely more, but I'd need to list everything from the compatibility matrix.
>>
>> *What I'm spending my time on*
>> Documenting, and debugging Google internal user issues. The following artifacts will be produced externally in the next few months:
>> * Improved user documentation/programming on the Go SDK (targeted to folks who know Go, but not Beam, or any distributed programming).
>> * An SDK contribution guide to be put on the Wiki, focusing on "Life of a Pipeline" from the user controller to the worker perspective, and mapping each of those parts to where the SDK deals with them. This should enable others to contribute Beam features to the SDK.
>> * The Versioning Issue mentioned above; it's finicky.
>> * Large (State Backed) Iterable Support
>>
>> *What I'd love help with*
>> 1. Getting the existing suite of SDK integration tests running against a ULR or Flink (there are JIRAs for these).
>> 2. Improving existing IOs, adding tests for existing features over adding new ones.
>>    a) Migrate the existing IOs to use the Go CDK where possible (needs to wait for the Versioning/GoModules/Gradle issue to be resolved though).
>>
>> Your friendly neighbourhood Distributed Gopher Wrangler,
>> Robert Burke (@lostluck)
>>
>> [1] https://docs.google.com/document/d/1nB5qCarN0jmo40zH1J0icZa6Wyb0v4u08AdG4WDTTEY/edit
>>
>> [2] https://lists.apache.org/thread.html/8952f546b449ce8682db221e7688db546e25145c31cd835ed88ad172@%3Cdev.beam.apache.org%3E
>>
>> On Sat, 13 Apr 2019 at 11:30, Thomas Weise <[email protected]> wrote:
>>
>>> How "experimental" is the Go SDK? What are the major work items to reach MVP? How close are we to be able to run let's say wordcount on the portable Flink runner?
>>>
>>> How current is the roadmap [1]? JIRA [2] could suggest that there is a lot of work left to do?
>>>
>>> Thanks,
>>> Thomas
>>>
>>> [1] https://beam.apache.org/roadmap/go-sdk/
>>> [2] https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20and%20component%20%3D%20sdk-go%20and%20resolution%20%3D%20Unresolved%20
>>>
>>>

--
Nathan Fisher
w: http://junctionbox.ca/
