Oh dang. Thanks for mentioning that! Here's an open copy of the versioning thoughts doc, though there shouldn't be any surprises from the points I mentioned above.
https://docs.google.com/document/d/1ZjP30zNLWTu_WzkWbgY8F_ZXlA_OWAobAD9PuohJxPg/edit#heading=h.drpipq762xi7

On Wed, 17 Apr 2019 at 21:20, Nathan Fisher <[email protected]> wrote:
> Hi Robert,
>
> Great summary on the current state of play. FYI the referenced G doc
> isn't visible to people outside the org by default.
>
> Great to hear the Go SDK is still getting love. I last looked at it in
> September-October of last year.
>
> Cheers,
> Nathan
>
> On Wed, 17 Apr 2019 at 20:27, Lukasz Cwik <[email protected]> wrote:
>
>> Thanks for the in-depth summary.
>>
>> On Mon, Apr 15, 2019 at 4:19 PM Robert Burke <[email protected]> wrote:
>>
>>> Hi Thomas! I'm so glad you asked!
>>>
>>> The status of the Go SDK is complicated, so this email can't be brief.
>>> There are several dimensions to consider: as a Go Open Source Project,
>>> User Libraries and Experience, and on Beam Features.
>>>
>>> I'm going to be updating the roadmap later this month when I have a
>>> spare moment.
>>>
>>> *tl;dr;*
>>> I would *love* help in improving the Go SDK, especially around
>>> interactions with Java/Python/Flink. Java and I do not have a good working
>>> relationship for operational purposes, and the last time I used Python, I
>>> had to re-image my machine. There's lots to do, but shouting out tasks to
>>> the void is rarely as productive as it is cathartic. If there's an offer to
>>> help, and a preference for/experience with something to work on, I'm
>>> willing to find something useful to get started on for you.
>>>
>>> (Note: The following are simply my opinion as someone who works with the
>>> project weekly as a Go programmer, and should not be treated as demands or
>>> gospel. I just don't have anyone to talk about Go SDK issues with, and my
>>> previous discussions have largely seemed to fall on uninterested ears.)
>>>
>>> *The SDK can be considered Alpha when all of the following are true:*
>>> * The SDK is tested by the Beam project on a ULR and on Flink as well as
>>> Dataflow.
>>> * The IOs have received some love to ensure they can scale (either
>>> through SDF or reshuffles), and be portable to different environments (e.g.
>>> using the Go Cloud Development Kit (CDK) libraries).
>>>   * Cross-Language IO support would also be acceptable.
>>> * The SDK is using Go Modules for dependency management, marking it as
>>> version 0.Minor (where Minor should probably track the mainline Beam minor
>>> version for now).
>>>
>>> *We can move to calling it Beta when all of the following are true:*
>>> * All implemented Beam features are meaningfully tested on the
>>> portable runners (e.g. a proper "Validates Runner" suite exists in Go).
>>> * The SDK is properly documented on the Beam site, and in its Go Docs.
>>>
>>> After this, I'll be more comfortable recommending it as something folks
>>> can use for production.
>>> That said, there are happy paths that are usable today in batch
>>> situations.
>>>
>>> *Intro*
>>> The Go SDK is a purely Beam Portable SDK. If it runs on a distributed
>>> system at all, it's being run portably. Currently it's regularly tested on
>>> Google Cloud Dataflow (though Dataflow doesn't officially support the SDK
>>> at this time), and on its own single bundle Direct Runner (intended for
>>> unit testing purposes). In addition, it's being tested at scale within
>>> Google, on an internal runner, where it presently satisfies our performance
>>> benchmarks and correctness tests.
>>>
>>> I've been working on cases to make the SDK suitable for data processing
>>> within Google. This unfortunately makes my contributions more towards
>>> general SDK usability, documentation, and performance, rather than "making
>>> it usable outside Google".
>>> Note this also precludes necessary work to
>>> resolve issues with running Go SDK pipelines on Google Cloud Dataflow. I
>>> believe that the SDK must become a good member of the Go ecosystem and the
>>> Beam ecosystem.
>>>
>>> Improved Go Docs are on their way, and Daniel Oliviera has been helping
>>> me make the "getting started" experience better by improving pipeline
>>> construction time error messages.
>>>
>>> Finally, many of the following issues have JIRAs already; some don't. It
>>> would take me time I don't have to audit and line everything up for this
>>> email, so please look before you file JIRAs for things mentioned below,
>>> should the urge strike you.
>>>
>>>
>>> *As a Go Open Source Project*
>>> As an open source project written in Go,
>>> the SDK is lagging on adopting Go Modules for Dependency Management and
>>> Versioning.
>>>
>>> Using Go Modules would ensure that what the Beam project
>>> infrastructure is testing is what users are getting. I'm very happy to
>>> elaborate on this, and have a bit I wrote on the topic two months
>>> ago [1]. But I loathe sending out plans for things that I don't have time
>>> to work on, so it's only coming to light now.
>>>
>>> The short points are:
>>> * Go is opinionated about versioning since Go 1.11, when Modules were
>>> introduced. They allow for reproducible builds with versioned deps,
>>> supported by the Go language tools.
>>> * Packages at v1 and greater are beholden to not make breaking changes.
>>> We're not there with the SDK yet (certainly not a 2.11 product), so IMO the
>>> SDK should be considered v0.X.
>>> * I don't think it's reasonable to move SDK languages in lockstep with
>>> the project. E.g. the Go language is considering adopting Generics, which
>>> may necessitate a Major Version Change to the SDK user surface as it's
>>> modified to support them. It's not reasonable to move all of Beam to a new
>>> version due to a single language surface.
>>> * This isn't an issue since it reads: the Go SDK version X runs
>>> against portable Beam runners at version Y.
>>>
>>> See a recent email discussion thread [2] for other factors relating to
>>> Gradle.
>>>
>>> *User Libraries (IOs, Transforms)*
>>> There's a lack of testing around the IOs and Transforms in the SDK. In
>>> some cases, not even unit tests. Very little time has been spent by anyone
>>> to bring these to production quality.
>>>
>>> *The best route to production IOs right now would be to work on Cross
>>> Language IO support with the Go SDK. I imagine it would be similar to what
>>> Python is doing.*
>>>
>>> The Bounded IOs that exist are largely "toys" not written for serious
>>> production use. For Bounded cases, this is largely because they lack SDF,
>>> judicious use of reshuffle, or other known patterns to
>>> scalably read data. You'll note they aren't meaningfully tested anywhere
>>> either.
>>>
>>> For Unbounded IOs, there's only 1 presently, and that's the Google Cloud
>>> PubSub IO. It's not portable. It can't be portable until we've implemented
>>> State+Timers, or SDFs. At present, it only works on Dataflow, and does so
>>> with runner substitution. As such, it uses the same pubsub connector that
>>> Streaming Dataflow jobs use. Interestingly, this means it can scale
>>> properly, and is technically the only one that can scale properly.
>>> Unfortunately, it only works on Dataflow.
>>>
>>> My work on using the Beam Go SDK inside Google uses a variant of Cross
>>> Language IO. This is one reason why I haven't spent any time on the IOs:
>>> they aren't necessary inside Google, and there's not been a use case
>>> I could contrive to spend the time to fix them up so far.
>>>
>>> *General SDK Code Quality*
>>> In my opinion the SDK is presently reasonable on general code quality.
>>> Most critical aspects have tests, and from Google internal testing on
>>> complex and large amounts of data, the SDK is performant, once a few bits
>>> of code generation are done to avoid reflection on the hot path.
>>>
>>> Various combinations of features should be vetted together better. E.g.
>>> using composites wrapping various other Beam primitives. This was an issue
>>> resolved recently for CoGBKs.
>>>
>>> *Beam Features*
>>> The SDK is largely usable for Batch Pipelines. I know this since that's
>>> what I'm ensuring is the case for a Google internal runner. I know the
>>> following "classes of feature" work for the batch use cases, to varying
>>> levels of documentation and testing.
>>> * DoFns
>>> * CombineFns
>>> * Combiner Lifting
>>> * CoGroupByKey (Joins)
>>> * Side Inputs
>>> * User Defined Coders
>>> * Global Windows
>>> * User Metrics (though they need to move to the new Beam Metrics protos)
>>>
>>> Streaming is another story. The following aren't implemented:
>>> * State + Timers + Triggers
>>>   * Necessary for portable pubsub IOs for example.
>>> * SDFs
>>>   * Necessary for
>>> * Windows
>>>   * Session Windows
>>>   * Custom WindowFns
>>>
>>> I haven't run anything in streaming mode, so there are likely other
>>> features and considerations I'm missing.
>>>
>>> The following are implemented but not meaningfully tested:
>>> * Windows
>>>   * Fixed Windowing
>>>   * Sliding Windows
>>>
>>> Others, like Large Iterables Support or Schemas, are not yet implemented
>>> either. There are likely more, but I'd need to list everything from the
>>> compatibility matrix.
>>>
>>> *What I'm spending my time on*
>>> Documenting, and debugging Google internal user issues. The following
>>> artifacts will be produced externally in the next few months:
>>> * Improved user documentation/programming on the Go SDK (targeted to
>>> folks who know Go, but not Beam, or any distributed programming).
>>> * An SDK contribution guide to be put on the Wiki, focusing on "Life of
>>> a Pipeline" from the user controller to the worker perspective, and where
>>> in the SDK each of those parts is handled.
>>> This should enable others to contribute Beam features to the SDK.
>>> * The Versioning Issue mentioned above (it's finicky).
>>> * Large (State Backed) Iterable Support
>>>
>>> *What I'd love help with*
>>> 1. Getting the existing suite of SDK integration tests running against a
>>> ULR or Flink (there are JIRAs for these).
>>> 2. Improving existing IOs, adding tests for existing features over
>>> adding new ones.
>>>    a) Migrate the existing IOs to use the Go CDK where possible (needs
>>> to wait for the Versioning/GoModules/Gradle issue to be resolved though).
>>>
>>> Your friendly neighbourhood Distributed Gopher Wrangler,
>>> Robert Burke (@lostluck)
>>>
>>> [1]
>>> https://docs.google.com/document/d/1nB5qCarN0jmo40zH1J0icZa6Wyb0v4u08AdG4WDTTEY/edit
>>>
>>> [2]
>>> https://lists.apache.org/thread.html/8952f546b449ce8682db221e7688db546e25145c31cd835ed88ad172@%3Cdev.beam.apache.org%3E
>>>
>>> On Sat, 13 Apr 2019 at 11:30, Thomas Weise <[email protected]> wrote:
>>>
>>>> How "experimental" is the Go SDK? What are the major work items to
>>>> reach MVP? How close are we to being able to run, let's say, wordcount
>>>> on the portable Flink runner?
>>>>
>>>> How current is the roadmap [1]? JIRA [2] could suggest that there is a
>>>> lot of work left to do?
>>>>
>>>> Thanks,
>>>> Thomas
>>>>
>>>> [1] https://beam.apache.org/roadmap/go-sdk/
>>>> [2]
>>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20and%20component%20%3D%20sdk-go%20and%20resolution%20%3D%20Unresolved%20
>>>>
>>>>
>
> --
> Nathan Fisher
> w: http://junctionbox.ca/
>
