Thanks for the in-depth summary.

On Mon, Apr 15, 2019 at 4:19 PM Robert Burke <rob...@frantil.com> wrote:

> Hi Thomas! I'm so glad you asked!
>
> The status of the Go SDK is complicated, so this email can't be brief.
> There are several dimensions to consider: as a Go Open Source Project,
> User Libraries and Experience, and Beam Features.
>
> I'm going to be updating the roadmap later this month when I have a spare
> moment.
>
> *tl;dr;*
> I would *love* help in improving the Go SDK, especially around
> interactions with Java/Python/Flink. Java and I do not have a good working
> relationship for operational purposes, and the last time I used Python, I
> had to re-image my machine. There's lots to do, but shouting out tasks to
> the void is rarely as productive as it is cathartic. If there's an offer to
> help, and a preference for/experience with something to work on, I'm
> willing to find something useful to get started on for you.
>
> (Note: The following are simply my opinion as someone who works with the
> project weekly as a Go programmer, and should not be treated as demands or
> gospel. I just don't have anyone to talk about Go SDK issues with, and my
> previous discussions have largely seemed to fall on uninterested ears.)
>
> *The SDK can be considered Alpha when all of the following are true:*
> * The SDK is tested by the Beam project on a ULR and on Flink as well as
> Dataflow.
> * The IOs have received some love to ensure they can scale (either through
> SDF or reshuffles), and be portable to different environments (eg. using
> the Go Cloud Development Kit (CDK) libraries).
>    * Cross-Language IO support would also be acceptable.
> * The SDK is using Go Modules for dependency management, marking it as
> version 0.Minor (where Minor should probably track the mainline Beam minor
> version for now).
>
> *We can move to calling it Beta when all of the following are true:*
> * All implemented Beam features are meaningfully tested on the
> portable runners (eg. a proper "Validates Runner" suite exists in Go)
> * The SDK is properly documented on the Beam site, and in its Go Docs.
>
> After this, I'll be more comfortable recommending it as something folks
> can use for production.
> That said, there are happy paths that are useable today in batch
> situations.
>
> *Intro*
> The Go SDK is a purely Beam Portable SDK. If it runs on a distributed
> system at all, it's being run portably. Currently it's regularly tested on
> Google Cloud Dataflow (though Dataflow doesn't officially support the SDK
> at this time), and on its own single bundle Direct Runner (intended for
> unit testing purposes). In addition, it's being tested at scale within
> Google, on an internal runner, where it presently satisfies our performance
> benchmarks and correctness tests.
>
> I've been working on cases to make the SDK suitable for data processing
> within Google. This unfortunately makes my contributions more towards
> general SDK usability, documentation, and performance, rather than "making
> it usable outside Google". Note this also precludes necessary work to
> resolve issues with running Go SDK pipelines on Google Cloud Dataflow. I
> believe that the SDK must become a good member of both the Go ecosystem
> and the Beam ecosystem.
>
> Improved Go Docs are on their way, and Daniel Oliviera has been helping
> me make the "getting started" experience better by improving pipeline
> construction time error messages.
>
> Finally, many of the following issues have JIRAs already, but some don't.
> It would take me time I don't have to audit and line everything up for
> this email, so please look before you file JIRAs for things mentioned
> below, should the urge strike you.
>
>
> *As a Go Open Source Project*
> As an open source project written in Go, the SDK is lagging on adopting Go
> Modules for Dependency Management and Versioning.
>
> Using Go Modules would ensure that what the Beam project infrastructure is
> testing is what users are getting. I'm very happy to elaborate on this,
> and I wrote a bit on the topic two months ago[1]. But I loathe sending out
> plans for things that I don't have time to work on, so it's only coming to
> light now.
>
> The short points are:
> * Go is opinionated about versioning since Go 1.11, when Modules were
> introduced. They allow for reproducible builds with versioned deps,
> supported by the Go language tools.
> * Packages at v1 and greater are beholden to not make breaking changes.
> We're not there with the SDK yet (certainly not a 2.11 product), so IMO
> the SDK should be considered v0.X.
> * I don't think it's reasonable to move SDK languages in lockstep with the
> project. Eg. The Go language is considering adopting Generics, which may
> necessitate a Major Version Change to the SDK user surface as it's modified
> to support them. It's not reasonable to move all of Beam to a new version
> due to a single language surface.
>    * This isn't an issue since it reads: the Go SDK version X, runs
> against portable beam runners at version Y.
>
> See a recent email discussion thread [2] for other factors relating to
> Gradle.
>
> *User Libraries (IOs, Transforms)*
> There's a lack of testing around the IOs and Transforms in the SDK. In
> some cases, not even unit tests. Very little time has been spent by anyone
> to bring these to production quality.
>
> *The best route to production IOs right now would be to work on Cross
> Language IO support with the Go SDK. I imagine it would be similar to what
> Python is doing.*
>
> The Bounded IOs that exist are largely "toys" not written for serious
> production use. For Bounded cases, this is largely due to the lack of SDF
> or using reshuffle judiciously, or leveraging other known patterns to
> scalably read data. You'll note they aren't meaningfully tested anywhere as
> well.
>
> For Unbounded IOs, there's only 1 presently, and that's the Google Cloud
> PubSub IO. It's not portable. It can't be portable until we've implemented
> State+Timers, or SDFs. At present, it only works on Dataflow, and does so
> with runner substitution. As such, it uses the same pubsub connector that
> Streaming Dataflow jobs use. Interestingly, this means it can scale
> properly, and is technically the only one that can scale properly.
> Unfortunately, it only works on Dataflow.
>
> My work on using the Beam Go SDK inside Google uses a variant of Cross
> Language IO. This is one reason why I haven't spent any time on the IOs,
> because they aren't necessary inside Google, and there's not been a use case
> I could contrive to spend the time to fix them up so far.
>
> *General SDK Code Quality*
> In my opinion the SDK is presently reasonable on general code quality.
> Most critical aspects have tests, and from Google internal testing on
> complex and large amounts of data, the SDK is performant, once a few bits
> of code generation are done to avoid reflection on the hot path.
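
To make the point about reflection on the hot path concrete, here is a minimal sketch in plain Go. It is not the Beam Go SDK's actual mechanism: `caller`, `stringToIntCaller`, `reflectCaller` and `makeCaller` are hypothetical names invented for illustration. The idea is that a wrapper specialised for one function signature (the kind of thing a code generator could emit) calls the user function directly, while any other signature falls back to `reflect`.

```go
package main

import (
	"fmt"
	"reflect"
)

// caller is a hypothetical interface for invoking a user function
// with a single argument. It is not part of any real SDK.
type caller interface {
	Call(arg interface{}) interface{}
}

// stringToIntCaller is the kind of wrapper a code generator might emit
// for functions of type func(string) int: one type assertion, then a
// direct call with no reflection.
type stringToIntCaller struct {
	fn func(string) int
}

func (c stringToIntCaller) Call(arg interface{}) interface{} {
	return c.fn(arg.(string))
}

// reflectCaller is the generic fallback: it works for any one-argument,
// one-result function but goes through reflect on every call.
type reflectCaller struct {
	fn reflect.Value
}

func (c reflectCaller) Call(arg interface{}) interface{} {
	out := c.fn.Call([]reflect.Value{reflect.ValueOf(arg)})
	return out[0].Interface()
}

// makeCaller picks the specialised wrapper when one exists for the
// function's signature, and falls back to reflection otherwise.
func makeCaller(fn interface{}) caller {
	switch f := fn.(type) {
	case func(string) int:
		return stringToIntCaller{fn: f}
	default:
		return reflectCaller{fn: reflect.ValueOf(fn)}
	}
}

// wordLen stands in for a user-supplied per-element function.
func wordLen(w string) int { return len(w) }

func main() {
	fast := makeCaller(wordLen)                          // specialised path
	slow := makeCaller(func(n int) int { return n * 2 }) // reflection path
	fmt.Println(fast.Call("beam"), slow.Call(21))
}
```

Both paths return the same results; the difference is only in per-call overhead, which matters when the function runs once per element.
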
>
> Various combinations of features should be vetted together better. Eg.
> Using composites wrapping various other beam primitives. This was an issue
> resolved recently for CoGBKs.
>
> *Beam Features*
> The SDK is largely usable for Batch Pipelines. I know this since that's
> what I'm ensuring is the case for a Google internal runner. I know the
> following "classes of feature" work for the batch use cases, to varying
> levels of documentation and testing.
> * DoFns
> * CombineFns
>   * Combiner Lifting
> * CoGroupByKey (Joins)
> * Side Inputs
> * User Defined Coders
> * Global Windows
> * User Metrics (though they need to move to the new beam Metrics protos)
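
For readers unfamiliar with the "Combiner Lifting" item in the list above, here is a rough sketch of the idea in plain Go. It does not use the Beam SDK: `meanAccum`, `addInput`, `mergeAccumulators`, `extractOutput` and `liftedMean` are illustrative names, and each inner slice stands in for a bundle of elements. Every bundle is pre-combined into a small accumulator before the (notional) shuffle, and only the accumulators are merged afterwards.

```go
package main

import "fmt"

// meanAccum is the partial state of a mean computation.
type meanAccum struct {
	sum   float64
	count int
}

// addInput folds one element into an accumulator.
func addInput(a meanAccum, v float64) meanAccum {
	return meanAccum{a.sum + v, a.count + 1}
}

// mergeAccumulators combines two partial results.
func mergeAccumulators(a, b meanAccum) meanAccum {
	return meanAccum{a.sum + b.sum, a.count + b.count}
}

// extractOutput turns the final accumulator into the mean.
func extractOutput(a meanAccum) float64 {
	if a.count == 0 {
		return 0
	}
	return a.sum / float64(a.count)
}

// liftedMean pre-combines each bundle locally, then merges the
// per-bundle accumulators, so only one small value per bundle would
// need to cross a shuffle boundary.
func liftedMean(bundles [][]float64) float64 {
	var total meanAccum
	for _, bundle := range bundles {
		var partial meanAccum
		for _, v := range bundle {
			partial = addInput(partial, v)
		}
		total = mergeAccumulators(total, partial)
	}
	return extractOutput(total)
}

func main() {
	fmt.Println(liftedMean([][]float64{{1, 2, 3}, {4, 5}, {}}))
}
```

The result is the same as computing the mean over all elements at once; lifting only changes where the partial work happens.
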
>
> Streaming is another story. The following aren't implemented:
> * State + Timers + Triggers
>   * Necessary for portable pubsub IOs for example.
> * SDFs aren't implemented yet
>    * Necessary for
> * Windows
>    * Session Windows
>    * Custom WindowFns
>
> I haven't run anything in streaming mode, so there are likely other
> features and considerations I'm missing.
>
> The following are implemented but not meaningfully tested:
> * Windows
>   * Fixed Windowing
>   * Sliding Windows
>
> Others, like Large Iterables support or Schemas, are not yet implemented
> either. There are likely more, but I'd need to list everything from the
> compatibility matrix.
>
> *What I'm spending my time on*
> Documenting, and debugging Google internal user issues. The following
> artifacts will be produced externally in the next few months:
> * Improved user documentation/programming on the Go SDK (targeted to folks
> who know Go, but not Beam, or any distributed programming).
> * An SDK contribution guide to be put on the Wiki, focusing on "Life of a
> Pipeline" from the user controller to the worker perspective, and mapping
> each of those parts to where the SDK deals with them.
> This should enable others to contribute Beam features to the SDK.
> * The Versioning Issue mentioned above, it's finicky.
> * Large (State Backed) Iterable Support
>
> *What I'd love help with*
> 1. Getting the existing suite of SDK integration tests running against a
> ULR or Flink (there are JIRAs for these).
> 2. Improving existing IOs, adding tests for existing features over adding
> new ones.
>    a) Migrate the existing IOs to use the Go CDK where possible (needs to
> wait for the Versioning/GoModules/Gradle issue to be resolved though).
>
> Your friendly neighbourhood Distributed Gopher Wrangler,
> Robert Burke (@lostluck)
>
> [1]
> https://docs.google.com/document/d/1nB5qCarN0jmo40zH1J0icZa6Wyb0v4u08AdG4WDTTEY/edit
>
> [2]
> https://lists.apache.org/thread.html/8952f546b449ce8682db221e7688db546e25145c31cd835ed88ad172@%3Cdev.beam.apache.org%3E
>
> On Sat, 13 Apr 2019 at 11:30, Thomas Weise <t...@apache.org> wrote:
>
>> How "experimental" is the Go SDK? What are the major work items to reach
>> MVP? How close are we to being able to run, let's say, wordcount on the
>> portable Flink runner?
>>
>> How current is the roadmap [1]? JIRA [2] could suggest that there is a
>> lot of work left to do?
>>
>> Thanks,
>> Thomas
>>
>> [1] https://beam.apache.org/roadmap/go-sdk/
>> [2]
>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20and%20component%20%3D%20sdk-go%20and%20resolution%20%3D%20Unresolved%20
>>
>>
