This was super fun, and I really hope it can serve as inspiration to others: you can build a working Beam SDK in a week!
(hint hint https://issues.apache.org/jira/browse/BEAM-4010 and https://issues.apache.org/jira/browse/BEAM-12658 :-)

On Fri, Jan 14, 2022 at 11:38 AM Robert Bradshaw <[email protected]> wrote:

And, of course, an example:

https://github.com/robertwb/beam-javascript/blob/javascript/sdks/node-ts/src/apache_beam/examples/wordcount.ts

On Fri, Jan 14, 2022 at 11:35 AM Robert Bradshaw <[email protected]> wrote:

Last week at Google we had a hackathon to kick off the new year, and one of the projects we came up with was seeing how far we could get in putting together a TypeScript SDK. Starting from nothing, we were able to make a lot of progress, and I wanted to share the results here.

https://github.com/robertwb/beam-javascript/blob/javascript/sdks/node-ts/README.md

I think this is an exciting project and look forward to officially supporting a new language. Clearly there is still a fair amount to do, and we also need to figure out the best way to get this reviewed (we'd especially welcome feedback (and contributions) from those, if any, in the know about JavaScript/TypeScript/Node, even if they're not Beam or distributed-computing experts) and into the main repository (assuming the community is as interested in this as I am).

The above link is a decent overview, but I'm copying it below for posterity, as it will likely evolve over time (e.g. as decisions get made and TODOs get resolved).

- Robert

--------------------

# Node Beam SDK

This is the start of a fully functioning JavaScript (actually, TypeScript) SDK. There are two distinct aims with this SDK:

1. Tap into the large (and, by existing data processing frameworks, relatively underserved) community of JavaScript developers with a native SDK targeting this language.

2.
Develop a new SDK which can serve both as a proof of concept and a reference that highlights the (relative) ease of porting Beam to new languages, a differentiating feature of Beam and Dataflow.

To accomplish this, we lean heavily on the portability framework. For example, we make heavy use of cross-language transforms, in particular for IOs (as a full SDF implementation may not fit into the week). In addition, the direct runner is simply an extension of the worker suitable for running on portable runners such as the ULR, which will transfer directly to running on production runners such as Dataflow and Flink. The target audience should hopefully not be put off by running other-language code encapsulated in Docker images.

## API

We generally try to apply the concepts from the Beam API in a TypeScript-idiomatic way, but it should be noted that few of the initial developers have extensive (if any) JavaScript/TypeScript development experience, so feedback is greatly appreciated.

In addition, some notable departures are taken from the traditional SDKs:

* We take a "relational foundations" approach, where [schema'd data](https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#heading=h.puuotbien1gf) is the primary way to interact with data, and we generally eschew the key-value-requiring transforms in favor of a more flexible approach of naming fields or expressions. JavaScript's native Object is used as the row type.

* As part of being schema-first, we also de-emphasize Coders as a first-class concept in the SDK, relegating them to an advanced feature used for interop. Though we can infer schemas from individual elements, it is still TBD to figure out if/how we can leverage the type system and/or function introspection to regularly infer schemas at construction time.
A fallback coder using BSON encoding is used when we don't have sufficient type information.

* We have added additional methods to the PCollection object, notably `map` and `flatMap`, [rather than only allowing apply](https://www.mail-archive.com/[email protected]/msg06035.html). In addition, `apply` can accept a function argument `(PCollection) => ...` as well as a PTransform subclass, treating the callable as if it were a PTransform's expand.

* In the other direction, we have eliminated the [problematic Pipeline object](https://s.apache.org/no-beam-pipeline) from the API, instead providing a `Root` PValue on which pipelines are built, and invoking run() on a Runner. We offer a less error-prone `Runner.run`, which finishes only when the pipeline is completely finished, as well as `Runner.runAsync`, which returns a handle to the running pipeline.

* Rather than introduce PCollectionTuple, PCollectionList, etc., we let PValue literally be an [array or object with PValue values](https://github.com/robertwb/beam-javascript/blob/de4390dd767f046903ac23fead5db333290462db/sdks/node-ts/src/apache_beam/pvalue.ts#L116) which transforms can consume or produce. These are applied by wrapping them with the `P` operator, e.g. `P([pc1, pc2, pc3]).apply(new Flatten())`.

* Like Python, `flatMap` and `ParDo.process` return multiple elements by yielding them from a generator, rather than invoking a passed-in callback. It is TBD how to output to multiple distinct PCollections. There is currently an operation to split a PCollection into multiple PCollections based on the properties of the elements, and we may consider using a callback for side outputs.

* The `map`, `flatMap`, and `ParDo.process` methods take an additional (optional) context argument, which is similar to the keyword arguments used in Python.
These can be "ordinary" JavaScript objects (which are passed as-is) or special DoFnParam objects which provide getters to element-specific information (such as the current timestamp, window, or side input) at runtime.

* JavaScript supports (and encourages) an asynchronous programming model, with many libraries requiring use of the async/await paradigm. As there is no way (by design) to go from the asynchronous style back to the synchronous style, this needs to be taken into account when designing the API. We currently offer asynchronous variants of `PValue.apply(...)` (in addition to the synchronous ones, as they are easier to chain), as well as making `Runner.run` asynchronous. It is TBD whether to do this for all user callbacks as well.

## TODO

This SDK is a work in progress. In January 2022 we developed the ability to construct and run basic pipelines (including external transforms and running on a portable runner), but the following big-ticket items remain.

* Containerization

  * Function and object serialization: we currently only support "loopback" mode; to be able to run in a remote, distributed manner we need to finish up the work on pickling closures and DoFn objects. Some investigation has been started here, but all existing libraries have non-trivial drawbacks.

  * Finish the work of building a full SDK container image that starts the worker.

* External transforms

  * Using external transforms requires that the external expansion service already be started and its address provided. We would like to start it automatically, as we do in Python.

  * Artifacts are not currently supported, which will be essential for using Java transforms. (All tests use Python.)

* API

  * Side inputs are not yet supported.

  * There are several TODOs of minor features or design decisions to finalize.
  * Advanced features like metrics, state, timers, and SDF. Possibly some of these can wait.

* Infrastructure

  * Gradle and Jenkins integration for tests and style enforcement.

* Other

  * Standardize on a way for users to pass PTransform names, and enforce unique names for pipeline update.

  * Use a JavaScript Object rather than a proto Struct for pipeline options.

  * Though Dataflow Runner v2 supports portability, submission is still done via v1beta3 and interaction with GCS rather than the job submission API.

  * Properly wait for bundle completion.

There is probably more; there are many TODOs littered throughout the code.

This code has also not yet been fully peer reviewed (it was the result of a hackathon), which needs to be done before putting it into the main repository.

## Development

### Getting started

Install node.js, and then, from within `sdks/node-ts`, run:

```
npm install
```

### Running tests

```
$ npm test
```

### Style

We have adopted prettier, which can be run with

```
$ npx prettier --write .
```
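To make two of the API ideas above concrete (yielding multiple outputs from a generator, and a context argument whose DoFnParam-like entries are resolved per element at runtime), here is a self-contained, SDK-free sketch. Every name in it (`ParamLike`, `flatMap`, etc.) is invented for illustration and is not the SDK's actual API:

```typescript
// Stand-in for a DoFnParam-style object: a value looked up per element.
interface ParamLike<T> {
  lookup(): T;
}

function isParam(v: unknown): v is ParamLike<unknown> {
  return (
    typeof v === "object" &&
    v !== null &&
    typeof (v as ParamLike<unknown>).lookup === "function"
  );
}

// Toy flatMap: the user function is a generator that may yield any number
// of outputs per input. Context entries that look like params are resolved
// just before each call; everything else is passed through as-is.
function flatMap<T, O>(
  elements: T[],
  fn: (el: T, ctx: Record<string, unknown>) => Generator<O>,
  context: Record<string, unknown> = {}
): O[] {
  const out: O[] = [];
  for (const el of elements) {
    const resolved: Record<string, unknown> = {};
    for (const [k, v] of Object.entries(context)) {
      resolved[k] = isParam(v) ? v.lookup() : v;
    }
    for (const result of fn(el, resolved)) {
      out.push(result);
    }
  }
  return out;
}

// Word-count-style usage: split lines into words, tagging each word with a
// (fake) timestamp supplied through the context.
let fakeClock = 100;
const words = flatMap(
  ["hello world", "beam"],
  function* (line, ctx) {
    for (const word of line.split(/\s+/)) {
      if (word) yield `${word}@${ctx.timestamp}`;
    }
  },
  { timestamp: { lookup: () => fakeClock++ } }
);
// words: ["hello@100", "world@100", "beam@101"]
```

In the real SDK these pieces live on PCollection and the runner, of course; the sketch only shows why the generator style composes nicely with per-element parameter resolution.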
