On Thu, Feb 3, 2022 at 3:43 PM Robert Burke <[email protected]> wrote:
> Personally, if it gets added to the repo at all I'd rather we rip off the
> band-aid and at least have all the tests regularly run, and various GitHub
> actions. Even if we aren't doing the container release activities, because
> it's experimental, that's much better than bit rot and being part of the
> main repo has a simpler contribution convention.
>

100% agree about bit rot and it also makes it more accessible for
contribution and experiment. This is a strong motivation for me to get it
right onto master. Some contributions to branches are probably just unknown
to a lot of contributors (or adventurous users), for example

https://github.com/apache/beam/tree/tez-runner
https://github.com/apache/beam/tree/jstorm-runner
https://github.com/apache/beam/tree/mr-runner

I'm guessing since node has a distribution via npm if we do nothing it is
essentially "not released". I don't see it as a big problem having it in
the archived ASF source releases, as long as licenses and whatnot are good,
though I may be overlooking something.

Kenn

>
> Those are my 2 cents.
> Robert B
> Beam Go Busybody
>
> On Thu, Feb 3, 2022, 3:29 PM Kenneth Knowles <[email protected]> wrote:
>
>> We did the same for the Go SDK for some time. I imagine just "not doing
>> the work to release it" suffices? Maybe +Robert Burke <[email protected]> has
>> some other memories of how to not release.
>>
>> Kenn
>>
>> On Mon, Jan 31, 2022 at 1:05 PM Kerry Donny-Clark <[email protected]>
>> wrote:
>>
>>> This project was a great way to kickstart a new SDK. I'd like to bring
>>> this into Beam and start cleanup. Are there any steps to take before making
>>> a PR? Is there a way to mark this as experimental/not for release?
>>> Kerry
>>>
>>> On Mon, Jan 17, 2022 at 1:22 AM Pablo Estrada <[email protected]>
>>> wrote:
>>>
>>>> This project was fun, and I learned a lot putting some time into it.
>>>> I'd love for it to be brought into the main repository and worked over some
>>>> time to be fully supported.
>>>> Best
>>>> -P.
>>>>
>>>> On Fri, Jan 14, 2022 at 4:46 PM Ahmet Altay <[email protected]> wrote:
>>>>
>>>>> Really nice! Congratulations to all who worked on this project.
>>>>>
>>>>> On Fri, Jan 14, 2022 at 4:41 PM Kenneth Knowles <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> This was super fun, and I really hope it can be an inspiration to
>>>>>> others that you can build a working Beam SDK in a week!
>>>>>>
>>>>>> (hint hint https://issues.apache.org/jira/browse/BEAM-4010 and
>>>>>> https://issues.apache.org/jira/browse/BEAM-12658 :-)
>>>>>>
>>>>>> On Fri, Jan 14, 2022 at 11:38 AM Robert Bradshaw <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> And, of course, an example:
>>>>>>>
>>>>>>> https://github.com/robertwb/beam-javascript/blob/javascript/sdks/node-ts/src/apache_beam/examples/wordcount.ts
>>>>>>>
>>>>>>> On Fri, Jan 14, 2022 at 11:35 AM Robert Bradshaw <[email protected]>
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > Last week at Google we had a hackathon to kick off the new year, and
>>>>>>> > one of the projects we came up with was seeing how far we could get in
>>>>>>> > putting together a typescript SDK. Starting from nothing we were able
>>>>>>> > to make a lot of progress and I wanted to share the results here.
>>>>>>> >
>>>>>>> > https://github.com/robertwb/beam-javascript/blob/javascript/sdks/node-ts/README.md
>>>>>>> >
>>>>>>> > I think this is an exciting project and look forward to officially
>>>>>>> > supporting a new language. Clearly there is still a fair amount to do,
>>>>>>> > and we also need to figure out the best way to get this reviewed (we'd
>>>>>>> > especially welcome feedback (and contributions) from those, if any, in
>>>>>>> > the know about javascript/typescript/node even if they're not beam or
>>>>>>> > distributed computing experts) and into the main repository (assuming
>>>>>>> > the community is as interested in this as I am).
>>>>>>> >
>>>>>>> > The above link is a decent overview, but copying below for posterity
>>>>>>> > as that will likely evolve over time (e.g. as decisions get made and
>>>>>>> > TODOs get resolved).
>>>>>>> >
>>>>>>> > - Robert
>>>>>>> >
>>>>>>> >
>>>>>>> > --------------------
>>>>>>> >
>>>>>>> > # Node Beam SDK
>>>>>>> >
>>>>>>> > This is the start of a fully functioning Javascript (actually,
>>>>>>> > Typescript) SDK. There are two distinct aims with this SDK:
>>>>>>> >
>>>>>>> > 1. Tap into the large (and relatively underserved, by existing data
>>>>>>> > processing frameworks) community of javascript developers with a
>>>>>>> > native SDK targeting this language.
>>>>>>> >
>>>>>>> > 2. Develop a new SDK which can serve both as a proof of concept and
>>>>>>> > reference that highlights the (relative) ease of porting Beam to new
>>>>>>> > languages, a differentiating feature of Beam and Dataflow.
>>>>>>> >
>>>>>>> > To accomplish this, we lean heavily on the portability framework. For
>>>>>>> > example, we make heavy use of cross-language transforms, in particular
>>>>>>> > for IOs (as a full SDF implementation may not fit into the week). In
>>>>>>> > addition, the direct runner is simply an extension of the worker
>>>>>>> > suitable for running on portable runners such as the ULR, which will
>>>>>>> > directly transfer to running on production runners such as Dataflow
>>>>>>> > and Flink. The target audience should hopefully not be put off by
>>>>>>> > running other language code encapsulated in docker images.
>>>>>>> >
>>>>>>> > ## API
>>>>>>> >
>>>>>>> > We generally try to apply the concepts from the Beam API in a
>>>>>>> > Typescript idiomatic way, but it should be noted that few of the
>>>>>>> > initial developers have extensive (if any) Javascript/Typescript
>>>>>>> > development experience, so feedback is greatly appreciated.
>>>>>>> >
>>>>>>> > In addition, some notable departures are taken from the traditional SDKs:
>>>>>>> >
>>>>>>> > * We take a "relational foundations" approach, where
>>>>>>> > [schema'd data](https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#heading=h.puuotbien1gf)
>>>>>>> > is the primary way to interact with data, and we generally eschew the
>>>>>>> > key-value requiring transforms in favor of a more flexible approach
>>>>>>> > naming fields or expressions. Javascript's native Object is used as
>>>>>>> > the row type.
>>>>>>> >
>>>>>>> > * As part of being schema-first we also de-emphasize Coders as a
>>>>>>> > first-class concept in the SDK, relegating them to an advanced feature
>>>>>>> > used for interop. Though we can infer schemas from individual
>>>>>>> > elements, it is still TBD to figure out if/how we can leverage the
>>>>>>> > type system and/or function introspection to regularly infer schemas
>>>>>>> > at construction time. A fallback coder using BSON encoding is used
>>>>>>> > when we don't have sufficient type information.
>>>>>>> >
>>>>>>> > * We have added additional methods to the PCollection object, notably
>>>>>>> > `map` and `flatMap`,
>>>>>>> > [rather than only allowing apply](https://www.mail-archive.com/[email protected]/msg06035.html).
>>>>>>> > In addition, `apply` can accept a function argument
>>>>>>> > `(PCollection) => ...` as well as a PTransform subclass, which treats
>>>>>>> > this callable as if it were a PTransform's expand.
>>>>>>> >
>>>>>>> > * In the other direction, we have eliminated the
>>>>>>> > [problematic Pipeline object](https://s.apache.org/no-beam-pipeline)
>>>>>>> > from the API, instead providing a `Root` PValue on which pipelines are
>>>>>>> > built, and invoking run() on a Runner.
>>>>>>> > We offer a less error-prone `Runner.run` which
>>>>>>> > finishes only when the pipeline is completely finished as well as
>>>>>>> > `Runner.runAsync` which returns a handle to the running pipeline.
>>>>>>> >
>>>>>>> > * Rather than introduce PCollectionTuple, PCollectionList, etc. we let
>>>>>>> > PValue literally be an
>>>>>>> > [array or object with PValue values](https://github.com/robertwb/beam-javascript/blob/de4390dd767f046903ac23fead5db333290462db/sdks/node-ts/src/apache_beam/pvalue.ts#L116)
>>>>>>> > which transforms can consume or produce. These are applied by wrapping
>>>>>>> > them with the `P` operator, e.g.
>>>>>>> > `P([pc1, pc2, pc3]).apply(new Flatten())`.
>>>>>>> >
>>>>>>> > * Like Python, `flatMap` and `ParDo.process` return multiple elements
>>>>>>> > by yielding them from a generator, rather than invoking a passed-in
>>>>>>> > callback. TBD how to output to multiple distinct PCollections. There
>>>>>>> > is currently an operation to split a PCollection into multiple
>>>>>>> > PCollections based on the properties of the elements, and we may
>>>>>>> > consider using a callback for side outputs.
>>>>>>> >
>>>>>>> > * The `map`, `flatMap`, and `ParDo.process` methods take an
>>>>>>> > additional (optional) context argument, which is similar to the
>>>>>>> > keyword arguments used in Python. These can be "ordinary" javascript
>>>>>>> > objects (which are passed as is) or special DoFnParam objects which
>>>>>>> > provide getters to element-specific information (such as the current
>>>>>>> > timestamp, window, or side input) at runtime.
>>>>>>> >
>>>>>>> > * Javascript supports (and encourages) an asynchronous programming
>>>>>>> > model, with many libraries requiring use of the async/await paradigm.
>>>>>>> > As there is no way (by design) to go from the asynchronous style back
>>>>>>> > to the synchronous style, this needs to be taken into account when
>>>>>>> > designing the API. We currently offer asynchronous variants of
>>>>>>> > `PValue.apply(...)` (in addition to the synchronous ones, as they are
>>>>>>> > easier to chain) as well as making `Runner.run` asynchronous. TBD to
>>>>>>> > do this for all user callbacks as well.
>>>>>>> >
>>>>>>> > ## TODO
>>>>>>> >
>>>>>>> > This SDK is a work in progress. In January 2022 we developed the
>>>>>>> > ability to construct and run basic pipelines (including external
>>>>>>> > transforms and running on a portable runner) but the following
>>>>>>> > big-ticket items remain.
>>>>>>> >
>>>>>>> > * Containerization
>>>>>>> >
>>>>>>> >   * Function and object serialization: we currently only support
>>>>>>> >   "loopback" mode; to be able to run in a remote, distributed manner we
>>>>>>> >   need to finish up the work in pickling closures and DoFn objects. Some
>>>>>>> >   investigation has been started here, but all existing libraries have
>>>>>>> >   non-trivial drawbacks.
>>>>>>> >
>>>>>>> >   * Finish the work in building a full SDK container image that starts
>>>>>>> >   the worker.
>>>>>>> >
>>>>>>> > * External transforms
>>>>>>> >
>>>>>>> >   * Using external transforms requires that the external expansion
>>>>>>> >   service already be started and its address provided. We would like to
>>>>>>> >   automatically start it as we do in Python.
>>>>>>> >
>>>>>>> >   * Artifacts are not currently supported, which will be essential for
>>>>>>> >   using Java transforms. (All tests use Python.)
>>>>>>> >
>>>>>>> > * API
>>>>>>> >
>>>>>>> >   * Side inputs are not yet supported.
>>>>>>> >
>>>>>>> >   * There are several TODOs of minor features or design decisions to finalize.
>>>>>>> >
>>>>>>> >   * Advanced features like metrics, state, timers, and SDF.
>>>>>>> >   Possibly some of these can wait.
>>>>>>> >
>>>>>>> > * Infrastructure
>>>>>>> >
>>>>>>> >   * Gradle and Jenkins integration for tests and style enforcement.
>>>>>>> >
>>>>>>> > * Other
>>>>>>> >
>>>>>>> >   * Standardize on a way for users to pass PTransform names, and
>>>>>>> >   enforce unique names for pipeline update.
>>>>>>> >
>>>>>>> >   * Use a Javascript Object rather than proto Struct for pipeline options.
>>>>>>> >
>>>>>>> >   * Though Dataflow Runner v2 supports portability, submission is
>>>>>>> >   still done via v1beta3 and interaction with GCS rather than the job
>>>>>>> >   submission API.
>>>>>>> >
>>>>>>> >   * Properly wait for bundle completion.
>>>>>>> >
>>>>>>> > There is probably more; there are many TODOs littered throughout the code.
>>>>>>> >
>>>>>>> > This code has also not yet been fully peer reviewed (it was the result
>>>>>>> > of a hackathon) which needs to be done before putting it into the main
>>>>>>> > repository.
>>>>>>> >
>>>>>>> >
>>>>>>> > ## Development
>>>>>>> >
>>>>>>> > ### Getting started
>>>>>>> >
>>>>>>> > Install node.js, and then from within `sdks/node-ts`:
>>>>>>> >
>>>>>>> > ```
>>>>>>> > npm install
>>>>>>> > ```
>>>>>>> >
>>>>>>> > ### Running tests
>>>>>>> >
>>>>>>> > ```
>>>>>>> > $ npm test
>>>>>>> > ```
>>>>>>> >
>>>>>>> > ### Style
>>>>>>> >
>>>>>>> > We have adopted prettier which can be run with
>>>>>>> >
>>>>>>> > ```
>>>>>>> > $ npx prettier --write .
>>>>>>> > ```
>>>>>>>
>>>>>>
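
For anyone skimming the thread: the generator-based `flatMap` and the optional context argument described in the quoted README can be illustrated without the SDK at all. The following is only a minimal in-memory sketch under stated assumptions. `LocalCollection`, `splitWords`, and `minLength` are hypothetical names made up for this example; they are not the SDK's classes or signatures, and nothing here runs on a Beam runner.

```typescript
// Hypothetical in-memory stand-in for a PCollection. This is NOT the SDK's
// real class; it only mimics the shape of `map` and `flatMap`.
class LocalCollection<T> {
  constructor(private readonly elements: T[]) {}

  // One output per input. The optional `context` is handed to `fn` as is.
  map<O, C = undefined>(
    fn: (element: T, context: C) => O,
    context?: C
  ): LocalCollection<O> {
    return new LocalCollection(this.elements.map((e) => fn(e, context as C)));
  }

  // Zero or more outputs per input: `fn` returns an iterable (typically a
  // generator) and every yielded value becomes an output element.
  flatMap<O, C = undefined>(
    fn: (element: T, context: C) => Iterable<O>,
    context?: C
  ): LocalCollection<O> {
    const out: O[] = [];
    for (const e of this.elements) {
      for (const o of fn(e, context as C)) {
        out.push(o);
      }
    }
    return new LocalCollection(out);
  }

  toArray(): T[] {
    return [...this.elements];
  }
}

// A generator that yields several outputs per input instead of calling a
// passed-in callback. `ctx` is an "ordinary" object used as the context.
function* splitWords(
  line: string,
  ctx: { minLength: number }
): Generator<string> {
  for (const word of line.split(/\s+/)) {
    if (word.length >= ctx.minLength) {
      yield word;
    }
  }
}

const words = new LocalCollection(["the quick fox", "a lazy dog"])
  .flatMap(splitWords, { minLength: 3 })
  .map((w) => w.toUpperCase())
  .toArray();

console.log(words); // [ 'THE', 'QUICK', 'FOX', 'LAZY', 'DOG' ]
```

The sketch covers only the "ordinary object" case of the context argument; the DoFnParam-style getters for timestamps, windows, or side inputs mentioned in the README need runner support and are not modelled here.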
