Re: Javascript SDK

Kenneth Knowles Thu, 03 Feb 2022 15:29:45 -0800

We did the same for the Go SDK for some time. I imagine just "not doing the
work to release it" suffices? Maybe +Robert Burke <[email protected]> has
some other memories of how to not release.


Kenn

On Mon, Jan 31, 2022 at 1:05 PM Kerry Donny-Clark <[email protected]>
wrote:

> This project was a great way to kickstart a new SDK. I'd like to bring
> this into Beam and start cleanup. Are there any steps to take before making
> a PR? Is there a way to mark this as experimental/not for release?
> Kerry
>
> On Mon, Jan 17, 2022 at 1:22 AM Pablo Estrada <[email protected]> wrote:
>
>> This project was fun, and I learned a lot putting some time into it. I'd
>> love for it to be brought into the main repository and worked over some
>> time to be fully supported.
>> Best
>> -P.
>>
>> On Fri, Jan 14, 2022 at 4:46 PM Ahmet Altay <[email protected]> wrote:
>>
>>> Really nice! Congratulations to all who worked on this project.
>>>
>>> On Fri, Jan 14, 2022 at 4:41 PM Kenneth Knowles <[email protected]> wrote:
>>>
>>>> This was super fun, and I really hope it can be an inspiration to
>>>> others that you can build a working Beam SDK in a week!
>>>>
>>>> (hint hint https://issues.apache.org/jira/browse/BEAM-4010 and
>>>> https://issues.apache.org/jira/browse/BEAM-12658 :-)
>>>>
>>>> On Fri, Jan 14, 2022 at 11:38 AM Robert Bradshaw <[email protected]>
>>>> wrote:
>>>>
>>>>> And, of course, an example:
>>>>>
>>>>>
>>>>> https://github.com/robertwb/beam-javascript/blob/javascript/sdks/node-ts/src/apache_beam/examples/wordcount.ts
>>>>>
>>>>> On Fri, Jan 14, 2022 at 11:35 AM Robert Bradshaw <[email protected]>
>>>>> wrote:
>>>>> >
>>>>> > Last week at Google we had a hackathon to kick off the new year, and
>>>>> > one of the projects we came up with was seeing how far we could get in
>>>>> > putting together a typescript SDK. Starting from nothing we were able
>>>>> > to make a lot of progress and I wanted to share the results here.
>>>>> >
>>>>> >
>>>>> https://github.com/robertwb/beam-javascript/blob/javascript/sdks/node-ts/README.md
>>>>> >
>>>>> > I think this is an exciting project and look forward to officially
>>>>> > supporting a new language. Clearly there is still a fair amount to do,
>>>>> > and we also need to figure out the best way to get this reviewed (we'd
>>>>> > especially welcome feedback (and contributions) from those, if any, in
>>>>> > the know about javascript/typescript/node even if they're not beam or
>>>>> > distributed computing experts) and into the main repository (assuming
>>>>> > the community is as interested in this as I am).
>>>>> >
>>>>> > The above link is a decent overview, but copying below for posterity
>>>>> > as that will likely evolve over time (e.g. as decisions get made and
>>>>> > TODOs get resolved).
>>>>> >
>>>>> > - Robert
>>>>> >
>>>>> >
>>>>> > --------------------
>>>>> >
>>>>> > # Node Beam SDK
>>>>> >
>>>>> > This is the start of a fully functioning Javascript (actually,
>>>>> > Typescript) SDK. There are two distinct aims with this SDK:
>>>>> >
>>>>> > 1. Tap into the large (and relatively underserved, by existing data
>>>>> > processing frameworks) community of javascript developers with a
>>>>> > native SDK targeting this language.
>>>>> >
>>>>> > 1. Develop a new SDK which can serve both as a proof of concept and
>>>>> > reference that highlights the (relative) ease of porting Beam to new
>>>>> > languages, a differentiating feature of Beam and Dataflow.
>>>>> >
>>>>> > To accomplish this, we lean heavily on the portability framework. For
>>>>> > example, we make heavy use of cross-language transforms, in particular
>>>>> > for IOs (as a full SDF implementation may not fit into the week). In
>>>>> > addition, the direct runner is simply an extension of the worker
>>>>> > suitable for running on portable runners such as the ULR, which will
>>>>> > directly transfer to running on production runners such as Dataflow
>>>>> > and Flink. The target audience should hopefully not be put off by
>>>>> > running other language code encapsulated in docker images.
>>>>> >
>>>>> > ## API
>>>>> >
>>>>> > We generally try to apply the concepts from the Beam API in a
>>>>> > Typescript idiomatic way, but it should be noted that few of the
>>>>> > initial developers have extensive (if any) Javascript/Typescript
>>>>> > development experience, so feedback is greatly appreciated.
>>>>> >
>>>>> > In addition, some notable departures are taken from the traditional SDKs:
>>>>> >
>>>>> > * We take a "relational foundations" approach, where [schema'd
>>>>> > data](https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#heading=h.puuotbien1gf)
>>>>> > is the primary way to interact with data, and we generally eschew the
>>>>> > key-value requiring transforms in favor of a more flexible approach
>>>>> > naming fields or expressions. Javascript's native Object is used as
>>>>> > the row type.
>>>>> >
>>>>> > * As part of being schema-first we also de-emphasize Coders as a
>>>>> > first-class concept in the SDK, relegating it to an advanced feature
>>>>> > used for interop. Though we can infer schemas from individual
>>>>> > elements, it is still TBD to
>>>>> > figure out if/how we can leverage the type system and/or function
>>>>> > introspection to regularly infer schemas at construction time. A
>>>>> > fallback coder using BSON encoding is used when we don't have
>>>>> > sufficient type information.
>>>>> >
>>>>> > * We have added additional methods to the PCollection object, notably
>>>>> > `map` and `flatMap`, [rather than only allowing
>>>>> > apply](https://www.mail-archive.com/[email protected]/msg06035.html).
>>>>> > In addition, `apply` can accept a function argument `(PCollection) =>
>>>>> > ...` as well as a PTransform subclass, in which case the callable is
>>>>> > treated as if it were a PTransform's expand.
>>>>> >
>>>>> > * In the other direction, we have eliminated the [problematic Pipeline
>>>>> > object](https://s.apache.org/no-beam-pipeline) from the API, instead
>>>>> > providing a `Root` PValue on which pipelines are built, and invoking
>>>>> > run() on a Runner.  We offer a less error-prone `Runner.run` which
>>>>> > finishes only when the pipeline is completely finished as well as
>>>>> > `Runner.runAsync` which returns a handle to the running pipeline.
>>>>> >
>>>>> > * Rather than introduce PCollectionTuple, PCollectionList, etc. we let
>>>>> > PValue literally be an [array or object with PValue
>>>>> > values](https://github.com/robertwb/beam-javascript/blob/de4390dd767f046903ac23fead5db333290462db/sdks/node-ts/src/apache_beam/pvalue.ts#L116)
>>>>> > which transforms can consume or produce. These are applied by wrapping
>>>>> > them with the `P` operator, e.g. `P([pc1, pc2, pc3]).apply(new
>>>>> > Flatten())`.
>>>>> >
>>>>> > * Like Python, `flatMap` and `ParDo.process` return multiple elements
>>>>> > by yielding them from a generator, rather than invoking a passed-in
>>>>> > callback. TBD how to output to multiple distinct PCollections. There
>>>>> > is currently an operation to split a PCollection into multiple
>>>>> > PCollections based on the properties of the elements, and we may
>>>>> > consider using a callback for side outputs.
>>>>> >
>>>>> > * The `map`, `flatMap`, and `ParDo.process` methods take an
>>>>> > additional (optional) context argument, which is similar to the
>>>>> > keyword arguments used in Python. These can be "ordinary" javascript
>>>>> > objects (which are passed as is) or special DoFnParam objects which
>>>>> > provide getters to element-specific information (such as the current
>>>>> > timestamp, window, or side input) at runtime.
>>>>> >
>>>>> > * Javascript supports (and encourages) an asynchronous programming
>>>>> > model, with many libraries requiring use of the async/await paradigm.
>>>>> > As there is no way (by design) to go from the asynchronous style back
>>>>> > to the synchronous style, this needs to be taken into account when
>>>>> > designing the API. We currently offer asynchronous variants of
>>>>> > `PValue.apply(...)` (in addition to the synchronous ones, as they are
>>>>> > easier to chain) as well as making `Runner.run` asynchronous. TBD to
>>>>> > do this for all user callbacks as well.
>>>>> >
>>>>> > ## TODO
>>>>> >
>>>>> > This SDK is a work in progress. In January 2022 we developed the
>>>>> > ability to construct and run basic pipelines (including external
>>>>> > transforms and running on a portable runner) but the following
>>>>> > big-ticket items remain.
>>>>> >
>>>>> > * Containerization
>>>>> >
>>>>> >   * Function and object serialization: we currently only support
>>>>> > "loopback" mode; to be able to run in a remote, distributed manner we
>>>>> > need to finish up the work in pickling closures and DoFn objects. Some
>>>>> > investigation has been started here, but all existing libraries have
>>>>> > non-trivial drawbacks.
>>>>> >
>>>>> >   * Finish the work in building a full SDK container image that starts
>>>>> > the worker.
>>>>> >
>>>>> > * External transforms
>>>>> >
>>>>> >   * Using external transforms requires that the external expansion
>>>>> > service already be started and its address provided. We would like to
>>>>> > automatically start it as we do in Python.
>>>>> >
>>>>> >   * Artifacts are not currently supported, which will be essential for
>>>>> > using Java transforms. (All tests use Python.)
>>>>> >
>>>>> > * API
>>>>> >
>>>>> >   * Side inputs are not yet supported.
>>>>> >
>>>>> >   * There are several TODOs of minor features or design decisions to
>>>>> > finalize.
>>>>> >
>>>>> >   * Advanced features like metrics, state, timers, and SDF. Possibly
>>>>> > some of these can wait.
>>>>> >
>>>>> > * Infrastructure
>>>>> >
>>>>> >   * Gradle and Jenkins integration for tests and style enforcement.
>>>>> >
>>>>> > * Other
>>>>> >
>>>>> >   * Standardize on a way for users to pass PTransform names, and
>>>>> > enforce unique names for pipeline update.
>>>>> >
>>>>> >   * Use a Javascript Object rather than proto Struct for pipeline
>>>>> > options.
>>>>> >
>>>>> >   * Though Dataflow Runner v2 supports portability, submission is
>>>>> > still done via v1beta3 and interaction with GCS rather than the job
>>>>> > submission API.
>>>>> >
>>>>> >   * Properly wait for bundle completion.
>>>>> >
>>>>> > There is probably more; there are many TODOs littered throughout the
>>>>> > code.
>>>>> >
>>>>> > This code has also not yet been fully peer reviewed (it was the result
>>>>> > of a hackathon) which needs to be done before putting it into the main
>>>>> > repository.
>>>>> >
>>>>> >
>>>>> > ## Development
>>>>> >
>>>>> > ### Getting started
>>>>> >
>>>>> > Install node.js, and then run the following from within `sdks/node-ts`:
>>>>> >
>>>>> > ```
>>>>> > npm install
>>>>> > ```
>>>>> >
>>>>> > ### Running tests
>>>>> >
>>>>> > ```
>>>>> > npm test
>>>>> > ```
>>>>> >
>>>>> > ### Style
>>>>> >
>>>>> > We have adopted prettier which can be run with
>>>>> >
>>>>> > ```
>>>>> > npx prettier --write .
>>>>> > ```
>>>>>
>>>>
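
For anyone skimming the quoted README: the generator-based `flatMap`, the function-accepting `apply`, and the use of plain objects as rows can be illustrated with a toy, in-memory sketch. This is not the SDK's implementation or its real API. `ToyPCollection`, `splitWords`, and `countWords` are hypothetical names made up for illustration, and nothing here is distributed, serialized, or run on a runner.

```typescript
// Toy, in-memory stand-in for a PCollection. NOT the SDK's real class:
// every name in this sketch is hypothetical and only illustrates the ideas.
class ToyPCollection<T> {
  constructor(private readonly elements: T[]) {}

  // Exactly one output element per input element.
  map<U>(fn: (element: T) => U): ToyPCollection<U> {
    return new ToyPCollection(this.elements.map(fn));
  }

  // Zero or more outputs per input, yielded from a generator (or any
  // iterable) rather than emitted through a passed-in callback.
  flatMap<U>(fn: (element: T) => Iterable<U>): ToyPCollection<U> {
    const out: U[] = [];
    for (const element of this.elements) {
      for (const result of fn(element)) {
        out.push(result);
      }
    }
    return new ToyPCollection(out);
  }

  // Accepts a plain function and treats it like a transform's expand().
  apply<R>(expand: (input: ToyPCollection<T>) => R): R {
    return expand(this);
  }

  toArray(): T[] {
    return [...this.elements];
  }
}

// A generator that yields each lowercased word of a line.
function* splitWords(line: string): Generator<string> {
  for (const word of line.split(/\s+/)) {
    if (word.length > 0) {
      yield word.toLowerCase();
    }
  }
}

// A composite "transform" written as an ordinary function. Rows are plain
// objects with named fields instead of key-value pairs.
function countWords(
  lines: ToyPCollection<string>
): ToyPCollection<{ word: string; count: number }> {
  const counts = new Map<string, number>();
  for (const word of lines.flatMap(splitWords).toArray()) {
    counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  return new ToyPCollection(
    Array.from(counts.entries()).map(([word, count]) => ({ word, count }))
  );
}

const result = new ToyPCollection(["to be or", "not to be"])
  .apply(countWords)
  .toArray();
```

Here `result` holds one row per distinct word, with `word` and `count` fields. The counting is done eagerly in a `Map` only because this is a local toy; a real pipeline would express it as a grouping or combining transform.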
