Re: Javascript SDK

Jan Lukavský Mon, 07 Feb 2022 01:44:58 -0800

I'll add one note from a different perspective. I think that long-term we should consider having separate release cycles for core, SDKs, DSLs and runners. It feels like releasing all parts as a single "monolith" will gradually cause the core parts (e.g. model, runners-core, ...) to be more and more expensive to modify, because each modification to these core parts might affect more and more other components. Enabling all SDKs and runners to "choose" the supported SDK-core or runner-core (while encouraging them to support the most recent!) is more maintainable for the future.

I'm not saying we need to do something right now before merging the JS SDK, but on the other hand adding like 10 more SDKs would start to be an issue. We probably could talk about if (and how) we could make some sort of separation.


 Jan

On 2/4/22 18:42, Robert Burke wrote:

I imagine by the nature of the Apache 2.0 license, the quality of the code in a given release is not a given without some other statement by the maintainers. We should have clear and present warning signs. Erm. Experimental labeling.


On Fri, Feb 4, 2022 at 8:27 AM Kenneth Knowles <[email protected]> wrote:



    On Thu, Feb 3, 2022 at 3:43 PM Robert Burke <[email protected]>
    wrote:

        Personally, if it gets added to the repo at all I'd rather we
        rip off the band-aid and at least have all the tests regularly
        run, and various GitHub actions. Even if we aren't doing the
        container release activities, because it's experimental,
        that's much better than bit rot and being part of the main
        repo has a simpler contribution convention.


    100% agree about bit rot and it also makes it more accessible
    for contribution and experiment. This is a strong motivation for
    me to get it right onto master. Some contributions to branches are
    probably just unknown to a lot of contributors (or adventurous
    users), for example https://github.com/apache/beam/tree/tez-runner
    https://github.com/apache/beam/tree/jstorm-runner
    https://github.com/apache/beam/tree/mr-runner

    I'm guessing since node has a distribution via npm if we do
    nothing it is essentially "not released". I don't see it as a big
    problem having it in the archived ASF source releases, as long as
    licenses and whatnot are good, though I may be overlooking something.

    Kenn


        Those are my 2 cents.
        Robert B
        Beam Go Busybody

        On Thu, Feb 3, 2022, 3:29 PM Kenneth Knowles <[email protected]>
        wrote:

            We did the same for the Go SDK for some time. I imagine
            just "not doing the work to release it" suffices? Maybe
            +Robert Burke <mailto:[email protected]> has some other
            memories of how to not release.

            Kenn

            On Mon, Jan 31, 2022 at 1:05 PM Kerry Donny-Clark
            <[email protected]> wrote:

                This project was a great way to kickstart a new SDK.
                I'd like to bring this into Beam and start cleanup.
                Are there any steps to take before making a PR? Is
                there a way to mark this as experimental/not for release?
                Kerry

                On Mon, Jan 17, 2022 at 1:22 AM Pablo Estrada
                <[email protected]> wrote:

                    This project was fun, and I learned a lot putting
                    some time into it. I'd love for it to be brought
                    into the main repository and worked over some time
                    to be fully supported.
                    Best
                    -P.

                    On Fri, Jan 14, 2022 at 4:46 PM Ahmet Altay
                    <[email protected]> wrote:

                        Really nice! Congratulations to all who worked
                        on this project.

                        On Fri, Jan 14, 2022 at 4:41 PM Kenneth
                        Knowles <[email protected]> wrote:

                            This was super fun, and I really hope it
                            can be an inspiration to others that you
                            can build a working Beam SDK in a week!

                            (hint hint
                            https://issues.apache.org/jira/browse/BEAM-4010
                            and
                            https://issues.apache.org/jira/browse/BEAM-12658
                            :-)

                            On Fri, Jan 14, 2022 at 11:38 AM Robert
                            Bradshaw <[email protected]> wrote:

                                And, of course, an example:

                                
https://github.com/robertwb/beam-javascript/blob/javascript/sdks/node-ts/src/apache_beam/examples/wordcount.ts

                                On Fri, Jan 14, 2022 at 11:35 AM
                                Robert Bradshaw <[email protected]>
                                wrote:
                                >
                                > Last week at Google we had a
                                hackathon to kick off the new year, and
                                > one of the projects we came up with
                                was seeing how far we could get in
                                > putting together a typescript SDK.
                                Starting from nothing we were able
                                > to make a lot of progress and I
                                wanted to share the results here.
                                >
                                >
                                
https://github.com/robertwb/beam-javascript/blob/javascript/sdks/node-ts/README.md
                                >
                                > I think this is an exciting project
                                and look forward to officially
                                > supporting a new language. Clearly
                                there is still a fair amount to do,
                                > and we also need to figure out the
                                best way to get this reviewed (we'd
                                > especially welcome feedback (and
                                contributions) from those, if any, in
                                > the know about
                                javascript/typescript/node even if
                                they're not beam or
                                > distributed computing experts) and
                                into the main repository (assuming
                                > the community is as interested in
                                this as I am).
                                >
                                > The above link is a decent overview,
                                but copying below for posterity
                                > as that will likely evolve over time
                                (e.g. as decisions get made and
                                > TODOs get resolved).
                                >
                                > - Robert
                                >
                                >
                                > --------------------
                                >
                                > # Node Beam SDK
                                >
                                > This is the start of a fully
                                functioning Javascript (actually,
                                > Typescript) SDK. There are two
                                distinct aims with this SDK:
                                >
                                > 1. Tap into the large (and
                                relatively underserved, by existing data
                                > processing frameworks) community of
                                javascript developers with a
                                > native SDK targeting this language.
                                >
                                > 1. Develop a new SDK which can serve
                                both as a proof of concept and
                                > reference that highlights the
                                (relative) ease of porting Beam to new
                                > languages, a differentiating feature
                                of Beam and Dataflow.
                                >
                                > To accomplish this, we lean heavily
                                on the portability framework. For
                                > example, we make heavy use of
                                cross-language transforms, in particular
                                > for IOs (as a full SDF
                                implementation may not fit into the
                                week). In
                                > addition, the direct runner is
                                simply an extension of the worker
                                > suitable for running on portable
                                runners such as the ULR, which will
                                > directly transfer to running on
                                production runners such as Dataflow
                                > and Flink. The target audience
                                should hopefully not be put off by
                                > running other language code
                                encapsulated in docker images.
                                >
                                > ## API
                                >
                                > We generally try to apply the
                                concepts from the Beam API in a
                                > Typescript idiomatic way, but it
                                should be noted that few of the
                                > initial developers have extensive
                                (if any) Javascript/Typescript
                                > development experience, so feedback
                                is greatly appreciated.
                                >
                                > In addition, some notable departures
                                are taken from the traditional SDKs:
                                >
                                > * We take a "relational foundations"
                                approach, where [schema'd
                                >
                                
data](https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#heading=h.puuotbien1gf)
                                > is the primary way to interact with
                                data, and we generally eschew the
                                > key-value requiring transforms in
                                favor of a more flexible approach
                                > naming fields or expressions.
                                Javascript's native Object is used as
                                > the row type.
                                >
                                > * As part of being schema-first we
                                also de-emphasize Coders as a
                                > first-class concept in the SDK,
                                relegating it to an advanced feature
                                > used for interop. Though we can
                                infer schemas from individual
                                > elements, it is still TBD to
                                > figure out if/how we can leverage
                                the type system and/or function
                                > introspection to regularly infer
                                schemas at construction time. A
                                > fallback coder using BSON encoding
                                is used when we don't have
                                > sufficient type information.
                                >
                                > * We have added additional methods
                                to the PCollection object, notably
                                > `map` and `flatMap`, [rather than
                                only allowing
                                >
                                
apply](https://www.mail-archive.com/[email protected]/msg06035.html).
                                > In addition, `apply` can accept a
                                function argument `(PCollection) =>
                                > ...` as well as a PTransform
                                subclass, which treats this callable as
                                > if it were a PTransform's expand.
                                >
                                > * In the other direction, we have
                                eliminated the [problematic Pipeline
                                >
                                object](https://s.apache.org/no-beam-pipeline)
                                from the API, instead
                                > providing a `Root` PValue on which
                                pipelines are built, and invoking
                                > run() on a Runner.  We offer a less
                                error-prone `Runner.run` which
                                > finishes only when the pipeline is
                                completely finished as well as
                                > `Runner.runAsync` which returns a
                                handle to the running pipeline.
                                >
                                > * Rather than introduce
                                PCollectionTuple, PCollectionList,
                                etc. we let
                                > PValue literally be an [array or
                                object with PValue
                                >
                                
values](https://github.com/robertwb/beam-javascript/blob/de4390dd767f046903ac23fead5db333290462db/sdks/node-ts/src/apache_beam/pvalue.ts#L116)
                                > which transforms can consume or
                                produce. These are applied by wrapping
                                > them with the `P` operator, e.g.
                                `P([pc1, pc2, pc3]).apply(new
                                > Flatten())`.
                                >
                                > * Like Python, `flatMap` and
                                `ParDo.process` return multiple elements
                                > by yielding them from a generator,
                                rather than invoking a passed-in
                                > callback. TBD how to output to
                                multiple distinct PCollections. There
                                > is currently an operation to split a
                                PCollection into multiple
                                > PCollections based on the properties
                                of the elements, and we may
                                > consider using a callback for side
                                outputs.
                                >
                                > * The `map`, `flatMap`, and
                                `ParDo.process` methods take an
                                > additional (optional) context
                                argument, which is similar to the
                                > keyword arguments used in Python.
                                These can be "ordinary" javascript
                                > objects (which are passed as is) or
                                special DoFnParam objects which
                                > provide getters to element-specific
                                information (such as the current
                                > timestamp, window, or side input) at
                                runtime.
                                >
                                > * Javascript supports (and
                                encourages) an asynchronous programming
                                > model, with many libraries requiring
                                use of the async/await paradigm.
                                > As there is no way (by design) to go
                                from the asynchronous style back
                                > to the synchronous style, this needs
                                to be taken into account when
                                > designing the API. We currently
                                offer asynchronous variants of
                                > `PValue.apply(...)` (in addition to
                                the synchronous ones, as they are
                                > easier to chain) as well as making
                                `Runner.run` asynchronous. TBD to
                                > do this for all user callbacks as well.
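
A minimal sketch of the `run` versus `runAsync` distinction described in the list above, written in plain TypeScript without the SDK. `ToyRunner` and `PipelineHandle` are hypothetical names invented for illustration, and the "pipeline" is just an async function; the real `Runner` interface may look different.

```typescript
// Hypothetical handle type; the real SDK's handle may look different.
interface PipelineHandle {
  waitUntilFinish(): Promise<string>;
}

// Hypothetical toy runner, not the SDK's Runner class.
class ToyRunner {
  // Starts the pipeline and resolves to a handle without waiting for
  // the pipeline to finish.
  async runAsync(pipeline: () => Promise<void>): Promise<PipelineHandle> {
    const done = pipeline().then(() => "DONE");
    return { waitUntilFinish: () => done };
  }

  // Resolves only once the pipeline has completely finished.
  async run(pipeline: () => Promise<void>): Promise<string> {
    const handle = await this.runAsync(pipeline);
    return handle.waitUntilFinish();
  }
}

const log: string[] = [];
const finished = new ToyRunner()
  .run(async () => {
    log.push("pipeline body ran");
  })
  .then((state) => {
    log.push(state);
    console.log(log);
    return log;
  });
```

Because `run` is itself asynchronous, callers have to `await` it or chain `.then`, which is the one-way street from asynchronous to synchronous style that the last bullet mentions.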
                                >
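
To make the generator-versus-callback point from the list above concrete, here is a minimal sketch in plain TypeScript. It does not use the SDK: `flatMapElements` and `splitWords` are hypothetical names, and an in-memory array stands in for a PCollection.

```typescript
// Hypothetical stand-in for the SDK's flatMap: applies a generator function to
// each element and collects everything it yields. Not the real SDK API.
function flatMapElements<In, Out>(
  elements: In[],
  fn: (element: In) => Iterable<Out>
): Out[] {
  const out: Out[] = [];
  for (const element of elements) {
    for (const result of fn(element)) {
      out.push(result);
    }
  }
  return out;
}

// A user function that emits zero or more outputs per input by yielding them,
// instead of calling a passed-in emit callback.
function* splitWords(line: string): Generator<string> {
  for (const word of line.split(/\s+/)) {
    if (word.length > 0) {
      yield word;
    }
  }
}

const words = flatMapElements(["to be or", "", "not to be"], splitWords);
console.log(words);
```

The empty input line yields nothing, which shows how a generator can naturally produce zero outputs for an element.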
                                > ## TODO
                                >
                                > This SDK is a work in progress. In
                                January 2022 we developed the
                                > ability to construct and run basic
                                pipelines (including external
                                > transforms and running on a portable
                                runner) but the following
                                > big-ticket items remain.
                                >
                                > * Containerization
                                >
                                >   * Function and object
                                serialization: we currently only support
                                > "loopback" mode; to be able to run
                                in a remote, distributed manner we
                                > need to finish up the work in
                                pickling closures and DoFn objects. Some
                                > investigation has been started here,
                                but all existing libraries have
                                > non-trivial drawbacks.
                                >
                                >   * Finish the work in building a
                                full SDK container image that starts
                                > the worker.
                                >
                                > * External transforms
                                >
                                >   * Using external transforms
                                requires that the external expansion
                                > service already be started and its
                                address provided.  We would like to
                                > automatically start it as we do in
                                Python.
                                >
                                >   * Artifacts are not currently
                                supported, which will be essential for
                                > using Java transforms. (All tests
                                use Python.)
                                >
                                > * API
                                >
                                >   * Side inputs are not yet supported.
                                >
                                >   * There are several TODOs of minor
                                features or design decisions to finalize.
                                >
                                >   * Advanced features like metrics,
                                state, timers, and SDF. Possibly
                                > some of these can wait.
                                >
                                > * Infrastructure
                                >
                                >   * Gradle and Jenkins integration
                                for tests and style enforcement.
                                >
                                > * Other
                                >
                                >   * Standardize on a way for users
                                to pass PTransform names, and
                                > enforce unique names for pipeline
                                update.
                                >
                                >   * Use a Javascript Object rather
                                than proto Struct for pipeline options.
                                >
                                >   * Though Dataflow Runner v2
                                supports portability, submission is
                                > still done via v1beta3 and
                                interaction with GCS rather than the job
                                > submission API.
                                >
                                >   * Properly wait for bundle completion.
                                >
                                > There is probably more; there are
                                many TODOs littered throughout the code.
                                >
                                > This code has also not yet been
                                fully peer reviewed (it was the result
                                > of a hackathon) which needs to be
                                done before putting it into the main
                                > repository.
                                >
                                >
                                > ## Development
                                >
                                > ### Getting started
                                >
                                > Install node.js, and then from
                                within `sdks/node-ts`, run:
                                >
                                > ```
                                > npm install
                                > ```
                                >
                                > ### Running tests
                                >
                                > ```
                                > $  npm test
                                > ```
                                >
                                > ### Style
                                >
                                > We have adopted prettier which can
                                be run with
                                >
                                > ```
                                > $  npx prettier --write .
                                > ```
