Re: Javascript SDK

Austin Bennett Wed, 04 May 2022 17:42:44 -0700

+1 -- had started playing with this a couple weeks ago, is really shaping
up!



Some questions about docs, and making [ developing any language ] more
approachable -->

I wonder whether we have learned enough from this for a guide of sorts for
future language development.  Perhaps, since it's still fresh, now would be a
good time to ensure things are noted.

I'm thinking of some helpful docs to include somewhere ( if they don't already
exist ), specifically aimed at making it more approachable for someone to
consider starting to work on another language ( ex: I'm thinking of Dart,
since I've been writing a lot of that lately :-), though there are plenty of
candidate languages ):
* What were the tricky points ( language specific? or model specific? )?
* What would be needed to be considered a real MVP?  Ex: could be
considered suggestions - rather than requirements.  Ex: suggested to start
with XXX, and YYY can be more tricky [ at least in ZZZ context ] so
potentially save that for later.
* I'd also like to get a sense of the minimum level of features needed for
something to be accepted into main.  Ex: this one is easier since it was
led by a bunch of well-known-and-core-community members.  But, somewhat
outlining a process would potentially be helpful for people to see there is
a route to making something happen.
* etc...



Also, at what point do we think things marked @Experimental should get on
the website?  I'm thinking about getting on the sdks/language page --
https://beam.apache.org/documentation/sdks/python/  Naturally, that is a
function of when someone is willing to do the work, but I also don't know
whether we'd want to overly highlight something that is still
rather-early/experimental on the general website.










On Wed, May 4, 2022 at 3:43 PM Robert Burke <[email protected]> wrote:

> +1 (non-PMC)
>
> On Wed, May 4, 2022, 3:37 PM Ahmet Altay <[email protected]> wrote:
>
>> Thank you!
>>
>> On Wed, May 4, 2022 at 4:22 PM Sachin Agarwal <[email protected]>
>> wrote:
>>
>>> Wow - great work y'all!
>>>
>>> On Wed, May 4, 2022 at 3:21 PM Robert Bradshaw <[email protected]>
>>> wrote:
>>>
>>>> The entire SDK has now been reviewed and all outstanding issues
>>>> addressed. https://github.com/apache/beam/pull/17341 (Big shout out to
>>>> Danny McCormick for his tireless work here!) This does not mean the
>>>> SDK is done, but it's marked as experimental (and isolated) and IMHO
>>>> to a point where we can continue to iterate on the main branch similar
>>>> to how we do our other development.
>>>>
>>>> Any objections or other thoughts on merging?
>>>>
>>>
>> +1 to merging to the main branch.
>>
>>
>>>
>>>> On Mon, Feb 7, 2022 at 9:21 AM Robert Bradshaw <[email protected]>
>>>> wrote:
>>>> >
>>>> > +1 to separating things out if bundling them together becomes too
>>>> > burdensome, though I agree we're not at that point yet (and there is a
>>>> > non-trivial amount of overhead in just doing a release--speaking of
>>>> > which I encourage everyone to look at and vote on the pending RC).
>>>> >
>>>> > That being said, the portability API, and the ability to evolve it in
>>>> > a backwards compatible way with capabilities and requirements, makes
>>>> > it easy to evolve each SDK and Runner independently and not have to
>>>> > worry about which subset of the cross product is actually supported.
>>>> >
>>>> > On Mon, Feb 7, 2022 at 1:44 AM Jan Lukavský <[email protected]> wrote:
>>>> > >
>>>> > > I'll add one note from a different perspective. I think that
>>>> long-term we should consider having separate release cycles for core, SDKs,
>>>> DSLs and runners. It feels like releasing all parts as a single "monolith" will
>>>> gradually cause the core parts (e.g. model, runners-core, ...) to be more
>>>> and more expensive to modify, because each modification to these core
>>>> parts might affect more and more other components. Enabling all SDKs and
>>>> runners to "choose" the supported SDK-core or runner-core (while
>>>> encouraging them to support the most recent!) is more maintainable for the
>>>> future.
>>>> > >
>>>> > > I'm not saying we need to do something right now before merging the
>>>> JS SDK, but on the other hand adding like 10 more SDKs would start to be an
>>>> issue. We probably could talk about if (and how) we could make some sort of
>>>> separation.
>>>> > >
>>>> > >  Jan
>>>> > >
>>>> > > On 2/4/22 18:42, Robert Burke wrote:
>>>> > >
>>>> > > I imagine by the nature of the Apache 2.0 license, the quality of
>>>> the code in a given release is not a given without some other statement by
>>>> the maintainers. We should have clear and present warning signs. Erm.
>>>> Experimental labeling.
>>>> > >
>>>> > > On Fri, Feb 4, 2022 at 8:27 AM Kenneth Knowles <[email protected]>
>>>> wrote:
>>>> > >>
>>>> > >>
>>>> > >>
>>>> > >> On Thu, Feb 3, 2022 at 3:43 PM Robert Burke <[email protected]>
>>>> wrote:
>>>> > >>>
>>>> > >>> Personally, if it gets added to the repo at all I'd rather we rip
>>>> off the band-aid and at least have all the tests regularly run, and various
>>>> GitHub actions. Even if we aren't doing the container release activities,
>>>> because it's experimental, that's much better than bit rot, and being part
>>>> of the main repo has a simpler contribution convention.
>>>> > >>
>>>> > >>
>>>> > >> 100% agree about bit rot and it also makes it more accessible for
>>>> contribution and experiment. This is a strong motivation for me to get it
>>>> right onto master. Some contributions to branches are probably just unknown
>>>> to a lot of contributors (or adventurous users), for example
>>>> https://github.com/apache/beam/tree/tez-runner
>>>> https://github.com/apache/beam/tree/jstorm-runner
>>>> https://github.com/apache/beam/tree/mr-runner
>>>> > >>
>>>> > >> I'm guessing since node has a distribution via npm if we do
>>>> nothing it is essentially "not released". I don't see it as a big problem
>>>> having it in the archived ASF source releases, as long as licenses and
>>>> whatnot are good, though I may be overlooking something.
>>>> > >>
>>>> > >> Kenn
>>>> > >>
>>>> > >>>
>>>> > >>>
>>>> > >>> Those are my 2 cents.
>>>> > >>> Robert B
>>>> > >>> Beam Go Busybody
>>>> > >>>
>>>> > >>> On Thu, Feb 3, 2022, 3:29 PM Kenneth Knowles <[email protected]>
>>>> wrote:
>>>> > >>>>
>>>> > >>>> We did the same for the Go SDK for some time. I imagine just
>>>> "not doing the work to release it" suffices? Maybe +Robert Burke has some
>>>> other memories of how to not release.
>>>> > >>>>
>>>> > >>>> Kenn
>>>> > >>>>
>>>> > >>>> On Mon, Jan 31, 2022 at 1:05 PM Kerry Donny-Clark <
>>>> [email protected]> wrote:
>>>> > >>>>>
>>>> > >>>>> This project was a great way to kickstart a new SDK. I'd like
>>>> to bring this into Beam and start cleanup. Are there any steps to take
>>>> before making a PR? Is there a way to mark this as experimental/not for
>>>> release?
>>>> > >>>>> Kerry
>>>> > >>>>>
>>>> > >>>>> On Mon, Jan 17, 2022 at 1:22 AM Pablo Estrada <
>>>> [email protected]> wrote:
>>>> > >>>>>>
>>>> > >>>>>> This project was fun, and I learned a lot putting some time
>>>> into it. I'd love for it to be brought into the main repository and worked
>>>> over some time to be fully supported.
>>>> > >>>>>> Best
>>>> > >>>>>> -P.
>>>> > >>>>>>
>>>> > >>>>>> On Fri, Jan 14, 2022 at 4:46 PM Ahmet Altay <[email protected]>
>>>> wrote:
>>>> > >>>>>>>
>>>> > >>>>>>> Really nice! Congratulations to all who worked on this
>>>> project.
>>>> > >>>>>>>
>>>> > >>>>>>> On Fri, Jan 14, 2022 at 4:41 PM Kenneth Knowles <
>>>> [email protected]> wrote:
>>>> > >>>>>>>>
>>>> > >>>>>>>> This was super fun, and I really hope it can be an
>>>> inspiration to others that you can build a working Beam SDK in a week!
>>>> > >>>>>>>>
>>>> > >>>>>>>> (hint hint https://issues.apache.org/jira/browse/BEAM-4010
>>>> and https://issues.apache.org/jira/browse/BEAM-12658 :-)
>>>> > >>>>>>>>
>>>> > >>>>>>>> On Fri, Jan 14, 2022 at 11:38 AM Robert Bradshaw <
>>>> [email protected]> wrote:
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> And, of course, an example:
>>>> > >>>>>>>>>
>>>> > >>>>>>>>>
>>>> https://github.com/robertwb/beam-javascript/blob/javascript/sdks/node-ts/src/apache_beam/examples/wordcount.ts
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> On Fri, Jan 14, 2022 at 11:35 AM Robert Bradshaw <
>>>> [email protected]> wrote:
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > Last week at Google we had a hackathon to kick off the
>>>> new year, and
>>>> > >>>>>>>>> > one of the projects we came up with was seeing how far we
>>>> could get in
>>>> > >>>>>>>>> > putting together a typescript SDK. Starting from nothing
>>>> we were able
>>>> > >>>>>>>>> > to make a lot of progress and I wanted to share the
>>>> results here.
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> >
>>>> https://github.com/robertwb/beam-javascript/blob/javascript/sdks/node-ts/README.md
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > I think this is an exciting project and look forward to
>>>> officially
>>>> > >>>>>>>>> > supporting a new language. Clearly there is still a fair
>>>> amount to do,
>>>> > >>>>>>>>> > and we also need to figure out the best way to get this
>>>> reviewed (we'd
>>>> > >>>>>>>>> > especially welcome feedback (and contributions) from
>>>> those, if any, in
>>>> > >>>>>>>>> > the know about javascript/typescript/node even if they're
>>>> not beam or
>>>> > >>>>>>>>> > distributed computing experts) and into the main
>>>> repository (assuming
>>>> > >>>>>>>>> > the community is as interested in this as I am).
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > The above link is a decent overview, but copying below
>>>> for posterity
>>>> > >>>>>>>>> > as that will likely evolve over time (e.g. as decisions
>>>> get made and
>>>> > >>>>>>>>> > TODOs get resolved).
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > - Robert
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > --------------------
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > # Node Beam SDK
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > This is the start of a fully functioning Javascript
>>>> (actually,
>>>> > >>>>>>>>> > Typescript) SDK. There are two distinct aims with this SDK:
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > 1. Tap into the large (and relatively underserved, by
>>>> existing data
>>>> > >>>>>>>>> > processing frameworks) community of javascript developers
>>>> with a
>>>> > >>>>>>>>> > native SDK targeting this language.
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > 2. Develop a new SDK which can serve both as a proof of
>>>> concept and
>>>> > >>>>>>>>> > reference that highlights the (relative) ease of porting
>>>> Beam to new
>>>> > >>>>>>>>> > languages, a differentiating feature of Beam and Dataflow.
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > To accomplish this, we lean heavily on the portability
>>>> framework. For
>>>> > >>>>>>>>> > example, we make heavy use of cross-language transforms,
>>>> in particular
>>>> > >>>>>>>>> > for IOs (as a full SDF implementation may not fit into
>>>> the week). In
>>>> > >>>>>>>>> > addition, the direct runner is simply an extension of the
>>>> worker
>>>> > >>>>>>>>> > suitable for running on portable runners such as the ULR,
>>>> which will
>>>> > >>>>>>>>> > directly transfer to running on production runners such
>>>> as Dataflow
>>>> > >>>>>>>>> > and Flink. The target audience should hopefully not be
>>>> put off by
>>>> > >>>>>>>>> > running other language code encapsulated in docker images.
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > ## API
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > We generally try to apply the concepts from the Beam API
>>>> in a
>>>> > >>>>>>>>> > Typescript idiomatic way, but it should be noted that few
>>>> of the
>>>> > >>>>>>>>> > initial developers have extensive (if any)
>>>> Javascript/Typescript
>>>> > >>>>>>>>> > development experience, so feedback is greatly
>>>> appreciated.
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > In addition, some notable departures are taken from the
>>>> traditional SDKs:
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > * We take a "relational foundations" approach, where
>>>> [schema'd
>>>> > >>>>>>>>> > data](
>>>> https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#heading=h.puuotbien1gf
>>>> )
>>>> > >>>>>>>>> > is the primary way to interact with data, and we
>>>> generally eschew the
>>>> > >>>>>>>>> > key-value requiring transforms in favor of a more
>>>> flexible approach
>>>> > >>>>>>>>> > naming fields or expressions. Javascript's native Object
>>>> is used as
>>>> > >>>>>>>>> > the row type.
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > * As part of being schema-first we also de-emphasize
>>>> Coders as a
>>>> > >>>>>>>>> > first-class concept in the SDK, relegating it to an
>>>> advanced feature
>>>> > >>>>>>>>> > used for interop. Though we can infer schemas from
>>>> individual
>>>> > >>>>>>>>> > elements, it is still TBD to
>>>> > >>>>>>>>> > figure out if/how we can leverage the type system and/or
>>>> function
>>>> > >>>>>>>>> > introspection to regularly infer schemas at construction
>>>> time. A
>>>> > >>>>>>>>> > fallback coder using BSON encoding is used when we don't
>>>> have
>>>> > >>>>>>>>> > sufficient type information.
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > * We have added additional methods to the PCollection
>>>> object, notably
>>>> > >>>>>>>>> > `map` and `flatMap`, [rather than only allowing
>>>> > >>>>>>>>> > apply](
>>>> https://www.mail-archive.com/[email protected]/msg06035.html).
>>>> > >>>>>>>>> > In addition, `apply` can accept a function argument
>>>> `(PCollection) =>
>>>> > >>>>>>>>> > ...` as well as a PTransform subclass, which treats this
>>>> callable as
>>>> > >>>>>>>>> > if it were a PTransform's expand.
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > * In the other direction, we have eliminated the
>>>> [problematic Pipeline
>>>> > >>>>>>>>> > object](https://s.apache.org/no-beam-pipeline) from the
>>>> API, instead
>>>> > >>>>>>>>> > providing a `Root` PValue on which pipelines are built,
>>>> and invoking
>>>> > >>>>>>>>> > run() on a Runner.  We offer a less error-prone
>>>> `Runner.run` which
>>>> > >>>>>>>>> > finishes only when the pipeline is completely finished as
>>>> well as
>>>> > >>>>>>>>> > `Runner.runAsync` which returns a handle to the running
>>>> pipeline.
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > * Rather than introduce PCollectionTuple,
>>>> PCollectionList, etc. we let
>>>> > >>>>>>>>> > PValue literally be an [array or object with PValue
>>>> > >>>>>>>>> > values](
>>>> https://github.com/robertwb/beam-javascript/blob/de4390dd767f046903ac23fead5db333290462db/sdks/node-ts/src/apache_beam/pvalue.ts#L116
>>>> )
>>>> > >>>>>>>>> > which transforms can consume or produce. These are
>>>> applied by wrapping
>>>> > >>>>>>>>> > them with the `P` operator, e.g. `P([pc1, pc2,
>>>> pc3]).apply(new
>>>> > >>>>>>>>> > Flatten())`.
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > * Like Python, `flatMap` and `ParDo.process` return
>>>> multiple elements
>>>> > >>>>>>>>> > by yielding them from a generator, rather than invoking a
>>>> passed-in
>>>> > >>>>>>>>> > callback. TBD how to output to multiple distinct
>>>> PCollections. There
>>>> > >>>>>>>>> > is currently an operation to split a PCollection into
>>>> multiple
>>>> > >>>>>>>>> > PCollections based on the properties of the elements, and
>>>> we may
>>>> > >>>>>>>>> > consider using a callback for side outputs.
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > * The `map`, `flatMap`, and `ParDo.process` methods take
>>>> an
>>>> > >>>>>>>>> > additional (optional) context argument, which is similar
>>>> to the
>>>> > >>>>>>>>> > keyword arguments used in Python. These can be "ordinary"
>>>> javascript
>>>> > >>>>>>>>> > objects (which are passed as is) or special DoFnParam
>>>> objects which
>>>> > >>>>>>>>> > provide getters to element-specific information (such as
>>>> the current
>>>> > >>>>>>>>> > timestamp, window, or side input) at runtime.
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > * Javascript supports (and encourages) an asynchronous
>>>> programming
>>>> > >>>>>>>>> > model, with many libraries requiring use of the
>>>> async/await paradigm.
>>>> > >>>>>>>>> > As there is no way (by design) to go from the asynchronous
>>>> style back
>>>> > >>>>>>>>> > to the synchronous style, this needs to be taken into
>>>> account when
>>>> > >>>>>>>>> > designing the API. We currently offer asynchronous
>>>> variants of
>>>> > >>>>>>>>> > `PValue.apply(...)` (in addition to the synchronous ones,
>>>> as they are
>>>> > >>>>>>>>> > easier to chain) as well as making `Runner.run`
>>>> asynchronous. TBD to
>>>> > >>>>>>>>> > do this for all user callbacks as well.
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > ## TODO
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > This SDK is a work in progress. In January 2022 we
>>>> developed the
>>>> > >>>>>>>>> > ability to construct and run basic pipelines (including
>>>> external
>>>> > >>>>>>>>> > transforms and running on a portable runner) but the
>>>> following
>>>> > >>>>>>>>> > big-ticket items remain.
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > * Containerization
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> >   * Function and object serialization: we currently only
>>>> support
>>>> > >>>>>>>>> > "loopback" mode; to be able to run in a remote,
>>>> distributed manner we
>>>> > >>>>>>>>> > need to finish up the work in pickling closures and DoFn
>>>> objects. Some
>>>> > >>>>>>>>> > investigation has been started here, but all existing
>>>> libraries have
>>>> > >>>>>>>>> > non-trivial drawbacks.
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> >   * Finish the work in building a full SDK container
>>>> image that starts
>>>> > >>>>>>>>> > the worker.
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > * External transforms
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> >   * Using external transforms requires that the external
>>>> expansion
>>>> > >>>>>>>>> > service already be started and its address provided.  We
>>>> would like to
>>>> > >>>>>>>>> > automatically start it as we do in Python.
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> >   * Artifacts are not currently supported, which will be
>>>> essential for
>>>> > >>>>>>>>> > using Java transforms. (All tests use Python.)
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > * API
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> >   * Side inputs are not yet supported.
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> >   * There are several TODOs of minor features or design
>>>> decisions to finalize.
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> >   * Advanced features like metrics, state, timers, and
>>>> SDF. Possibly
>>>> > >>>>>>>>> > some of these can wait.
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > * Infrastructure
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> >   * Gradle and Jenkins integration for tests and style
>>>> enforcement.
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > * Other
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> >   * Standardize on a way for users to pass PTransform
>>>> names, and
>>>> > >>>>>>>>> > enforce unique names for pipeline update.
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> >   * Use a Javascript Object rather than proto Struct for
>>>> pipeline options.
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> >   * Though Dataflow Runner v2 supports portability,
>>>> submission is
>>>> > >>>>>>>>> > still done via v1beta3 and interaction with GCS rather
>>>> than the job
>>>> > >>>>>>>>> > submission API.
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> >   * Properly wait for bundle completion.
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > There is probably more; there are many TODOs littered
>>>> throughout the code.
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > This code has also not yet been fully peer reviewed (it
>>>> was the result
>>>> > >>>>>>>>> > of a hackathon) which needs to be done before putting it
>>>> into the main
>>>> > >>>>>>>>> > repository.
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > ## Development.
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > ### Getting started
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > Install node.js, and then from within `sdks/node-ts` run:
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > ```
>>>> > >>>>>>>>> > npm install
>>>> > >>>>>>>>> > ```
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > ### Running tests
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > ```
>>>> > >>>>>>>>> > $  npm test
>>>> > >>>>>>>>> > ```
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > ### Style
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > We have adopted prettier which can be run with
>>>> > >>>>>>>>> >
>>>> > >>>>>>>>> > ```
>>>> > >>>>>>>>> > $  npx prettier --write .
>>>> > >>>>>>>>> > ```
>>>>
>>>
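
A small sketch, not code from the SDK, of two API points in the quoted
README: `flatMap` emitting several elements by yielding them from a
generator, and `apply` accepting a plain function as if it were a
PTransform's expand. `MiniCollection`, `splitWords`, and `extractWords` are
hypothetical names for an in-memory stand-in; a real PCollection is deferred
and distributed, so this only illustrates the calling convention.

```typescript
// Hypothetical in-memory stand-in for a PCollection (an assumption for
// illustration, not the SDK's class).
class MiniCollection<T> {
  constructor(private readonly elements: T[]) {}

  // Exactly one output per input element.
  map<U>(fn: (x: T) => U): MiniCollection<U> {
    return new MiniCollection(this.elements.map(fn));
  }

  // Zero or more outputs per input: fn returns any iterable, so a
  // generator function can simply yield each output.
  flatMap<U>(fn: (x: T) => Iterable<U>): MiniCollection<U> {
    const out: U[] = [];
    for (const x of this.elements) {
      for (const y of fn(x)) {
        out.push(y);
      }
    }
    return new MiniCollection(out);
  }

  // apply() with a plain function, used the way an expand method would be.
  apply<R>(transform: (input: MiniCollection<T>) => R): R {
    return transform(this);
  }

  toArray(): T[] {
    return [...this.elements];
  }
}

// Yields each non-empty word instead of invoking a passed-in callback.
function* splitWords(line: string): Generator<string> {
  for (const w of line.split(/\s+/)) {
    if (w.length > 0) {
      yield w;
    }
  }
}

// A composite "transform" written as an ordinary function.
const extractWords = (lines: MiniCollection<string>) =>
  lines.flatMap(splitWords).map((w) => w.toLowerCase());

const words = new MiniCollection(["To be", "or not  to be"])
  .apply(extractWords)
  .toArray();
// words is ["to", "be", "or", "not", "to", "be"]
```

Returning an iterable keeps the element function free of any output-callback
parameter, which is the trade-off the README describes; it also shows why
side outputs need some other mechanism, since a generator has only one
stream to yield into.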
