I imagine by the nature of the Apache 2.0 license, the quality of the
code in a given release is not a given without some other statement by
the maintainers. We should have clear and present warning signs. Erm.
Experimental labeling.
On Fri, Feb 4, 2022 at 8:27 AM Kenneth Knowles <[email protected]> wrote:
On Thu, Feb 3, 2022 at 3:43 PM Robert Burke <[email protected]>
wrote:
Personally, if it gets added to the repo at all I'd rather we
rip off the band-aid and at least have all the tests regularly
run, and various GitHub actions. Even if we aren't doing the
container release activities, because it's experimental,
that's much better than bit rot and being part of the main
repo has a simpler contribution convention.
100% agree about bit rot and it also makes it more accessible
for contribution and experiment. This is a strong motivation for
me to get it right onto master. Some contributions to branches are
probably just unknown to a lot of contributors (or adventurous
users), for example https://github.com/apache/beam/tree/tez-runner
https://github.com/apache/beam/tree/jstorm-runner
https://github.com/apache/beam/tree/mr-runner
I'm guessing since node has a distribution via npm if we do
nothing it is essentially "not released". I don't see it as a big
problem having it in the archived ASF source releases, as long as
licenses and whatnot are good, though I may be overlooking something.
Kenn
Those are my 2 cents.
Robert B
Beam Go Busybody
On Thu, Feb 3, 2022, 3:29 PM Kenneth Knowles <[email protected]>
wrote:
We did the same for the Go SDK for some time. I imagine
just "not doing the work to release it" suffices? Maybe
+Robert Burke <mailto:[email protected]> has some other
memories of how to not release.
Kenn
On Mon, Jan 31, 2022 at 1:05 PM Kerry Donny-Clark
<[email protected]> wrote:
This project was a great way to kickstart a new SDK.
I'd like to bring this into Beam and start cleanup.
Are there any steps to take before making a PR? Is
there a way to mark this as experimental/not for release?
Kerry
On Mon, Jan 17, 2022 at 1:22 AM Pablo Estrada
<[email protected]> wrote:
This project was fun, and I learned a lot putting
some time into it. I'd love for it to be brought
into the main repository and worked over some time
to be fully supported.
Best
-P.
On Fri, Jan 14, 2022 at 4:46 PM Ahmet Altay
<[email protected]> wrote:
Really nice! Congratulations to all who worked
on this project.
On Fri, Jan 14, 2022 at 4:41 PM Kenneth
Knowles <[email protected]> wrote:
This was super fun, and I really hope it
can be an inspiration to others that you
can build a working Beam SDK in a week!
(hint hint
https://issues.apache.org/jira/browse/BEAM-4010
and
https://issues.apache.org/jira/browse/BEAM-12658
:-)
On Fri, Jan 14, 2022 at 11:38 AM Robert
Bradshaw <[email protected]> wrote:
And, of course, an example:
https://github.com/robertwb/beam-javascript/blob/javascript/sdks/node-ts/src/apache_beam/examples/wordcount.ts
On Fri, Jan 14, 2022 at 11:35 AM
Robert Bradshaw <[email protected]>
wrote:
>
> Last week at Google we had a
hackathon to kick off the new year, and
> one of the projects we came up with
was seeing how far we could get in
> putting together a typescript SDK.
Starting from nothing we were able
> to make a lot of progress and I
wanted to share the results here.
>
>
https://github.com/robertwb/beam-javascript/blob/javascript/sdks/node-ts/README.md
>
> I think this is an exciting project
and look forward to officially
> supporting a new language. Clearly
there is still a fair amount to do,
> and we also need to figure out the
best way to get this reviewed (we'd
> especially welcome feedback (and
contributions) from those, if any, in
> the know about
javascript/typescript/node even if
they're not beam or
> distributed computing experts) and
into the main repository (assuming
> the community is as interested in
this as I am).
>
> The above link is a decent overview,
but copying below for posterity
> as that will likely evolve over time
(e.g. as decisions get made and
> TODOs get resolved).
>
> - Robert
>
>
> --------------------
>
> # Node Beam SDK
>
> This is the start of a fully
functioning Javascript (actually,
> Typescript) SDK. There are two
distinct aims with this SDK
>
> 1. Tap into the large (and
relatively underserved, by existing data
> processing frameworks) community of
javascript developers with a
> native SDK targeting this language.
>
> 1. Develop a new SDK which can serve
both as a proof of concept and
> reference that highlights the
(relative) ease of porting Beam to new
> languages, a differentiating feature
of Beam and Dataflow.
>
> To accomplish this, we lean heavily
on the portability framework. For
> example, we make heavy use of
cross-language transforms, in particular
> for IOs (as a full SDF
implementation may not fit into the
week). In
> addition, the direct runner is
simply an extension of the worker
> suitable for running on portable
runners such as the ULR, which will
> directly transfer to running on
production runners such as Dataflow
> and Flink. The target audience
should hopefully not be put off by
> running other language code
encapsulated in docker images.
>
> ## API
>
> We generally try to apply the
concepts from the Beam API in a
> Typescript idiomatic way, but it
should be noted that few of the
> initial developers have extensive
(if any) Javascript/Typescript
> development experience, so feedback
is greatly appreciated.
>
> In addition, some notable departures
are taken from the traditional SDKs:
>
> * We take a "relational foundations"
approach, where [schema'd
>
data](https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#heading=h.puuotbien1gf)
> is the primary way to interact with
data, and we generally eschew the
> key-value requiring transforms in
favor of a more flexible approach
> naming fields or expressions.
Javascript's native Object is used as
> the row type.
>
> * As part of being schema-first we
also de-emphasize Coders as a
> first-class concept in the SDK,
relegating it to an advanced feature
> used for interop. Though we can
infer schemas from individual
> elements, it is still TBD to
> figure out if/how we can leverage
the type system and/or function
> introspection to regularly infer
schemas at construction time. A
> fallback coder using BSON encoding
is used when we don't have
> sufficient type information.
>
> * We have added additional methods
to the PCollection object, notably
> `map` and `flatMap`, [rather than
only allowing
>
apply](https://www.mail-archive.com/[email protected]/msg06035.html).
> In addition, `apply` can accept a
function argument `(PCollection) =>
> ...` as well as a PTransform
subclass, which treats this callable as
> if it were a PTransform's expand.
>
> * In the other direction, we have
eliminated the [problematic Pipeline
>
object](https://s.apache.org/no-beam-pipeline)
from the API, instead
> providing a `Root` PValue on which
pipelines are built, and invoking
> run() on a Runner. We offer a less
error-prone `Runner.run` which
> finishes only when the pipeline is
completely finished as well as
> `Runner.runAsync` which returns a
handle to the running pipeline.
>
> * Rather than introduce
PCollectionTuple, PCollectionList,
etc. we let
> PValue literally be an [array or
object with PValue
>
values](https://github.com/robertwb/beam-javascript/blob/de4390dd767f046903ac23fead5db333290462db/sdks/node-ts/src/apache_beam/pvalue.ts#L116)
> which transforms can consume or
produce. These are applied by wrapping
> them with the `P` operator, e.g.
`P([pc1, pc2, pc3]).apply(new
> Flatten())`.
>
> * Like Python, `flatMap` and
`ParDo.process` return multiple elements
> by yielding them from a generator,
rather than invoking a passed-in
> callback. TBD how to output to
multiple distinct PCollections. There
> is currently an operation to split a
PCollection into multiple
> PCollections based on the properties
of the elements, and we may
> consider using a callback for side
outputs.
>
> * The `map`, `flatMap`, and
`ParDo.process` methods take an
> additional (optional) context
argument, which is similar to the
> keyword arguments used in Python.
These can be "ordinary" javascript
> objects (which are passed as is) or
special DoFnParam objects which
> provide getters to element-specific
information (such as the current
> timestamp, window, or side input) at
runtime.
>
> * Javascript supports (and
encourages) an asynchronous programming
> model, with many libraries requiring
use of the async/await paradigm.
> As there is no way (by design) to go
from the asynchronous style back
> to the synchronous style, this needs
to be taken into account when
> designing the API. We currently
offer asynchronous variants of
> `PValue.apply(...)` (in addition to
the synchronous ones, as they are
> easier to chain) as well as making
`Runner.run` asynchronous. TBD to
> do this for all user callbacks as well.
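
To make a few of the points above concrete (`map`/`flatMap` on the collection itself, generators for multiple outputs, and `apply` accepting a plain function), here is a minimal in-memory sketch. It is not the SDK's implementation: `LocalPCollection`, `splitWords`, and `countPerWord` are hypothetical names invented for illustration, and everything runs eagerly on a local array rather than through a runner.

```typescript
// Hypothetical in-memory stand-in for a PCollection; NOT the SDK's real class.
class LocalPCollection<T> {
  constructor(readonly elements: T[]) {}

  // Exactly one output per input element.
  map<O>(fn: (element: T) => O): LocalPCollection<O> {
    return new LocalPCollection(this.elements.map(fn));
  }

  // Zero or more outputs per input, produced by iterating whatever the
  // callback returns (typically a generator) instead of calling a callback.
  flatMap<O>(fn: (element: T) => Iterable<O>): LocalPCollection<O> {
    const out: O[] = [];
    for (const element of this.elements) {
      for (const result of fn(element)) {
        out.push(result);
      }
    }
    return new LocalPCollection(out);
  }

  // Accepts a plain function and treats it like a PTransform's expand.
  apply<O>(expand: (input: LocalPCollection<T>) => O): O {
    return expand(this);
  }
}

// Plain JavaScript objects serve as rows, in the schema-first spirit.
type WordCount = { word: string; count: number };

// A generator that yields one output per non-empty word.
function* splitWords(line: string): IterableIterator<string> {
  for (const word of line.split(/\s+/)) {
    if (word.length > 0) {
      yield word;
    }
  }
}

// A composite transform written as an ordinary function.
function countPerWord(words: LocalPCollection<string>): LocalPCollection<WordCount> {
  const counts = new Map<string, number>();
  for (const word of words.elements) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  const rows: WordCount[] = [];
  counts.forEach((count, word) => rows.push({ word, count }));
  return new LocalPCollection(rows);
}

const lines = new LocalPCollection(["To be or", "not to be"]);
const wordCounts = lines
  .map((line) => line.toLowerCase())
  .flatMap(splitWords)
  .apply(countPerWord);

console.log(wordCounts.elements);
```

Here "to" and "be" are each counted twice and "or" and "not" once. Because `apply` only needs a function from one collection to another, `countPerWord` can be chained without defining a PTransform subclass.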
>
> ## TODO
>
> This SDK is a work in progress. In
January 2022 we developed the
> ability to construct and run basic
pipelines (including external
> transforms and running on a portable
runner) but the following
> big-ticket items remain.
>
> * Containerization
>
> * Function and object
serialization: we currently only support
> "loopback" mode; to be able to run
in a remote, distributed manner we
> need to finish up the work in
pickling closures and DoFn objects. Some
> investigation has been started here,
but all existing libraries have
> non-trivial drawbacks.
>
> * Finish the work in building a
full SDK container image that starts
> the worker.
>
> * External transforms
>
> * Using external transforms
requires that the external expansion
> service already be started and its
address provided. We would like to
> automatically start it as we do in
Python.
>
> * Artifacts are not currently
supported, which will be essential for
> using Java transforms. (All tests
use Python.)
>
> * API
>
> * Side inputs are not yet supported.
>
> * There are several TODOs of minor
features or design decisions to finalize.
>
> * Advanced features like metrics,
state, timers, and SDF. Possibly
> some of these can wait.
>
> * Infrastructure
>
> * Gradle and Jenkins integration
for tests and style enforcement.
>
> * Other
>
> * Standardize on a way for users
to pass PTransform names, and
> enforce unique names for pipeline
update.
>
> * Use a Javascript Object rather
than proto Struct for pipeline options.
>
> * Though Dataflow Runner v2
supports portability, submission is
> still done via v1beta3 and
interaction with GCS rather than the job
> submission API.
>
> * Properly wait for bundle completion.
>
> There is probably more; there are
many TODOs littered throughout the code.
>
> This code has also not yet been
fully peer reviewed (it was the result
> of a hackathon) which needs to be
done before putting it into the main
> repository.
>
>
> ## Development
>
> ### Getting started
>
> Install node.js, and then run the following from
within `sdks/node-ts`:
>
> ```
> npm install
> ```
>
> ### Running tests
>
> ```
> $ npm test
> ```
>
> ### Style
>
> We have adopted prettier which can
be run with
>
> ```
> $ npx prettier --write .
> ```