damccorm commented on code in PR #25941: URL: https://github.com/apache/beam/pull/25941#discussion_r1147643456
##########
website/www/site/content/en/documentation/sdks/typescript.md:
##########
@@ -0,0 +1,146 @@
+---
+type: languages
+title: "Apache Beam Typescript SDK"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# Apache Beam Typescript SDK
+
+The Typescript SDK for Apache Beam provides a simple, powerful API for building batch and streaming data processing pipelines.
+
+## Get started with the Typescript SDK
+
+Get started with the [Beam Typescript SDK quickstart](/get-started/quickstart/typescript)
+to set up your development environment, get the Beam SDK for Typescript, and run an example pipeline.
+Then, read through the [Beam programming guide](/documentation/programming-guide)
+to learn the basic concepts that apply to all SDKs in Beam.
+
+## Overview
+
+We generally try to apply the concepts from the Beam API in a TypeScript
+idiomatic way.
+In addition, some notable departures are taken from the traditional SDKs:
+
+* We take a "relational foundations" approach, where
+[schema'd data](https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#heading=h.puuotbien1gf)
+is the primary way to interact with data, and we generally eschew the key-value
+requiring transforms in favor of a more flexible approach naming fields or
+expressions. For example, we favor the more flexible
+[GroupBy](https://beam.apache.org/documentation/programming-guide/#groupbykey)
+PTransform over the traditional GroupByKey.
+JavaScript's native Object is used as the row type.
+
+* As part of being schema-first we also de-emphasize Coders as a first-class
+concept in the SDK, relegating it to an advanced feature used for interop.
+Though we can infer schemas from individual elements, it is still TBD to
+figure out if/how we can leverage the type system and/or function introspection
+to regularly infer schemas at construction time. A fallback coder using BSON
+encoding is used when we don't have sufficient type information.
+
+* We have added additional methods to the PCollection object, notably `map`
+and `flatmap`, [rather than only allowing apply](https://www.mail-archive.com/[email protected]/msg06035.html).
+In addition, `apply` can accept a function argument `(PCollection) => ...` as
+well as a PTransform subclass, which treats this callable as if it were a
+PTransform's expand.
+
+* In the other direction, we have eliminated the
+[problematic Pipeline object](https://s.apache.org/no-beam-pipeline)
+from the API, instead providing a `Root` PValue on which pipelines are built,
+and invoking run() on a Runner. We offer a less error-prone `Runner.run`
+which finishes only when the pipeline is completely finished as well as
+`Runner.runAsync` which returns a handle to the running pipeline.
+
+* Rather than introduce PCollectionTuple, PCollectionList, etc. we let PValue
+literally be an
+[array or object with PValue values](https://github.com/robertwb/beam-javascript/blob/de4390dd767f046903ac23fead5db333290462db/sdks/node-ts/src/apache_beam/pvalue.ts#L116)
+which transforms can consume or produce.
+These are applied by wrapping them with the `P` operator, e.g.
+`P([pc1, pc2, pc3]).apply(new Flatten())`.
+
+* Like Python, `flatMap` and `ParDo.process` return multiple elements by
+yielding them from a generator, rather than invoking a passed-in callback.
+There is currently an operation to split a PCollection into multiple
+PCollections based on the properties of the elements, and
+we may consider using a callback for side outputs.
+
+* The `map`, `flatMap`, and `ParDo.process` methods take an additional
+(optional) context argument, which is similar to the keyword arguments
+used in Python. These are javascript objects whose members may be constants
+(which are passed as is) or special DoFnParam objects which provide getters to
+element-specific information (such as the current timestamp, window,
+or side input) at runtime.
+
+* Rather than introduce multiple-output complexity into the map/do operations
+themselves, producing multiple outputs is done by following with a new
+`Split` primitive that takes a
+`PCollection<{a?: AType, b: BType, ... }>` and produces an object
+`{a: PCollection<AType>, b: PCollection<BType>, ...}`.
+
+* JavaScript supports (and encourages) an asynchronous programing model, with
+many libraries requiring use of the async/await paradigm.
+As there is no way (by design) to go from the asynchronous style back to
+the synchronous style, this needs to be taken into account
+when designing the API.
+We currently offer asynchronous variants of `PValue.apply(...)` (in addition
+to the synchronous ones, as they are easier to chain) as well as making
+`Runner.run` asynchronous. TBD to do this for all user callbacks as well.
+
+An example pipeline can be found at [wordcount.ts](https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/examples/wordcount.ts)
+and more documentation can be found in the [beam programming guide](/documentation/programming-guide/).
+
+
+## Pipeline I/O
+
+See the [Beam-provided I/O Transforms](/documentation/io/built-in/) page for a list of the currently available I/O transforms.
+
+
+## Supported Features
+
+The Typescript SDK is still under development but already supports many,
+but not all, features currently supported by the Beam model, including those required by streaming.

Review Comment:
   "including those required by streaming." - it is ambiguous whether this means these features are or are not included. I think you're saying they're not included since we don't have fine grained watermark-type stuff, but I also think you could write a streaming pipeline using x-lang, so I'm not sure if the specific callout makes sense.

##########
website/www/site/content/en/get-started/quickstart/typescript.md:
##########
@@ -0,0 +1,187 @@
+---
+title: "Beam Quickstart for Typescript"
+aliases:
+  - /get-started/quickstart/
+  - /use/quickstart/
+  - /getting-started/
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Apache Beam Typescript SDK quickstart
+
+This quickstart shows you how to run an
+[example pipeline](https://github.com/apache/beam-starter-typescript) written with
+the [Apache Beam Typescript SDK](/documentation/sdks/typescript), using the
+[Direct Runner](/documentation/runners/direct/). The Direct Runner executes
+pipelines locally on your machine.
+
+If you're interested in contributing to the Apache Beam Typescript codebase, see the
+[Contribution Guide](/contribute).
+
+On this page:
+
+{{< toc >}}
+
+## Set up your development environment
+
+Make sure you have a [Node.js](https://nodejs.org/) development environment installed.
+If you don't, you can download and install it from the
+[downloads page](https://nodejs.org/en/download/).
+
+Due to its extensive use of cross-language transforms, it is recommended that
+Python 3 and Java be available on the system as well.
+
+## Clone the GitHub repository
+
+Clone or download the
+[apache/beam-starter-typescript](https://github.com/apache/beam-starter-typescript)
+GitHub repository and change into the `beam-starter-typescript` directory.
+
+{{< highlight >}}
+git clone https://github.com/apache/beam-starter-typescript.git
+cd beam-starter-typescript
+{{< /highlight >}}
+
+## Install the project dependences
+
+Run the following command to install the project's dependencies.
+
+{{< highlight >}}
+npm install
+{{< /highlight >}}
+
+## Compile the pipeline
+
+The pipeline is then built with
+
+{{< highlight >}}
+npm run build
+{{< /highlight >}}
+
+## Run the quickstart
+
+Run the following command:
+
+{{< highlight >}}
+node dist/src/main.js --input_text="Greetings"
+{{< /highlight >}}
+
+The output is similar to the following:
+
+{{< highlight >}}
+Hello
+World!
+Greetings
+{{< /highlight >}}
+
+The lines might appear in a different order.
+
+## Explore the code
+
+The main code file for this quickstart is **app.ts**
+([GitHub](https://github.com/apache/beam-starter-typescript/blob/main/src/app.ts)).
+The code performs the following steps:
+
+1. Define a Beam pipeline that.
+  + Creates an initial `PCollection`.
+  + Apply a transform (map) to the `PCollection`.

Review Comment:
   ```suggestion
     + Applies a transform (map) to the `PCollection`.
   ```
   nit - consistency
