This is an automated email from the ASF dual-hosted git repository.

boyuanz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/beam.git
The following commit(s) were added to refs/heads/master by this push:
     new 4c51569  Add splittable dofn as the recommended way of building connectors.
     new 3de140f  Merge pull request #13227 from [BEAM-10480] Add splittable dofn as the recommended way of building connectors.
4c51569 is described below

commit 4c51569d3b972e2271efcc520a48ccb1bd20c9be
Author: Boyuan Zhang <boyu...@google.com>
AuthorDate: Thu Oct 29 15:13:35 2020 -0700

    Add splittable dofn as the recommended way of building connectors.
---
 .../en/documentation/io/developing-io-java.md      |  3 +
 .../en/documentation/io/developing-io-overview.md  | 80 +++++++++++++---------
 .../en/documentation/io/developing-io-python.md    |  3 +
 3 files changed, 53 insertions(+), 33 deletions(-)

diff --git a/website/www/site/content/en/documentation/io/developing-io-java.md b/website/www/site/content/en/documentation/io/developing-io-java.md
index 7de2024..7836a3c 100644
--- a/website/www/site/content/en/documentation/io/developing-io-java.md
+++ b/website/www/site/content/en/documentation/io/developing-io-java.md
@@ -17,6 +17,9 @@ limitations under the License.
 -->
 # Developing I/O connectors for Java
 
+**IMPORTANT:** Use ``Splittable DoFn`` to develop your new I/O. For more details, read the
+[new I/O connector overview](/documentation/io/developing-io-overview/).
+
 To connect to a data store that isn’t supported by Beam’s existing I/O
 connectors, you must create a custom I/O connector that usually consist of a
 source and a sink.
 All Beam sources and sinks are composite transforms; however,

diff --git a/website/www/site/content/en/documentation/io/developing-io-overview.md b/website/www/site/content/en/documentation/io/developing-io-overview.md
index 0ea507f..c8e0482 100644
--- a/website/www/site/content/en/documentation/io/developing-io-overview.md
+++ b/website/www/site/content/en/documentation/io/developing-io-overview.md
@@ -46,33 +46,32 @@ are the recommended steps to get started:
 For **bounded (batch) sources**, there are currently two options for creating
 a Beam source:
 
+1. Use `Splittable DoFn`.
+
 1. Use `ParDo` and `GroupByKey`.
-1. Use the `Source` interface and extend the `BoundedSource` abstract subclass.
 
-`ParDo` is the recommended option, as implementing a `Source` can be tricky. See
-[When to use the Source interface](#when-to-use-source) for a list of some use
-cases where you might want to use a `Source` (such as
-[dynamic work rebalancing](/blog/2016/05/18/splitAtFraction-method.html)).
+`Splittable DoFn` is the recommended option, as it is the most recent source framework for both
+bounded and unbounded sources. It is meant to replace the `Source` APIs
+([BoundedSource](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/BoundedSource.html) and
+[UnboundedSource](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/UnboundedSource.html)).
+Read the
+[Splittable DoFn programming guide](/learn/programming-guide/#splittable-dofns) to learn how to
+write a Splittable DoFn. For more information, see the
+[roadmap for multi-SDK connector efforts](/roadmap/connectors-multi-sdk/).
 
-(Java only) For **unbounded (streaming) sources**, you must use the `Source`
-interface and extend the `UnboundedSource` abstract subclass. `UnboundedSource`
-supports features that are useful for streaming pipelines, such as
-checkpointing.
+For Java and Python **unbounded (streaming) sources**, you must use `Splittable DoFn`, which
+supports features that are useful for streaming pipelines, including checkpointing, controlling
+the watermark, and tracking backlog.
 
-Splittable DoFn is a new sources framework that is under development and will
-replace the other options for developing bounded and unbounded sources. For more
-information, see the
-[roadmap for multi-SDK connector efforts](/roadmap/connectors-multi-sdk/).
 
-### When to use the Source interface {#when-to-use-source}
+### When to use the Splittable DoFn interface {#when-to-use-splittable-dofn}
 
-If you are not sure whether to use `Source`, feel free to email the [Beam dev
-mailing list](/get-started/support) and we can discuss the
-specific pros and cons of your case.
+If you are not sure whether to use `Splittable DoFn`, feel free to email the
+[Beam dev mailing list](/get-started/support) and we can discuss the specific pros and cons of your
+case.
 
-In some cases, implementing a `Source` might be necessary or result in better
-performance:
+In some cases, implementing a `Splittable DoFn` might be necessary or result in better performance:
 
 * **Unbounded sources:** `ParDo` does not work for reading from unbounded
   sources. `ParDo` does not support checkpointing or mechanisms like de-duping
@@ -90,22 +89,40 @@ performance:
   jobs. Depending on your data source, dynamic work rebalancing might not be
   possible.
 
-* **Splitting into parts of particular size recommended by the runner:** `ParDo`
-  does not receive `desired_bundle_size` as a hint from runners when performing
-  initial splitting.
+* **Splitting initially to increase parallelism:** `ParDo`
+  does not have the ability to perform initial splitting.
 
 For example, if you'd like to read from a new file format that contains
 many records per file, or if you'd like to read from a key-value store that
 supports read operations in sorted key order.
-### Source lifecycle {#source}
-Here is a sequence diagram that shows the lifecycle of the Source during
- the execution of the Read transform of an IO. The comments give useful
- information to IO developers such as the constraints that
- apply to the objects or particular cases such as streaming mode.
-
- <!-- The source for the sequence diagram can be found in the the SVG resource. -->
-![This is a sequence diagram that shows the lifecycle of the Source](/images/source-sequence-diagram.svg)
+### I/O examples using SDFs
+
+**Java Examples**
+
+* [Kafka](https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/ReadFromKafkaDoFn.java#L118):
+An I/O connector for [Apache Kafka](https://kafka.apache.org/)
+(an open-source distributed event streaming platform).
+* [Watch](https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Watch.java#L787):
+Uses a polling function producing a growing set of outputs for each input until a per-input
+termination condition is met.
+* [Parquet](https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java#L365):
+An I/O connector for [Apache Parquet](https://parquet.apache.org/)
+(an open-source columnar storage format).
+* [HL7v2](https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/healthcare/HL7v2IO.java#L493):
+An I/O connector for HL7v2 messages (a clinical messaging format that provides data about events
+that occur inside an organization) part of
+[Google’s Cloud Healthcare API](https://cloud.google.com/healthcare).
+* [BoundedSource wrapper](https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L248):
+A wrapper which converts an existing [BoundedSource](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/BoundedSource.html)
+implementation to a splittable DoFn.
+* [UnboundedSource wrapper](https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L432):
+A wrapper which converts an existing [UnboundedSource](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/UnboundedSource.html)
+implementation to a splittable DoFn.
+
+**Python Examples**
+
+* [BoundedSourceWrapper](https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/python/apache_beam/io/iobase.py#L1375):
+A wrapper which converts an existing [BoundedSource](https://beam.apache.org/releases/pydoc/current/apache_beam.io.iobase.html#apache_beam.io.iobase.BoundedSource)
+implementation to a splittable DoFn.
 
 ### Using ParDo and GroupByKey
 
@@ -157,7 +174,6 @@ example:
   cannot be parallelized. In this case, the `ParDo` would open the file and
   read in sequence, producing a `PCollection` of records from the file.
 
-
 ## Sinks
 
 To create a Beam sink, we recommend that you use a `ParDo` that writes the
@@ -169,8 +185,6 @@ For **file-based sinks**, you can use the `FileBasedSink`
 abstraction that is provided by both the Java and Python SDKs.
 See our language specific implementation guides for more details:
 
-* [Developing I/O connectors for Java](/documentation/io/developing-io-java/)
-* [Developing I/O connectors for Python](/documentation/io/developing-io-python/)

diff --git a/website/www/site/content/en/documentation/io/developing-io-python.md b/website/www/site/content/en/documentation/io/developing-io-python.md
index 039b633..7c7705b 100644
--- a/website/www/site/content/en/documentation/io/developing-io-python.md
+++ b/website/www/site/content/en/documentation/io/developing-io-python.md
@@ -19,6 +19,9 @@ limitations under the License.
 -->
 # Developing I/O connectors for Python
 
+**IMPORTANT:** Please use ``Splittable DoFn`` to develop your new I/O. For more details, read
+the [new I/O connector overview](/documentation/io/developing-io-overview/).
+
 To connect to a data store that isn’t supported by Beam’s existing I/O
 connectors, you must create a custom I/O connector that usually consist of a
 source and a sink. All Beam sources and sinks are composite transforms; however,
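As background for readers of this change, the ideas the new docs lean on (a restriction describing a unit of work, initial splitting for parallelism, and checkpointing via claim calls) can be sketched in plain Python. This is a conceptual illustration only: `OffsetRange`, `OffsetRangeTracker`, and `read_records` below are simplified stand-ins I am using for explanation, not the Beam SDK classes; see the Splittable DoFn programming guide linked in the diff for the real APIs.

```python
class OffsetRange:
    """A restriction: a half-open range [start, stop) of record offsets."""

    def __init__(self, start, stop):
        self.start = start
        self.stop = stop

    def split(self, desired_chunk_size):
        """Initial splitting: carve the range into smaller restrictions
        so a runner can hand them out to workers in parallel."""
        chunks, pos = [], self.start
        while pos < self.stop:
            end = min(pos + desired_chunk_size, self.stop)
            chunks.append(OffsetRange(pos, end))
            pos = end
        return chunks


class OffsetRangeTracker:
    """Tracks progress through one restriction. try_claim is the
    checkpointing hook: once a claim fails, processing must stop, and
    the runner may resume later from the recorded position."""

    def __init__(self, restriction):
        self.restriction = restriction
        self.position = restriction.start

    def try_claim(self, offset):
        if offset >= self.restriction.stop:
            return False
        self.position = offset + 1
        return True


def read_records(records, tracker):
    """The 'process' step of an SDF-style read: emit only records whose
    offsets the tracker lets us claim."""
    out = []
    for offset in range(tracker.restriction.start, tracker.restriction.stop):
        if not tracker.try_claim(offset):
            break
        out.append(records[offset])
    return out
```

With this sketch, splitting `OffsetRange(0, 10)` with a desired chunk size of 4 yields the sub-ranges `[0, 4)`, `[4, 8)`, and `[8, 10)`, and each sub-range can be read independently through its own tracker, which is the parallelism and checkpointability that the overview says `ParDo` alone cannot provide.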