This is an automated email from the ASF dual-hosted git repository.

git-site-role pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/beam.git

The following commit(s) were added to refs/heads/asf-site by this push:
     new f21b5e2  Publishing website 2020/12/02 00:01:22 at commit 83ed9ae
f21b5e2 is described below

commit f21b5e2e4e1a7fe4de1a1a7e18cc56e679921a2c
Author: jenkins <bui...@apache.org>
AuthorDate: Wed Dec 2 00:01:22 2020 +0000

    Publishing website 2020/12/02 00:01:22 at commit 83ed9ae
---
 website/generated-content/documentation/index.xml  |  8 ++++
 .../documentation/io/developing-io-java/index.html |  3 +-
 .../io/developing-io-overview/index.html           | 54 +++++++++++++---------
 .../io/developing-io-python/index.html             |  3 +-
 .../documentation/programming-guide/index.html     |  4 ++
 website/generated-content/sitemap.xml              |  2 +-
 6 files changed, 49 insertions(+), 25 deletions(-)

diff --git a/website/generated-content/documentation/index.xml b/website/generated-content/documentation/index.xml
index 4841e74..38831bf 100644
--- a/website/generated-content/documentation/index.xml
+++ b/website/generated-content/documentation/index.xml
@@ -307,6 +307,8 @@ See the License for the specific language governing permissions and limitations under the License. --> <h1 id="developing-io-connectors-for-java">Developing I/O connectors for Java</h1>
+<p><strong>IMPORTANT:</strong> Use <code>Splittable DoFn</code> to develop your new I/O. For more details, read the
+<a href="/documentation/io/developing-io-overview/">new I/O connector overview</a>.</p>
 <p>To connect to a data store that isn’t supported by Beam’s existing I/O connectors, you must create a custom I/O connector that usually consist of a source and a sink. All Beam sources and sinks are composite transforms; however,
@@ -655,6 +657,8 @@ See the License for the specific language governing permissions and limitations under the License. --> <h1 id="developing-io-connectors-for-python">Developing I/O connectors for Python</h1>
+<p><strong>IMPORTANT:</strong> Please use <code>Splittable DoFn</code> to develop your new I/O.
For more details, please read +the <a href="/documentation/io/developing-io-overview/">new I/O connector overview</a>.</p> <p>To connect to a data store that isn’t supported by Beam’s existing I/O connectors, you must create a custom I/O connector that usually consist of a source and a sink. All Beam sources and sinks are composite transforms; however, @@ -7105,6 +7109,10 @@ to annotate the <code>DoFn</code>.</p> <span class="k">def</span> <span class="nf">process</span><span class="p">(</span> <span class="bp">self</span><span class="p">,</span> <span class="n">file_name</span><span class="p">,</span> +<span class="c1"># Alternatively, we can let FileToWordsFn itself inherit from</span> +<span class="c1"># RestrictionProvider, implement the required methods and let</span> +<span class="c1"># tracker=beam.DoFn.RestrictionParam() which will use self as</span> +<span class="c1"># the provider.</span> <span class="n">tracker</span><span class="o">=</span><span class="n">beam</span><span class="o">.</span><span class="n">DoFn</span><span class="o">.</span><span class="n">RestrictionParam</span><span class="p">(</span><span class="n">FileToWordsRestrictionProvider</span><span class="p">())):</span> <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">file_name</span><span class="p">)</span> <span class="k">as</span> <span class="n">file_handle</span><span class="p">:</span> <span class="n">file_handle</span><span class="o">.</span><span class="n">seek</span><span class="p">(</span><span class="n">tracker</span><span class="o">.</span><span class="n">current_restriction</span><span class="o">.</span><span class="n">start</span><span class="p">())</span> diff --git a/website/generated-content/documentation/io/developing-io-java/index.html b/website/generated-content/documentation/io/developing-io-java/index.html index e6aacad..495dd2b 100644 --- a/website/generated-content/documentation/io/developing-io-java/index.html +++ 
b/website/generated-content/documentation/io/developing-io-java/index.html @@ -1,7 +1,8 @@ <!doctype html><html lang=en class=no-js><head><meta charset=utf-8><meta http-equiv=x-ua-compatible content="IE=edge"><meta name=viewport content="width=device-width,initial-scale=1"><title>Apache Beam: Developing I/O connectors for Java</title><meta name=description content="Apache Beam is an open source, unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration P [...] <span class=sr-only>Toggle navigation</span> <span class=icon-bar></span><span class=icon-bar></span><span class=icon-bar></span></button> -<a href=/ class=navbar-brand><img alt=Brand style=height:25px src=/images/beam_logo_navbar.png></a></div><div class="navbar-mask closed"></div><div id=navbar class="navbar-container closed"><ul class="nav navbar-nav"><li><a href=/get-started/beam-overview/>Get Started</a></li><li><a href=/documentation/>Documentation</a></li><li><a href=/documentation/sdks/java/>Languages</a></li><li><a href=/documentation/runners/capability-matrix/>RUNNERS</a></li><li><a href=/roadmap/>Roadmap</a></li>< [...] +<a href=/ class=navbar-brand><img alt=Brand style=height:25px src=/images/beam_logo_navbar.png></a></div><div class="navbar-mask closed"></div><div id=navbar class="navbar-container closed"><ul class="nav navbar-nav"><li><a href=/get-started/beam-overview/>Get Started</a></li><li><a href=/documentation/>Documentation</a></li><li><a href=/documentation/sdks/java/>Languages</a></li><li><a href=/documentation/runners/capability-matrix/>RUNNERS</a></li><li><a href=/roadmap/>Roadmap</a></li>< [...] +<a href=/documentation/io/developing-io-overview/>new I/O connector overview</a>.</p><p>To connect to a data store that isn’t supported by Beam’s existing I/O connectors, you must create a custom I/O connector that usually consist of a source and a sink. 
All Beam sources and sinks are composite transforms; however, the implementation of your custom I/O depends on your use case. Before you diff --git a/website/generated-content/documentation/io/developing-io-overview/index.html b/website/generated-content/documentation/io/developing-io-overview/index.html index c507f58..651e577 100644 --- a/website/generated-content/documentation/io/developing-io-overview/index.html +++ b/website/generated-content/documentation/io/developing-io-overview/index.html @@ -1,7 +1,7 @@ <!doctype html><html lang=en class=no-js><head><meta charset=utf-8><meta http-equiv=x-ua-compatible content="IE=edge"><meta name=viewport content="width=device-width,initial-scale=1"><title>Overview: Developing a new I/O connector</title><meta name=description content="Apache Beam is an open source, unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration Patterns [...] <span class=sr-only>Toggle navigation</span> <span class=icon-bar></span><span class=icon-bar></span><span class=icon-bar></span></button> -<a href=/ class=navbar-brand><img alt=Brand style=height:25px src=/images/beam_logo_navbar.png></a></div><div class="navbar-mask closed"></div><div id=navbar class="navbar-container closed"><ul class="nav navbar-nav"><li><a href=/get-started/beam-overview/>Get Started</a></li><li><a href=/documentation/>Documentation</a></li><li><a href=/documentation/sdks/java/>Languages</a></li><li><a href=/documentation/runners/capability-matrix/>RUNNERS</a></li><li><a href=/roadmap/>Roadmap</a></li>< [...] 
+<a href=/ class=navbar-brand><img alt=Brand style=height:25px src=/images/beam_logo_navbar.png></a></div><div class="navbar-mask closed"></div><div id=navbar class="navbar-container closed"><ul class="nav navbar-nav"><li><a href=/get-started/beam-overview/>Get Started</a></li><li><a href=/documentation/>Documentation</a></li><li><a href=/documentation/sdks/java/>Languages</a></li><li><a href=/documentation/runners/capability-matrix/>RUNNERS</a></li><li><a href=/roadmap/>Roadmap</a></li>< [...] the <a href=/documentation/io/built-in/>Built-in I/O connectors</a></em></p><p>To connect to a data store that isn’t supported by Beam’s existing I/O connectors, you must create a custom I/O connector. A connector usually consists of a source and a sink. All Beam sources and sinks are composite transforms; @@ -12,19 +12,18 @@ questions you might have. In addition, you can check if anyone else is working on the same I/O connector.</p></li><li><p>If you plan to contribute your I/O connector to the Beam community, see the <a href=/contribute/contribution-guide/>Apache Beam contribution guide</a>.</p></li><li><p>Read the <a href=/contribute/ptransform-style-guide/>PTransform style guide</a> for additional style guide recommendations.</p></li></ol><h2 id=sources>Sources</h2><p>For <strong>bounded (batch) sources</strong>, there are currently two options for creating a -Beam source:</p><ol><li><p>Use <code>ParDo</code> and <code>GroupByKey</code>.</p></li><li><p>Use the <code>Source</code> interface and extend the <code>BoundedSource</code> abstract subclass.</p></li></ol><p><code>ParDo</code> is the recommended option, as implementing a <code>Source</code> can be tricky. 
See -<a href=#when-to-use-source>When to use the Source interface</a> for a list of some use -cases where you might want to use a <code>Source</code> (such as -<a href=/blog/2016/05/18/splitAtFraction-method.html>dynamic work rebalancing</a>).</p><p>(Java only) For <strong>unbounded (streaming) sources</strong>, you must use the <code>Source</code> -interface and extend the <code>UnboundedSource</code> abstract subclass. <code>UnboundedSource</code> -supports features that are useful for streaming pipelines, such as -checkpointing.</p><p>Splittable DoFn is a new sources framework that is under development and will -replace the other options for developing bounded and unbounded sources. For more -information, see the -<a href=/roadmap/connectors-multi-sdk/>roadmap for multi-SDK connector efforts</a>.</p><h3 id=when-to-use-source>When to use the Source interface</h3><p>If you are not sure whether to use <code>Source</code>, feel free to email the <a href=/get-started/support>Beam dev -mailing list</a> and we can discuss the -specific pros and cons of your case.</p><p>In some cases, implementing a <code>Source</code> might be necessary or result in better -performance:</p><ul><li><p><strong>Unbounded sources:</strong> <code>ParDo</code> does not work for reading from unbounded +Beam source:</p><ol><li><p>Use <code>Splittable DoFn</code>.</p></li><li><p>Use <code>ParDo</code> and <code>GroupByKey</code>.</p></li></ol><p><code>Splittable DoFn</code> is the recommended option, as it’s the most recent source framework for both +bounded and unbounded sources. This is meant to replace the <code>Source</code> APIs( +<a href=https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/BoundedSource.html>BoundedSource</a> and +<a href=https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/UnboundedSource.html>UnboundedSource</a>) +in the new system. 
Read +<a href=/learn/programming-guide/#splittable-dofns>Splittable DoFn Programming Guide</a> for how to write one +Splittable DoFn. For more information, see the +<a href=/roadmap/connectors-multi-sdk/>roadmap for multi-SDK connector efforts</a>.</p><p>For Java and Python <strong>unbounded (streaming) sources</strong>, you must use the <code>Splittable DoFn</code>, which +supports features that are useful for streaming pipelines, including checkpointing, controlling +watermark, and tracking backlog.</p><h3 id=when-to-use-splittable-dofn>When to use the Splittable DoFn interface</h3><p>If you are not sure whether to use <code>Splittable DoFn</code>, feel free to email the +<a href=/get-started/support>Beam dev mailing list</a> and we can discuss the specific pros and cons of your +case.</p><p>In some cases, implementing a <code>Splittable DoFn</code> might be necessary or result in better performance:</p><ul><li><p><strong>Unbounded sources:</strong> <code>ParDo</code> does not work for reading from unbounded sources. <code>ParDo</code> does not support checkpointing or mechanisms like de-duping that are useful for streaming data sources.</p></li><li><p><strong>Progress and size estimation:</strong> <code>ParDo</code> can’t provide hints to runners about progress or the size of data they are reading. Without size estimation of the @@ -34,14 +33,25 @@ allocate workers, it does not have any clues as to how many workers you might need for your pipeline.</p></li><li><p><strong>Dynamic work rebalancing:</strong> <code>ParDo</code> does not support dynamic work rebalancing, which is used by some readers to improve the processing speed of jobs. 
Depending on your data source, dynamic work rebalancing might not be -possible.</p></li><li><p><strong>Splitting into parts of particular size recommended by the runner:</strong> <code>ParDo</code> -does not receive <code>desired_bundle_size</code> as a hint from runners when performing -initial splitting.</p></li></ul><p>For example, if you’d like to read from a new file format that contains many +possible.</p></li><li><p><strong>Splitting initially to increase parallelism:</strong> <code>ParDo</code> +does not have the ability to perform initial splitting.</p></li></ul><p>For example, if you’d like to read from a new file format that contains many records per file, or if you’d like to read from a key-value store that supports -read operations in sorted key order.</p><h3 id=source>Source lifecycle</h3><p>Here is a sequence diagram that shows the lifecycle of the Source during -the execution of the Read transform of an IO. The comments give useful -information to IO developers such as the constraints that -apply to the objects or particular cases such as streaming mode.</p><p><img src=/images/source-sequence-diagram.svg alt="This is a sequence diagram that shows the lifecycle of the Source"></p><h3 id=using-pardo-and-groupbykey>Using ParDo and GroupByKey</h3><p>For data stores or file types where the data can be read in parallel, you can +read operations in sorted key order.</p><h3 id=io-examples-using-sdfs>I/O examples using SDFs</h3><p><strong>Java Examples</strong></p><ul><li><a href=https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/ReadFromKafkaDoFn.java#L118>Kafka</a>: +An I/O connector for <a href=https://kafka.apache.org/>Apache Kafka</a> +(an open-source distributed event streaming platform).</li><li><a href=https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Watch.java#L787>Watch</a>: 
+Uses a polling function producing a growing set of outputs for each input until a per-input +termination condition is met.</li><li><a href=https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java#L365>Parquet</a>: +An I/O connector for <a href=https://parquet.apache.org/>Apache Parquet</a> +(an open-source columnar storage format).</li><li><a href=https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/healthcare/HL7v2IO.java#L493>HL7v2</a>: +An I/O connector for HL7v2 messages (a clinical messaging format that provides data about events +that occur inside an organization) part of +<a href=https://cloud.google.com/healthcare>Google’s Cloud Healthcare API</a>.</li><li><a href=https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L248>BoundedSource wrapper</a>: +A wrapper which converts an existing <a href=https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/BoundedSource.html>BoundedSource</a> +implementation to a splittable DoFn.</li><li><a href=https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L432>UnboundedSource wrapper</a>: +A wrapper which converts an existing <a href=https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/UnboundedSource.html>UnboundedSource</a> +implementation to a splittable DoFn.</li></ul><p><strong>Python Examples</strong></p><ul><li><a href=https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/python/apache_beam/io/iobase.py#L1375>BoundedSourceWrapper</a>: +A wrapper which converts an existing <a 
href=https://beam.apache.org/releases/pydoc/current/apache_beam.io.iobase.html#apache_beam.io.iobase.BoundedSource>BoundedSource</a> +implementation to a splittable DoFn.</li></ul><h3 id=using-pardo-and-groupbykey>Using ParDo and GroupByKey</h3><p>For data stores or file types where the data can be read in parallel, you can think of the process as a mini-pipeline. This often consists of two steps:</p><ol><li><p>Splitting the data into parts to be read in parallel</p></li><li><p>Reading from each of those parts</p></li></ol><p>Each of those steps will be a <code>ParDo</code>, with a <code>GroupByKey</code> in between. The <code>GroupByKey</code> is an implementation detail, but for most runners <code>GroupByKey</code> allows the runner to use different numbers of workers in some situations:</p><ul><li><p>Determining how to split up the data to be read into chunks</p></li><li><p>Reading data, which often benefits from more workers</p></li></ul><p>In addition, <code>GroupByKey</code> also allows dynamic work rebalancing to happen on @@ -64,7 +74,7 @@ received records to the data store. To develop more complex sinks (for example, to support data de-duplication when failures are retried by a runner), use <code>ParDo</code>, <code>GroupByKey</code>, and other available Beam transforms.</p><p>For <strong>file-based sinks</strong>, you can use the <code>FileBasedSink</code> abstraction that is provided by both the Java and Python SDKs. See our language specific -implementation guides for more details:</p><ul><li><a href=/documentation/io/developing-io-java/>Developing I/O connectors for Java</a></li><li><a href=/documentation/io/developing-io-python/>Developing I/O connectors for Python</a></li></ul></div></div><footer class=footer><div class=footer__contained><div class=footer__cols><div class=footer__cols__col><div class=footer__cols__col__logo><img src=/images/beam_logo_circle.svg class=footer__logo alt="Beam logo"></div><div class=footer__co [...] 
+implementation guides for more details:</p></div></div><footer class=footer><div class=footer__contained><div class=footer__cols><div class=footer__cols__col><div class=footer__cols__col__logo><img src=/images/beam_logo_circle.svg class=footer__logo alt="Beam logo"></div><div class=footer__cols__col__logo><img src=/images/apache_logo_circle.svg class=footer__logo alt="Apache logo"></div></div><div class="footer__cols__col footer__cols__col--md"><div class=footer__cols__col__title>Start</ [...] <a href=http://www.apache.org>The Apache Software Foundation</a> | <a href=/privacy_policy>Privacy Policy</a> | <a href=/feed.xml>RSS Feed</a><br><br>Apache Beam, Apache, Beam, the Beam logo, and the Apache feather logo are either registered trademarks or trademarks of The Apache Software Foundation. All other products or name brands are trademarks of their respective holders, including The Apache Software Foundation.</div></footer></body></html> \ No newline at end of file diff --git a/website/generated-content/documentation/io/developing-io-python/index.html b/website/generated-content/documentation/io/developing-io-python/index.html index 9121fbc..e43269d 100644 --- a/website/generated-content/documentation/io/developing-io-python/index.html +++ b/website/generated-content/documentation/io/developing-io-python/index.html @@ -1,7 +1,8 @@ <!doctype html><html lang=en class=no-js><head><meta charset=utf-8><meta http-equiv=x-ua-compatible content="IE=edge"><meta name=viewport content="width=device-width,initial-scale=1"><title>Apache Beam: Developing I/O connectors for Python</title><meta name=description content="Apache Beam is an open source, unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration [...] 
<span class=sr-only>Toggle navigation</span> <span class=icon-bar></span><span class=icon-bar></span><span class=icon-bar></span></button> -<a href=/ class=navbar-brand><img alt=Brand style=height:25px src=/images/beam_logo_navbar.png></a></div><div class="navbar-mask closed"></div><div id=navbar class="navbar-container closed"><ul class="nav navbar-nav"><li><a href=/get-started/beam-overview/>Get Started</a></li><li><a href=/documentation/>Documentation</a></li><li><a href=/documentation/sdks/java/>Languages</a></li><li><a href=/documentation/runners/capability-matrix/>RUNNERS</a></li><li><a href=/roadmap/>Roadmap</a></li>< [...] +<a href=/ class=navbar-brand><img alt=Brand style=height:25px src=/images/beam_logo_navbar.png></a></div><div class="navbar-mask closed"></div><div id=navbar class="navbar-container closed"><ul class="nav navbar-nav"><li><a href=/get-started/beam-overview/>Get Started</a></li><li><a href=/documentation/>Documentation</a></li><li><a href=/documentation/sdks/java/>Languages</a></li><li><a href=/documentation/runners/capability-matrix/>RUNNERS</a></li><li><a href=/roadmap/>Roadmap</a></li>< [...] +the <a href=/documentation/io/developing-io-overview/>new I/O connector overview</a>.</p><p>To connect to a data store that isn’t supported by Beam’s existing I/O connectors, you must create a custom I/O connector that usually consist of a source and a sink. All Beam sources and sinks are composite transforms; however, the implementation of your custom I/O depends on your use case. 
Before you diff --git a/website/generated-content/documentation/programming-guide/index.html b/website/generated-content/documentation/programming-guide/index.html index edca61c..f469c08 100644 --- a/website/generated-content/documentation/programming-guide/index.html +++ b/website/generated-content/documentation/programming-guide/index.html @@ -2835,6 +2835,10 @@ to annotate the <code>DoFn</code>.</p><div class=language-java><div class=highli <span class=k>def</span> <span class=nf>process</span><span class=p>(</span> <span class=bp>self</span><span class=p>,</span> <span class=n>file_name</span><span class=p>,</span> + <span class=c1># Alternatively, we can let FileToWordsFn itself inherit from</span> + <span class=c1># RestrictionProvider, implement the required methods and let</span> + <span class=c1># tracker=beam.DoFn.RestrictionParam() which will use self as</span> + <span class=c1># the provider.</span> <span class=n>tracker</span><span class=o>=</span><span class=n>beam</span><span class=o>.</span><span class=n>DoFn</span><span class=o>.</span><span class=n>RestrictionParam</span><span class=p>(</span><span class=n>FileToWordsRestrictionProvider</span><span class=p>())):</span> <span class=k>with</span> <span class=nb>open</span><span class=p>(</span><span class=n>file_name</span><span class=p>)</span> <span class=k>as</span> <span class=n>file_handle</span><span class=p>:</span> <span class=n>file_handle</span><span class=o>.</span><span class=n>seek</span><span class=p>(</span><span class=n>tracker</span><span class=o>.</span><span class=n>current_restriction</span><span class=o>.</span><span class=n>start</span><span class=p>())</span> diff --git a/website/generated-content/sitemap.xml b/website/generated-content/sitemap.xml index 9662fda..4be5e29 100644 --- a/website/generated-content/sitemap.xml +++ b/website/generated-content/sitemap.xml @@ -1 +1 @@ -<?xml version="1.0" encoding="utf-8" standalone="yes"?><urlset 
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml"><url><loc>/blog/beam-2.25.0/</loc><lastmod>2020-10-29T14:08:19-07:00</lastmod></url><url><loc>/categories/blog/</loc><lastmod>2020-10-29T14:08:19-07:00</lastmod></url><url><loc>/blog/</loc><lastmod>2020-10-29T14:08:19-07:00</lastmod></url><url><loc>/categories/</loc><lastmod>2020-10-29T14:08:19-07:00</lastmod></url><url><loc>/blog/b [...] \ No newline at end of file +<?xml version="1.0" encoding="utf-8" standalone="yes"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml"><url><loc>/blog/beam-2.25.0/</loc><lastmod>2020-10-29T14:08:19-07:00</lastmod></url><url><loc>/categories/blog/</loc><lastmod>2020-10-29T14:08:19-07:00</lastmod></url><url><loc>/blog/</loc><lastmod>2020-10-29T14:08:19-07:00</lastmod></url><url><loc>/categories/</loc><lastmod>2020-10-29T14:08:19-07:00</lastmod></url><url><loc>/blog/b [...] \ No newline at end of file