This is an automated email from the ASF dual-hosted git repository.

git-site-role pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/beam.git
The following commit(s) were added to refs/heads/asf-site by this push:
     new efdbfba Publishing website 2021/12/03 18:03:23 at commit 862ece1
efdbfba is described below

commit efdbfbab9e1221d92319bbc4728d12382197ac73
Author: jenkins <bui...@apache.org>
AuthorDate: Fri Dec 3 18:03:24 2021 +0000

    Publishing website 2021/12/03 18:03:23 at commit 862ece1
---
 website/generated-content/documentation/index.xml  | 39 +++++++++++++-
 .../index.html                                     |  2 +-
 .../runners/capability-matrix/index.html           |  2 +-
 .../runners/capability-matrix/index.xml            | 62 ++++++++++++++++++++++
 .../documentation/runtime/model/index.html         | 29 ++++++++--
 website/generated-content/sitemap.xml              |  2 +-
 6 files changed, 128 insertions(+), 8 deletions(-)

diff --git a/website/generated-content/documentation/index.xml b/website/generated-content/documentation/index.xml
index 7289a86..fc279a0 100644
--- a/website/generated-content/documentation/index.xml
+++ b/website/generated-content/documentation/index.xml
@@ -15395,7 +15395,9 @@ serializing the elements and broadcasting them to all the workers executing the
 <code>ParDo</code>.</li>
 <li>Passing elements between transforms that are running on the same worker.
 This may allow the runner to avoid serializing elements; instead, the runner
-can just pass the elements in memory.</li>
+can just pass the elements in memory. This is done as part of an
+optimization that is known as
+<a href="https://beam.apache.org/documentation/glossary/#fusion">fusion</a>.</li>
 </ul>
 <p>Some situations where the runner may serialize and persist elements are:</p>
 <ol>
@@ -15420,6 +15422,41 @@ choose an appropriate middle-ground between persisting results after every
 element, and having to retry everything if there is a failure. For example, a
 streaming runner may prefer to process and commit small bundles, and a batch
 runner may prefer to process larger bundles.</p>
+<h3 id="data-partitioning-and-inter-stage-execution">Data partitioning and inter-stage execution</h3>
+<p>Partitioning and parallelization of element processing within a Beam pipeline is
+dependent on two things:</p>
+<ul>
+<li>Data source implementation</li>
+<li>Inter-stage key parallelism</li>
+</ul>
+<p>Beam pipelines read data from a source (e.g. <code>KafkaIO</code>, <code>BigQueryIO</code>, <code>JdbcIO</code>,
+or your own source implementation). To implement a Source in Beam one must
+implement it as a Splittable <code>DoFn</code>. A Splittable <code>DoFn</code> provides the runner
+with interfaces to facilitate the splitting of work.</p>
+<p>When running key-based operations in Beam (e.g. <code>GroupByKey</code>, <code>Combine</code>,
+<code>Reshuffle.perKey</code>, and stateful <code>DoFn</code>s), Beam runners perform serialization
+and transfer of data known as <em>shuffle</em><sup>1</sup>. Shuffle allows data
+elements of the same key to be processed together.</p>
+<p>The way in which runners <em>shuffle</em> data may be slightly different for Batch and
+Streaming execution modes.</p>
+<p><sup>1</sup>Not to be confused with the <code>shuffle</code> operation in some runners.</p>
+<h4 id="data-ordering-in-a-pipeline-execution">Data ordering in a pipeline execution</h4>
+<p>The Beam model does not define strict guidelines regarding the order in which
+runners process elements or transport them across <code>PTransforms</code>. Runners are
+free to implement data transfer semantics in different forms.</p>
+<p>Some use cases exist where user pipelines may need to rely on specific ordering
+semantics in pipeline execution. The <a href="/documentation/runners/capability-matrix/additional-common-features-not-yet-part-of-the-beam-model/index.html">capability matrix documents</a>
+runner behavior for <strong>key-ordered delivery</strong>.</p>
+<p>Consider a single Beam worker processing a series of bundles from the same Beam
+transform, and consider a <code>PTransform</code> that outputs data from this Stage into a
+downstream <code>PCollection</code>. Finally, consider two events <em>with the same key</em>
+emitted in a certain order by this worker (within the same bundle or as part of
+different bundles).</p>
+<p>We say that the Beam runner supports <strong>key-ordered delivery</strong> if it guarantees
+that these two events will be observed in the same order by a PTransform that is
+immediately downstream independently of the kind of data transmission method.</p>
+<p>This characteristic will hold true in runners and operations that have
+key-limited parallelism.</p>
 <h2 id="parallelism">Failures and parallelism within and between transforms</h2>
 <p>In this section, we discuss how elements in the input collection are processed
 in parallel, and how transforms are retried when failures occur.</p>
diff --git a/website/generated-content/documentation/runners/capability-matrix/additional-common-features-not-yet-part-of-the-beam-model/index.html b/website/generated-content/documentation/runners/capability-matrix/additional-common-features-not-yet-part-of-the-beam-model/index.html
index f3ef192..4082d17 100644
--- a/website/generated-content/documentation/runners/capability-matrix/additional-common-features-not-yet-part-of-the-beam-model/index.html
+++ b/website/generated-content/documentation/runners/capability-matrix/additional-common-features-not-yet-part-of-the-beam-model/index.html
@@ -18,7 +18,7 @@
 function addPlaceholder(){$('input:text').attr('placeholder',"What are you looking for?");}
 function endSearch(){var search=document.querySelector(".searchBar");search.classList.add("disappear");var icons=document.querySelector("#iconsBar");icons.classList.remove("disappear");}
 function blockScroll(){$("body").toggleClass("fixedPosition");}
-function openMenu(){addPlaceholder();blockScroll();}</script><div class="clearfix container-main-content"><a class=back-button href=/documentation/runners/capability-matrix><i class="fas fa-arrow-left"></i>back to collapsed details</a><h4>Additional common features not yet part of the Beam model</h4><div class=table-container><div class="table-left big-left"><table><tr><th></th></tr><tr><th>Drain</th></tr><tr><th>Checkpoint</th></tr></table></div><div class="table-right big-right"><div i [...]
+function openMenu(){addPlaceholder();blockScroll();}</script><div class="clearfix container-main-content"><a class=back-button href=/documentation/runners/capability-matrix><i class="fas fa-arrow-left"></i>back to collapsed details</a><h4>Additional common features not yet part of the Beam model</h4><div class=table-container><div class="table-left big-left"><table><tr><th></th></tr><tr><th>Drain</th></tr><tr><th>Checkpoint</th></tr><tr><th>Key-ordered delivery</th></tr></table></div><di [...]
 <a href=http://www.apache.org>The Apache Software Foundation</a>
 | <a href=/privacy_policy>Privacy Policy</a>
 | <a href=/feed.xml>RSS Feed</a><br><br>Apache Beam, Apache, Beam, the Beam logo, and the Apache feather logo are either registered trademarks or trademarks of The Apache Software Foundation. All other products or name brands are trademarks of their respective holders, including The Apache Software Foundation.</div></div></div></div></footer></body></html>
\ No newline at end of file
diff --git a/website/generated-content/documentation/runners/capability-matrix/index.html b/website/generated-content/documentation/runners/capability-matrix/index.html
index 057fe9a..245259b 100644
--- a/website/generated-content/documentation/runners/capability-matrix/index.html
+++ b/website/generated-content/documentation/runners/capability-matrix/index.html
@@ -18,7 +18,7 @@
 function addPlaceholder(){$('input:text').attr('placeholder',"What are you looking for?");}
 function endSearch(){var search=document.querySelector(".searchBar");search.classList.add("disappear");var icons=document.querySelector("#iconsBar");icons.classList.remove("disappear");}
 function blockScroll(){$("body").toggleClass("fixedPosition");}
-function openMenu(){addPlaceholder();blockScroll();}</script><div class="clearfix container-main-content"><div class="section-nav closed" data-offset-top=90 data-offset-bottom=500><span class="section-nav-back glyphicon glyphicon-menu-left"></span><nav><ul class=section-nav-list data-section-nav><li><span class=section-nav-list-main-title>Runners</span></li><li><a href=/documentation/runners/capability-matrix/>Capability Matrix</a></li><li><a href=/documentation/runners/direct/>Direct Ru [...]
+function openMenu(){addPlaceholder();blockScroll();}</script><div class="clearfix container-main-content"><div class="section-nav closed" data-offset-top=90 data-offset-bottom=500><span class="section-nav-back glyphicon glyphicon-menu-left"></span><nav><ul class=section-nav-list data-section-nav><li><span class=section-nav-list-main-title>Runners</span></li><li><a href=/documentation/runners/capability-matrix/>Capability Matrix</a></li><li><a href=/documentation/runners/direct/>Direct Ru [...]
 <script>$('.table-headers').scroll(function(e){$('#'+this.id+'.table-center').scrollLeft($(this).scrollLeft());});$('.table-center').scroll(function(e){$('#'+this.id+'.table-headers').scrollLeft($(this).scrollLeft());});</script><div class=feedback><p class=update>Last updated on 2021/02/05</p><h3>Have you found everything you were looking for?</h3><p class=description>Was it all useful and clear? Is there anything that you would like to change? Let us know!</p><button class=load-button> [...]
 <a href=http://www.apache.org>The Apache Software Foundation</a>
 | <a href=/privacy_policy>Privacy Policy</a>
diff --git a/website/generated-content/documentation/runners/capability-matrix/index.xml b/website/generated-content/documentation/runners/capability-matrix/index.xml
index 8ac8d1b..88ede27 100644
--- a/website/generated-content/documentation/runners/capability-matrix/index.xml
+++ b/website/generated-content/documentation/runners/capability-matrix/index.xml
@@ -28,6 +28,9 @@ back to collapsed details
 <tr>
 <th>Checkpoint</th>
 </tr>
+<tr>
+<th>Key-ordered delivery</th>
+</tr>
 </table>
 </div>
 <div class="table-right big-right">
@@ -155,6 +158,65 @@ Samza has a native checkpoint capability.
 <br>
 </td>
 </tr>
+<tr>
+<td style='background-color:#f9f9f9;border-color:#d8d8d8'>
+<b>
+<p>Partially : </p>
+</b>
+<br>
+Dataflow performs different shuffling algorithms for batch and streaming. Dataflow guarantees key-ordered delivery in streaming, though not in batch.
+</td>
+<td style='background-color:#f9f9f9;border-color:#d8d8d8'>
+<b>
+<p>Partially : </p>
+</b>
+<br>
+Flink may perform different shuffling algorithms for batch and streaming. Flink guarantees key-ordered delivery in streaming, though not in batch.
+</td>
+<td style='background-color:#e1e0e0;border-color:#bcbcbc'>
+<b>
+<p>Unverified : </p>
+</b>
+<br>
+</td>
+<td style='background-color:#e1e0e0;border-color:#bcbcbc'>
+<b>
+<p>Unverified : </p>
+</b>
+<br>
+</td>
+<td style='background-color:#e1e0e0;border-color:#bcbcbc'>
+<b>
+<p>Unverified : </p>
+</b>
+<br>
+</td>
+<td style='background-color:#f9f9f9;border-color:#d8d8d8'>
+<b>
+<p>Partially : </p>
+</b>
+<br>
+Samza may perform different shuffling algorithms for batch and streaming. Samza guarantees key-ordered delivery in streaming, though not in batch.
+</td>
+<td style='background-color:#e1e0e0;border-color:#bcbcbc'>
+<b>
+<p>Unverified : </p>
+</b>
+<br>
+</td>
+<td style='background-color:#e1e0e0;border-color:#bcbcbc'>
+<b>
+<p>Unverified : </p>
+</b>
+<br>
+</td>
+<td style='background-color:#e1e0e0;border-color:#bcbcbc'>
+<b>
+<p>Unverified : </p>
+</b>
+<br>
+</td>
+</tr>
 </table>
 </div>
 </div>
diff --git a/website/generated-content/documentation/runtime/model/index.html b/website/generated-content/documentation/runtime/model/index.html
index f612cc3..2a483c4 100644
--- a/website/generated-content/documentation/runtime/model/index.html
+++ b/website/generated-content/documentation/runtime/model/index.html
@@ -18,7 +18,7 @@
 function addPlaceholder(){$('input:text').attr('placeholder',"What are you looking for?");}
 function endSearch(){var search=document.querySelector(".searchBar");search.classList.add("disappear");var icons=document.querySelector("#iconsBar");icons.classList.remove("disappear");}
 function blockScroll(){$("body").toggleClass("fixedPosition");}
-function openMenu(){addPlaceholder();blockScroll();}</script><div class="clearfix container-main-content"><div class="section-nav closed" data-offset-top=90 data-offset-bottom=500><span class="section-nav-back glyphicon glyphicon-menu-left"></span><nav><ul class=section-nav-list data-section-nav><li><span class=section-nav-list-main-title>Documentation</span></li><li><a href=/documentation>Using the Documentation</a></li><li class=section-nav-item--collapsible><span class=section-nav-lis [...]
+function openMenu(){addPlaceholder();blockScroll();}</script><div class="clearfix container-main-content"><div class="section-nav closed" data-offset-top=90 data-offset-bottom=500><span class="section-nav-back glyphicon glyphicon-menu-left"></span><nav><ul class=section-nav-list data-section-nav><li><span class=section-nav-list-main-title>Documentation</span></li><li><a href=/documentation>Using the Documentation</a></li><li class=section-nav-item--collapsible><span class=section-nav-lis [...]
 may observe various effects as a result of the runner’s choices. This page describes these effects so you can better understand how Beam pipelines execute.</p><h2 id=processing-of-elements>Processing of elements</h2><p>The serialization and communication of elements between machines is one of the most expensive operations in a distributed execution of your pipeline. Avoiding
@@ -32,7 +32,9 @@ involve serializing elements and communicating them to other workers.</li><li>Us
 serializing the elements and broadcasting them to all the workers executing the
 <code>ParDo</code>.</li><li>Passing elements between transforms that are running on the same worker.
 This may allow the runner to avoid serializing elements; instead, the runner
-can just pass the elements in memory.</li></ul><p>Some situations where the runner may serialize and persist elements are:</p><ol><li>When used as part of a stateful <code>DoFn</code>, the runner may persist values to some
+can just pass the elements in memory. This is done as part of an
+optimization that is known as
+<a href=https://beam.apache.org/documentation/glossary/#fusion>fusion</a>.</li></ul><p>Some situations where the runner may serialize and persist elements are:</p><ol><li>When used as part of a stateful <code>DoFn</code>, the runner may persist values to some
 state mechanism.</li><li>When committing the results of processing, the runner may persist the outputs as a checkpoint.</li></ol><h3 id=bundling-and-persistence>Bundling and persistence</h3><p>Beam pipelines often focus on “<a href=https://en.wikipedia.org/wiki/embarrassingly_parallel>embarassingly parallel</a>” problems. Because of this, the APIs emphasize processing elements in parallel,
@@ -46,7 +48,26 @@ bundles is arbitrary and selected by the runner. This allows the runner to
 choose an appropriate middle-ground between persisting results after every
 element, and having to retry everything if there is a failure. For example, a
 streaming runner may prefer to process and commit small bundles, and a batch
-runner may prefer to process larger bundles.</p><h2 id=parallelism>Failures and parallelism within and between transforms</h2><p>In this section, we discuss how elements in the input collection are processed
+runner may prefer to process larger bundles.</p><h3 id=data-partitioning-and-inter-stage-execution>Data partitioning and inter-stage execution</h3><p>Partitioning and parallelization of element processing within a Beam pipeline is
+dependent on two things:</p><ul><li>Data source implementation</li><li>Inter-stage key parallelism</li></ul><p>Beam pipelines read data from a source (e.g. <code>KafkaIO</code>, <code>BigQueryIO</code>, <code>JdbcIO</code>,
+or your own source implementation). To implement a Source in Beam one must
+implement it as a Splittable <code>DoFn</code>. A Splittable <code>DoFn</code> provides the runner
+with interfaces to facilitate the splitting of work.</p><p>When running key-based operations in Beam (e.g. <code>GroupByKey</code>, <code>Combine</code>,
+<code>Reshuffle.perKey</code>, and stateful <code>DoFn</code>s), Beam runners perform serialization
+and transfer of data known as <em>shuffle</em><sup>1</sup>. Shuffle allows data
+elements of the same key to be processed together.</p><p>The way in which runners <em>shuffle</em> data may be slightly different for Batch and
+Streaming execution modes.</p><p><sup>1</sup>Not to be confused with the <code>shuffle</code> operation in some runners.</p><h4 id=data-ordering-in-a-pipeline-execution>Data ordering in a pipeline execution</h4><p>The Beam model does not define strict guidelines regarding the order in which
+runners process elements or transport them across <code>PTransforms</code>. Runners are
+free to implement data transfer semantics in different forms.</p><p>Some use cases exist where user pipelines may need to rely on specific ordering
+semantics in pipeline execution. The <a href=/documentation/runners/capability-matrix/additional-common-features-not-yet-part-of-the-beam-model/index.html>capability matrix documents</a>
+runner behavior for <strong>key-ordered delivery</strong>.</p><p>Consider a single Beam worker processing a series of bundles from the same Beam
+transform, and consider a <code>PTransform</code> that outputs data from this Stage into a
+downstream <code>PCollection</code>. Finally, consider two events <em>with the same key</em>
+emitted in a certain order by this worker (within the same bundle or as part of
+different bundles).</p><p>We say that the Beam runner supports <strong>key-ordered delivery</strong> if it guarantees
+that these two events will be observed in the same order by a PTransform that is
+immediately downstream independently of the kind of data transmission method.</p><p>This characteristic will hold true in runners and operations that have
+key-limited parallelism.</p><h2 id=parallelism>Failures and parallelism within and between transforms</h2><p>In this section, we discuss how elements in the input collection are processed
 in parallel, and how transforms are retried when failures occur.</p><h3 id=data-parallelism>Data-parallelism within one transform</h3><p>When executing a single <code>ParDo</code>, a runner might divide an example input collection of nine elements into two bundles as shown in figure 1.</p><p><img src=/images/execution_model_bundling.svg alt="Bundle A contains five elements. Bundle B contains four elements."></p><p><em>Figure 1: A runner divides an input collection into two bundles.</em></p><p>When the <code>ParDo</code> executes, workers may process the two bundles in parallel as shown in figure 2.</p><p><img src=/images/execution_model_bundling_gantt.svg alt="Two workers process the two bundles in parallel. Worker one processes bundle A. Worker two processes bundle B."></p><p><em>Figure 2: Two workers process the two bundles in parallel.</em></p><p>Since elements cannot be split, the maximum parallelism for a transform depends
@@ -87,7 +108,7 @@ elements in the input bundle must be retried. These two <code>ParDo</code>s are
 the input bundle are retried.</em></p><p>Note that the retry does not necessarily have the same processing time as the original attempt, as shown in the diagram.</p><p>All <code>DoFns</code> that experience coupled failures are terminated and must be torn down since they aren’t following the normal <code>DoFn</code> lifecycle .</p><p>Executing transforms this way allows a runner to avoid persisting elements
-between transforms, saving on persistence costs.</p><div class=feedback><p class=update>Last updated on 2020/08/28</p><h3>Have you found everything you were looking for?</h3><p class=description>Was it all useful and clear? Is there anything that you would like to change? Let us know!</p><button class=load-button><a href="mailto:d...@beam.apache.org?subject=Beam Website Feedback">SEND FEEDBACK</a></button></div></div></div><footer class=footer><div class=footer__contained><div class=foote [...]
+between transforms, saving on persistence costs.</p><div class=feedback><p class=update>Last updated on 2021/12/03</p><h3>Have you found everything you were looking for?</h3><p class=description>Was it all useful and clear? Is there anything that you would like to change? Let us know!</p><button class=load-button><a href="mailto:d...@beam.apache.org?subject=Beam Website Feedback">SEND FEEDBACK</a></button></div></div></div><footer class=footer><div class=footer__contained><div class=foote [...]
 <a href=http://www.apache.org>The Apache Software Foundation</a>
 | <a href=/privacy_policy>Privacy Policy</a>
 | <a href=/feed.xml>RSS Feed</a><br><br>Apache Beam, Apache, Beam, the Beam logo, and the Apache feather logo are either registered trademarks or trademarks of The Apache Software Foundation. All other products or name brands are trademarks of their respective holders, including The Apache Software Foundation.</div></div></div></div></footer></body></html>
\ No newline at end of file
diff --git a/website/generated-content/sitemap.xml b/website/generated-content/sitemap.xml
index 424aa32..313115d 100644
--- a/website/generated-content/sitemap.xml
+++ b/website/generated-content/sitemap.xml
@@ -1 +1 @@
-<?xml version="1.0" encoding="utf-8" standalone="yes"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml"><url><loc>/blog/beam-2.34.0/</loc><lastmod>2021-11-11T11:07:06-08:00</lastmod></url><url><loc>/categories/blog/</loc><lastmod>2021-11-11T11:07:06-08:00</lastmod></url><url><loc>/blog/</loc><lastmod>2021-11-11T11:07:06-08:00</lastmod></url><url><loc>/categories/</loc><lastmod>2021-12-01T21:32:04+03:00</lastmod></url><url><loc>/blog/g [...]
\ No newline at end of file
+<?xml version="1.0" encoding="utf-8" standalone="yes"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml"><url><loc>/blog/beam-2.34.0/</loc><lastmod>2021-11-11T11:07:06-08:00</lastmod></url><url><loc>/categories/blog/</loc><lastmod>2021-11-11T11:07:06-08:00</lastmod></url><url><loc>/blog/</loc><lastmod>2021-11-11T11:07:06-08:00</lastmod></url><url><loc>/categories/</loc><lastmod>2021-12-01T21:32:04+03:00</lastmod></url><url><loc>/blog/g [...]
\ No newline at end of file
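
Note on the documentation text added by this commit: the definition of key-ordered delivery in the runtime model page may be easier to follow with a toy model. The sketch below is plain Python, not the Beam API, and does not describe how any runner actually shuffles data. The names `shuffle_by_key` and `observed_order`, the byte-sum partitioner, and the sample events are all hypothetical, chosen only for illustration. It assumes a single upstream worker emitting bundles in order and a shuffle with key-limited parallelism, that is, every event for a given key is routed to exactly one downstream worker.

```python
from collections import defaultdict


def shuffle_by_key(bundles, num_workers):
    """Route every (key, value) event to one downstream worker chosen by key.

    Events are forwarded in emission order, so all events for a given key
    reach a single worker in the order the upstream worker emitted them.
    """
    downstream = defaultdict(list)
    for bundle in bundles:          # bundles, in the order they were produced
        for key, value in bundle:   # events, in emission order within a bundle
            # Deterministic toy partitioner (hypothetical, not a runner's real one).
            worker = sum(key.encode("utf-8")) % num_workers
            downstream[worker].append((key, value))
    return dict(downstream)


def observed_order(downstream, key):
    """Values for `key` in the order the downstream transform would see them."""
    return [v for events in downstream.values() for k, v in events if k == key]


# Two keys whose events are spread across three bundles.
bundles = [
    [("user-1", "login"), ("user-2", "login")],
    [("user-1", "click"), ("user-2", "logout")],
    [("user-1", "logout")],
]
delivered = shuffle_by_key(bundles, num_workers=3)
```

In this model, `observed_order(delivered, "user-1")` yields the events in the order they were emitted, whether they came from the same bundle or from different bundles. A shuffle that scattered one key across several downstream workers, or reordered events while sorting or spilling, would not provide that guarantee, which is consistent with the "Partially" entries the capability matrix records for batch execution.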