This is an automated email from the ASF dual-hosted git repository. git-site-role pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/beam.git
The following commit(s) were added to refs/heads/asf-site by this push: new c4f1f2b Publishing website 2021/11/30 12:01:48 at commit f1010d1 c4f1f2b is described below commit c4f1f2b4a91631cae398e4e1cb743f8b6c266dc6 Author: jenkins <bui...@apache.org> AuthorDate: Tue Nov 30 12:01:48 2021 +0000 Publishing website 2021/11/30 12:01:48 at commit f1010d1 --- .../documentation/basics/index.html | 65 +++++++++++- website/generated-content/documentation/index.xml | 111 +++++++++++++++++++++ website/generated-content/sitemap.xml | 2 +- 3 files changed, 173 insertions(+), 5 deletions(-) diff --git a/website/generated-content/documentation/basics/index.html b/website/generated-content/documentation/basics/index.html index 8dc0acd..7229577 100644 --- a/website/generated-content/documentation/basics/index.html +++ b/website/generated-content/documentation/basics/index.html @@ -18,7 +18,7 @@ function addPlaceholder(){$('input:text').attr('placeholder',"What are you looking for?");} function endSearch(){var search=document.querySelector(".searchBar");search.classList.add("disappear");var icons=document.querySelector("#iconsBar");icons.classList.remove("disappear");} function blockScroll(){$("body").toggleClass("fixedPosition");} -function openMenu(){addPlaceholder();blockScroll();}</script><div class="clearfix container-main-content"><div class="section-nav closed" data-offset-top=90 data-offset-bottom=500><span class="section-nav-back glyphicon glyphicon-menu-left"></span><nav><ul class=section-nav-list data-section-nav><li><span class=section-nav-list-main-title>Documentation</span></li><li><a href=/documentation>Using the Documentation</a></li><li class=section-nav-item--collapsible><span class=section-nav-lis [...] 
+function openMenu(){addPlaceholder();blockScroll();}</script><div class="clearfix container-main-content"><div class="section-nav closed" data-offset-top=90 data-offset-bottom=500><span class="section-nav-back glyphicon glyphicon-menu-left"></span><nav><ul class=section-nav-list data-section-nav><li><span class=section-nav-list-main-title>Documentation</span></li><li><a href=/documentation>Using the Documentation</a></li><li class=section-nav-item--collapsible><span class=section-nav-lis [...] data-parallel processing pipelines. To get started with Beam, you’ll need to understand an important set of core concepts:</p><ul><li><a href=#pipeline><em>Pipeline</em></a> - A pipeline is a user-constructed graph of transformations that defines the desired data processing operations.</li><li><a href=#pcollection><em>PCollection</em></a> - A <code>PCollection</code> is a data set or data @@ -33,7 +33,10 @@ a <code>PCollection</code>. The schema for a <code>PCollection</code> defines el <code>PCollection</code> as an ordered list of named fields.</li><li><a href=/documentation/sdks/java/><em>SDK</em></a> - A language-specific library that lets pipeline authors build transforms, construct their pipelines, and submit them to a runner.</li><li><a href=#runner><em>Runner</em></a> - A runner runs a Beam pipeline using the capabilities of -your chosen data processing engine.</li><li><a href=#splittable-dofn><em>Splittable DoFn</em></a> - Splittable DoFns let you process +your chosen data processing engine.</li><li><a href=#trigger><em>Trigger</em></a> - A trigger determines when to aggregate the results of +each window.</li><li><a href=#state-and-timers><em>State and timers</em></a> - Per-key state and timer callbacks +are lower level primitives that give you full control over aggregating input +collections that grow over time.</li><li><a href=#splittable-dofn><em>Splittable DoFn</em></a> - Splittable DoFns let you process elements in a non-monolithic way. 
You can checkpoint the processing of an element, and the runner can split the remaining work to yield additional parallelism.</li></ul><p>The following sections cover these concepts in more detail and provide links to @@ -185,7 +188,61 @@ Flink runner translates a Beam pipeline into a Flink job. The Direct Runner runs pipelines locally so you can test, debug, and validate that your pipeline adheres to the Apache Beam model as closely as possible.</p><p>For an up-to-date list of Beam runners and which features of the Apache Beam model they support, see the runner -<a href=/documentation/runners/capability-matrix/>capability matrix</a>.</p><p>For more information about runners, see the following pages:</p><ul><li><a href=/documentation/#choosing-a-runner>Choosing a Runner</a></li><li><a href=/documentation/runners/capability-matrix/>Beam Capability Matrix</a></li></ul><h3 id=splittable-dofn>Splittable DoFn</h3><p>Splittable <code>DoFn</code> (SDF) is a generalization of <code>DoFn</code> that lets you process +<a href=/documentation/runners/capability-matrix/>capability matrix</a>.</p><p>For more information about runners, see the following pages:</p><ul><li><a href=/documentation/#choosing-a-runner>Choosing a Runner</a></li><li><a href=/documentation/runners/capability-matrix/>Beam Capability Matrix</a></li></ul><h3 id=trigger>Trigger</h3><p>When collecting and grouping data into windows, Beam uses <em>triggers</em> to +determine when to emit the aggregated results of each window (referred to as a +<em>pane</em>). If you use Beam’s default windowing configuration and default trigger, +Beam outputs the aggregated result when it estimates all data has arrived, and +discards all subsequent data for that window.</p><p>At a high level, triggers provide two additional capabilities compared to +outputting at the end of a window:</p><ol><li>Triggers allow Beam to emit early results, before all the data in a given +window has arrived. 
For example, emitting after a certain amount of time +elapses, or after a certain number of elements arrives.</li><li>Triggers allow processing of late data by triggering after the event time +watermark passes the end of the window.</li></ol><p>These capabilities allow you to control the flow of your data and also balance +between data completeness, latency, and cost.</p><p>Beam provides a number of pre-built triggers that you can set:</p><ul><li><strong>Event time triggers</strong>: These triggers operate on the event time, as +indicated by the timestamp on each data element. Beam’s default trigger is +event time-based.</li><li><strong>Processing time triggers</strong>: These triggers operate on the processing time, +which is the time when the data element is processed at any given stage in +the pipeline.</li><li><strong>Data-driven triggers</strong>: These triggers operate by examining the data as it +arrives in each window, and firing when that data meets a certain property. +Currently, data-driven triggers only support firing after a certain number of +data elements.</li><li><strong>Composite triggers</strong>: These triggers combine multiple triggers in various +ways. For example, you might want one trigger for early data and a different +trigger for late data.</li></ul><p>For more information about triggers, see the following page:</p><ul><li><a href=/documentation/programming-guide/#triggers>Beam Programming Guide: Triggers</a></li></ul><h3 id=state-and-timers>State and timers</h3><p>Beam’s windowing and triggers provide an abstraction for grouping and +aggregating unbounded input data based on timestamps. However, there are +aggregation use cases that might require an even higher degree of control. State +and timers are two important concepts that help with these use cases. 
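To make the trigger behaviors above concrete, here is a minimal stand-alone sketch (plain Python, not the Beam trigger API; the class and method names are invented for illustration) of a data-driven trigger that fires a pane once a window has buffered a fixed number of elements:

```python
from collections import defaultdict

class AfterCountSim:
    """Toy model of a data-driven trigger: buffer elements per window
    and fire a pane once `count` elements have arrived. Illustrative
    only; this is not the Beam trigger API."""

    def __init__(self, count):
        self.count = count
        self.buffers = defaultdict(list)  # window -> buffered elements

    def on_element(self, window, element):
        """Buffer an element; return a pane (list) when the trigger fires."""
        buf = self.buffers[window]
        buf.append(element)
        if len(buf) >= self.count:
            pane, self.buffers[window] = list(buf), []
            return pane  # an early result for this window
        return None

trig = AfterCountSim(count=3)
panes = [p for e in range(7) if (p := trig.on_element("w0", e)) is not None]
# Two panes fire ([0, 1, 2] and [3, 4, 5]); element 6 stays buffered
# until more data arrives or the window closes.
```

A composite trigger could be sketched the same way by combining several such objects, for example one firing early panes by count and another firing at the watermark.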
Like +other aggregations, state and timers are processed per window.</p><p><strong>State</strong>:</p><p>Beam provides the State API for manually managing per-key state, allowing for +fine-grained control over aggregations. The State API lets you augment +element-wise operations (for example, <code>ParDo</code> or <code>Map</code>) with mutable state. Like +other aggregations, state is processed per window.</p><p>The State API models state per key. To use the state API, you start out with a +keyed <code>PCollection</code>. A <code>ParDo</code> that processes this <code>PCollection</code> can declare +persistent state variables. When you process each element inside the <code>ParDo</code>, +you can use the state variables to write or update state for the current key or +to read previous state written for that key. State is always fully scoped only +to the current processing key.</p><p>Beam provides several types of state, though different runners might support a +different subset of these states.</p><ul><li><strong>ValueState</strong>: ValueState is a scalar state value. For each key in the +input, a ValueState stores a typed value that can be read and modified inside +the <code>DoFn</code>.</li><li>A common use case for state is to accumulate multiple elements into a group:<ul><li><strong>BagState</strong>: BagState allows you to accumulate elements in an unordered +bag. This lets you add elements to a collection without needing to read any +of the previously accumulated elements.</li><li><strong>MapState</strong>: MapState allows you to accumulate elements in a map.</li><li><strong>SetState</strong>: SetState allows you to accumulate elements in a set.</li><li><strong>OrderedListState</strong>: OrderedListState allows you to accumulate elements in +a timestamp-sorted list.</li></ul></li><li><strong>CombiningState</strong>: CombiningState allows you to create a state object that +is updated using a Beam combiner. 
Like BagState, you can add elements to an +aggregation without needing to read the current value, and the accumulator +can be compacted using a combiner.</li></ul><p>You can use the State API together with the Timer API to create processing tasks +that give you fine-grained control over the workflow.</p><p><strong>Timers</strong>:</p><p>Beam provides a per-key timer callback API that enables delayed processing of +data stored using the State API. The Timer API lets you set timers to call back +at either an event-time or a processing-time timestamp. For more advanced use +cases, your timer callback can set another timer. Like other aggregations, +timers are processed per window. You can use the timer API together with the +State API to create processing tasks that give you fine-grained control over the +workflow.</p><p>The following timers are available:</p><ul><li><strong>Event-time timers</strong>: Event-time timers fire when the input watermark for +the <code>DoFn</code> passes the time at which the timer is set, meaning that the runner +believes that there are no more elements to be processed with timestamps +before the timer timestamp. This allows for event-time aggregations.</li><li><strong>Processing-time timers</strong>: Processing-time timers fire when the real wall-clock +time passes the time at which the timer is set. This is often used to create larger batches of data before +processing. It can also be used to schedule events that should occur at a +specific time.</li><li><strong>Dynamic timer tags</strong>: Beam also supports dynamically setting a timer tag. 
This +allows you to set multiple different timers in a <code>DoFn</code> and dynamically +choose timer tags (for example, based on data in the input elements).</li></ul><p>For more information about state and timers, see the following pages:</p><ul><li><a href=/documentation/programming-guide/#state-and-timers>Beam Programming Guide: State and Timers</a></li><li><a href=/blog/stateful-processing/>Stateful processing with Apache Beam</a></li><li><a href=/blog/timely-processing/>Timely (and Stateful) Processing with Apache Beam</a></li></ul><h3 id=splittable-dofn>Splittable DoF [...] elements in a non-monolithic way. Splittable <code>DoFn</code> makes it easier to create complex, modular I/O connectors in Beam.</p><p>A regular <code>ParDo</code> processes an entire element at a time, applying your regular <code>DoFn</code> and waiting for the call to terminate. When you instead apply a @@ -202,7 +259,7 @@ checkpoint the sub-element and the runner repeats step 2.</li></ol><p>You can al processing. For example, if you write a splittable <code>DoFn</code> to watch a set of directories and output filenames as they arrive, you can split to subdivide the work of different directories. This allows the runner to split off a hot -directory and give it additional resources.</p><p>For more information about Splittable <code>DoFn</code>, see the following pages:</p><ul><li><a href=/documentation/programming-guide/#splittable-dofns>Splittable DoFns</a></li><li><a href=/blog/splittable-do-fn-is-available/>Splittable DoFn in Apache Beam is Ready to Use</a></li></ul><div class=feedback><p class=update>Last updated on 2021/10/26</p><h3>Have you found everything you were looking for?</h3><p class=description>Was it all us [...] 
+directory and give it additional resources.</p><p>For more information about Splittable <code>DoFn</code>, see the following pages:</p><ul><li><a href=/documentation/programming-guide/#splittable-dofns>Splittable DoFns</a></li><li><a href=/blog/splittable-do-fn-is-available/>Splittable DoFn in Apache Beam is Ready to Use</a></li></ul><div class=feedback><p class=update>Last updated on 2021/10/21</p><h3>Have you found everything you were looking for?</h3><p class=description>Was it all us [...] <a href=http://www.apache.org>The Apache Software Foundation</a> | <a href=/privacy_policy>Privacy Policy</a> | <a href=/feed.xml>RSS Feed</a><br><br>Apache Beam, Apache, Beam, the Beam logo, and the Apache feather logo are either registered trademarks or trademarks of The Apache Software Foundation. All other products or name brands are trademarks of their respective holders, including The Apache Software Foundation.</div></div></div></div></footer></body></html> \ No newline at end of file diff --git a/website/generated-content/documentation/index.xml b/website/generated-content/documentation/index.xml index 63d9d83..7289a86 100644 --- a/website/generated-content/documentation/index.xml +++ b/website/generated-content/documentation/index.xml @@ -3205,6 +3205,11 @@ pipeline authors build transforms, construct their pipelines, and submit them to a runner.</li> <li><a href="#runner"><em>Runner</em></a> - A runner runs a Beam pipeline using the capabilities of your chosen data processing engine.</li> +<li><a href="#trigger"><em>Trigger</em></a> - A trigger determines when to aggregate the results of +each window.</li> +<li><a href="#state-and-timers"><em>State and timers</em></a> - Per-key state and timer callbacks +are lower level primitives that give you full control over aggregating input +collections that grow over time.</li> <li><a href="#splittable-dofn"><em>Splittable DoFn</em></a> - Splittable DoFns let you process elements in a non-monolithic way. 
You can checkpoint the processing of an element, and the runner can split the remaining work to yield additional @@ -3477,6 +3482,112 @@ model they support, see the runner <li><a href="/documentation/#choosing-a-runner">Choosing a Runner</a></li> <li><a href="/documentation/runners/capability-matrix/">Beam Capability Matrix</a></li> </ul> +<h3 id="trigger">Trigger</h3> +<p>When collecting and grouping data into windows, Beam uses <em>triggers</em> to +determine when to emit the aggregated results of each window (referred to as a +<em>pane</em>). If you use Beam’s default windowing configuration and default trigger, +Beam outputs the aggregated result when it estimates all data has arrived, and +discards all subsequent data for that window.</p> +<p>At a high level, triggers provide two additional capabilities compared to +outputting at the end of a window:</p> +<ol> +<li>Triggers allow Beam to emit early results, before all the data in a given +window has arrived. For example, emitting after a certain amount of time +elapses, or after a certain number of elements arrives.</li> +<li>Triggers allow processing of late data by triggering after the event time +watermark passes the end of the window.</li> +</ol> +<p>These capabilities allow you to control the flow of your data and also balance +between data completeness, latency, and cost.</p> +<p>Beam provides a number of pre-built triggers that you can set:</p> +<ul> +<li><strong>Event time triggers</strong>: These triggers operate on the event time, as +indicated by the timestamp on each data element. Beam’s default trigger is +event time-based.</li> +<li><strong>Processing time triggers</strong>: These triggers operate on the processing time, +which is the time when the data element is processed at any given stage in +the pipeline.</li> +<li><strong>Data-driven triggers</strong>: These triggers operate by examining the data as it +arrives in each window, and firing when that data meets a certain property. 
+Currently, data-driven triggers only support firing after a certain number of +data elements.</li> +<li><strong>Composite triggers</strong>: These triggers combine multiple triggers in various +ways. For example, you might want one trigger for early data and a different +trigger for late data.</li> +</ul> +<p>For more information about triggers, see the following page:</p> +<ul> +<li><a href="/documentation/programming-guide/#triggers">Beam Programming Guide: Triggers</a></li> +</ul> +<h3 id="state-and-timers">State and timers</h3> +<p>Beam’s windowing and triggers provide an abstraction for grouping and +aggregating unbounded input data based on timestamps. However, there are +aggregation use cases that might require an even higher degree of control. State +and timers are two important concepts that help with these use cases. Like +other aggregations, state and timers are processed per window.</p> +<p><strong>State</strong>:</p> +<p>Beam provides the State API for manually managing per-key state, allowing for +fine-grained control over aggregations. The State API lets you augment +element-wise operations (for example, <code>ParDo</code> or <code>Map</code>) with mutable state. Like +other aggregations, state is processed per window.</p> +<p>The State API models state per key. To use the state API, you start out with a +keyed <code>PCollection</code>. A <code>ParDo</code> that processes this <code>PCollection</code> can declare +persistent state variables. When you process each element inside the <code>ParDo</code>, +you can use the state variables to write or update state for the current key or +to read previous state written for that key. State is always fully scoped only +to the current processing key.</p> +<p>Beam provides several types of state, though different runners might support a +different subset of these states.</p> +<ul> +<li><strong>ValueState</strong>: ValueState is a scalar state value. 
For each key in the +input, a ValueState stores a typed value that can be read and modified inside +the <code>DoFn</code>.</li> +<li>A common use case for state is to accumulate multiple elements into a group: +<ul> +<li><strong>BagState</strong>: BagState allows you to accumulate elements in an unordered +bag. This lets you add elements to a collection without needing to read any +of the previously accumulated elements.</li> +<li><strong>MapState</strong>: MapState allows you to accumulate elements in a map.</li> +<li><strong>SetState</strong>: SetState allows you to accumulate elements in a set.</li> +<li><strong>OrderedListState</strong>: OrderedListState allows you to accumulate elements in +a timestamp-sorted list.</li> +</ul> +</li> +<li><strong>CombiningState</strong>: CombiningState allows you to create a state object that +is updated using a Beam combiner. Like BagState, you can add elements to an +aggregation without needing to read the current value, and the accumulator +can be compacted using a combiner.</li> +</ul> +<p>You can use the State API together with the Timer API to create processing tasks +that give you fine-grained control over the workflow.</p> +<p><strong>Timers</strong>:</p> +<p>Beam provides a per-key timer callback API that enables delayed processing of +data stored using the State API. The Timer API lets you set timers to call back +at either an event-time or a processing-time timestamp. For more advanced use +cases, your timer callback can set another timer. Like other aggregations, +timers are processed per window. 
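The state-and-timers idea can be sketched in miniature (plain Python, not the Beam State/Timer API; all names here are invented for illustration): each key accumulates elements in a BagState-like buffer and sets an event-time timer, and when the watermark passes a key's timer, that key's buffer is flushed:

```python
from collections import defaultdict

class StatefulBufferSim:
    """Toy model of per-key state plus an event-time timer: each key
    buffers elements (like BagState) and sets a timer; when the
    watermark passes a key's timer, its buffer is flushed.
    Illustrative only; not the Beam State/Timer API."""

    def __init__(self, flush_delay):
        self.flush_delay = flush_delay
        self.bags = defaultdict(list)   # per-key "BagState"
        self.timers = {}                # per-key event-time timer

    def process(self, key, value, event_time):
        self.bags[key].append(value)
        # Set a timer `flush_delay` past the key's first element
        # (kept unchanged if one is already pending for the key).
        self.timers.setdefault(key, event_time + self.flush_delay)

    def advance_watermark(self, watermark):
        """Fire every timer the watermark has passed; return flushed output."""
        output = []
        for key in [k for k, t in self.timers.items() if watermark >= t]:
            output.append((key, self.bags.pop(key)))
            del self.timers[key]
        return output

buf = StatefulBufferSim(flush_delay=10)
buf.process("a", 1, event_time=0)
buf.process("a", 2, event_time=5)
buf.process("b", 7, event_time=8)
early = buf.advance_watermark(9)     # no timer has fired yet -> []
flushed = buf.advance_watermark(12)  # key "a" timer (set for time 10) fires
```

This mirrors the event-time timer semantics described above: nothing fires until the watermark claims no earlier-timestamped elements remain, at which point the buffered state for the affected key is emitted.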
You can use the timer API together with the +State API to create processing tasks that give you fine-grained control over the +workflow.</p> +<p>The following timers are available:</p> +<ul> +<li><strong>Event-time timers</strong>: Event-time timers fire when the input watermark for +the <code>DoFn</code> passes the time at which the timer is set, meaning that the runner +believes that there are no more elements to be processed with timestamps +before the timer timestamp. This allows for event-time aggregations.</li> +<li><strong>Processing-time timers</strong>: Processing-time timers fire when the real wall-clock +time passes the time at which the timer is set. This is often used to create larger batches of data before +processing. It can also be used to schedule events that should occur at a +specific time.</li> +<li><strong>Dynamic timer tags</strong>: Beam also supports dynamically setting a timer tag. This +allows you to set multiple different timers in a <code>DoFn</code> and dynamically +choose timer tags (for example, based on data in the input elements).</li> +</ul> +<p>For more information about state and timers, see the following pages:</p> +<ul> +<li><a href="/documentation/programming-guide/#state-and-timers">Beam Programming Guide: State and Timers</a></li> +<li><a href="/blog/stateful-processing/">Stateful processing with Apache Beam</a></li> +<li><a href="/blog/timely-processing/">Timely (and Stateful) Processing with Apache Beam</a></li> +</ul> <h3 id="splittable-dofn">Splittable DoFn</h3> <p>Splittable <code>DoFn</code> (SDF) is a generalization of <code>DoFn</code> that lets you process elements in a non-monolithic way. 
Splittable <code>DoFn</code> makes it easier to create diff --git a/website/generated-content/sitemap.xml b/website/generated-content/sitemap.xml index bd4d515..69b8740 100644 --- a/website/generated-content/sitemap.xml +++ b/website/generated-content/sitemap.xml @@ -1 +1 @@ -<?xml version="1.0" encoding="utf-8" standalone="yes"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml"><url><loc>/blog/beam-2.34.0/</loc><lastmod>2021-11-11T11:07:06-08:00</lastmod></url><url><loc>/categories/blog/</loc><lastmod>2021-11-11T11:07:06-08:00</lastmod></url><url><loc>/blog/</loc><lastmod>2021-11-11T11:07:06-08:00</lastmod></url><url><loc>/categories/</loc><lastmod>2021-11-11T11:07:06-08:00</lastmod></url><url><loc>/blog/g [...] \ No newline at end of file +<?xml version="1.0" encoding="utf-8" standalone="yes"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml"><url><loc>/blog/beam-2.34.0/</loc><lastmod>2021-11-11T11:07:06-08:00</lastmod></url><url><loc>/categories/blog/</loc><lastmod>2021-11-11T11:07:06-08:00</lastmod></url><url><loc>/blog/</loc><lastmod>2021-11-11T11:07:06-08:00</lastmod></url><url><loc>/categories/</loc><lastmod>2021-11-11T11:07:06-08:00</lastmod></url><url><loc>/blog/g [...] \ No newline at end of file
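The checkpoint-and-split behavior of Splittable DoFn described in the page above can be sketched with a toy offset-range restriction (plain Python, not the Beam SDF API; the function and variable names are invented for illustration):

```python
def split_restriction(start, stop, done):
    """Toy model of splittable-DoFn checkpointing: the work for one
    element is described by an offset-range restriction [start, stop).
    After processing up to `done`, the primary keeps [start, done) and
    the residual [done, stop) can be handed back to the runner for
    parallel processing. Illustrative only; not the Beam SDF API."""
    assert start <= done <= stop
    primary = (start, done)    # work completed (or committed to) so far
    residual = (done, stop)    # remaining work the runner may reassign
    return primary, residual

# A reader that has consumed offsets [0, 40) of a [0, 100) restriction
# checkpoints; the runner can then schedule [40, 100) on another worker,
# e.g. to split off a hot directory and give it additional resources.
primary, residual = split_restriction(0, 100, done=40)
```

The key property, as in the text above, is that splitting yields two restrictions whose union covers exactly the original work with no overlap, so the runner can redistribute the residual without reprocessing or losing elements.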