This is an automated email from the ASF dual-hosted git repository. git-site-role pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/beam.git
The following commit(s) were added to refs/heads/asf-site by this push: new 017fed1 Publishing website 2021/10/22 00:01:40 at commit 3d8213a 017fed1 is described below commit 017fed11ef4145b5e545b8fd1017eddfbc74fa2b Author: jenkins <bui...@apache.org> AuthorDate: Fri Oct 22 00:01:41 2021 +0000 Publishing website 2021/10/22 00:01:40 at commit 3d8213a --- .../documentation/basics/index.html | 135 +++++++++----- website/generated-content/documentation/index.html | 27 ++- website/generated-content/documentation/index.xml | 206 ++++++++++++++------- website/generated-content/sitemap.xml | 2 +- 4 files changed, 251 insertions(+), 119 deletions(-) diff --git a/website/generated-content/documentation/basics/index.html b/website/generated-content/documentation/basics/index.html index 13606d4..3de8287 100644 --- a/website/generated-content/documentation/basics/index.html +++ b/website/generated-content/documentation/basics/index.html @@ -18,62 +18,101 @@ function addPlaceholder(){$('input:text').attr('placeholder',"What are you looking for?");} function endSearch(){var search=document.querySelector(".searchBar");search.classList.add("disappear");var icons=document.querySelector("#iconsBar");icons.classList.remove("disappear");} function blockScroll(){$("body").toggleClass("fixedPosition");} -function openMenu(){addPlaceholder();blockScroll();}</script><div class="clearfix container-main-content"><div class="section-nav closed" data-offset-top=90 data-offset-bottom=500><span class="section-nav-back glyphicon glyphicon-menu-left"></span><nav><ul class=section-nav-list data-section-nav><li><span class=section-nav-list-main-title>Documentation</span></li><li><a href=/documentation>Using the Documentation</a></li><li class=section-nav-item--collapsible><span class=section-nav-lis [...] 
+function openMenu(){addPlaceholder();blockScroll();}</script><div class="clearfix container-main-content"><div class="section-nav closed" data-offset-top=90 data-offset-bottom=500><span class="section-nav-back glyphicon glyphicon-menu-left"></span><nav><ul class=section-nav-list data-section-nav><li><span class=section-nav-list-main-title>Documentation</span></li><li><a href=/documentation>Using the Documentation</a></li><li class=section-nav-item--collapsible><span class=section-nav-lis [...] of operations. You want to integrate it with the Beam ecosystem to get access to other languages, great event time processing, and a library of connectors. -You need to know the core vocabulary:</p><ul><li><a href=#pipeline><em>Pipeline</em></a> - A pipeline is a graph of transformations that a user constructs -that defines the data processing they want to do.</li><li><a href=#pcollections><em>PCollection</em></a> - Data being processed in a pipeline is part of a PCollection.</li><li><a href=#ptransforms><em>PTransforms</em></a> - The operations executed within a pipeline. These are best -thought of as operations on PCollections.</li><li><em>SDK</em> - A language-specific library for pipeline authors (we often call them +You need to know the core vocabulary:</p><ul><li><a href=#pipeline><em>Pipeline</em></a> - A pipeline is a user-constructed graph of +transformations that defines the desired data processing operations.</li><li><a href=#pcollection><em>PCollection</em></a> - A <code>PCollection</code> is a data set or data +stream. The data that a pipeline processes is part of a PCollection.</li><li><a href=#ptransform><em>PTransform</em></a> - A <code>PTransform</code> (or <em>transform</em>) represents a +data processing operation, or a step, in your pipeline. 
A transform is +applied to zero or more <code>PCollection</code> objects, and produces zero or more +<code>PCollection</code> objects.</li><li><em>SDK</em> - A language-specific library for pipeline authors (we often call them “users” even though we have many kinds of users) to build transforms, construct their pipelines and submit them to a runner</li><li><em>Runner</em> - You are going to write a piece of software called a runner that takes a Beam pipeline and executes it using the capabilities of your data processing engine.</li></ul><p>These concepts may be very similar to your processing engine’s concepts. Since Beam’s design is for cross-language operation and reusable libraries of -transforms, there are some special features worth highlighting.</p><h3 id=pipeline>Pipeline</h3><p>A pipeline in Beam is a graph of PTransforms operating on PCollections. A -pipeline is constructed by a user in their SDK of choice, and makes its way to -your runner either via the SDK directly or via the Runner API’s -RPC interfaces.</p><h3 id=ptransforms>PTransforms</h3><p>A <code>PTransform</code> represents a data processing operation, or a step, -in your pipeline. A <code>PTransform</code> can be applied to one or more -<code>PCollection</code> objects as input which performs some processing on the elements of that -<code>PCollection</code> and produces zero or more output <code>PCollection</code> objects.</p><h3 id=pcollections>PCollections</h3><p>A PCollection is an unordered bag of elements. Your runner will be responsible -for storing these elements. 
There are some major aspects of a PCollection to -note:</p><h4 id=bounded-vs-unbounded>Bounded vs Unbounded</h4><p>A PCollection may be bounded or unbounded.</p><ul><li><em>Bounded</em> - it is finite and you know it, as in batch use cases</li><li><em>Unbounded</em> - it may be never end, you don’t know, as in streaming use cases</li></ul><p>These derive from the intuitions of batch and stream processing, but the two -are unified in Beam and bounded and unbounded PCollections can coexist in the -same pipeline. If your runner can only support bounded PCollections, you’ll -need to reject pipelines that contain unbounded PCollections. If your -runner is only really targeting streams, there are adapters in our support code -to convert everything to APIs targeting unbounded data.</p><h4 id=timestamps>Timestamps</h4><p>Every element in a PCollection has a timestamp associated with it.</p><p>When you execute a primitive connector to some storage system, that connector -is responsible for providing initial timestamps. Your runner will need to -propagate and aggregate timestamps. If the timestamp is not important, as with -certain batch processing jobs where elements do not denote events, they will be -the minimum representable timestamp, often referred to colloquially as -“negative infinity”.</p><h4 id=watermarks>Watermarks</h4><p>Every PCollection has to have a watermark that estimates how complete the -PCollection is.</p><p>The watermark is a guess that “we’ll never see an element with an earlier -timestamp”. Sources of data are responsible for producing a watermark. Your -runner needs to implement watermark propagation as PCollections are processed, -merged, and partitioned.</p><p>The contents of a PCollection are complete when a watermark advances to -“infinity”. In this manner, you may discover that an unbounded PCollection is -finite.</p><h4 id=windowed-elements>Windowed elements</h4><p>Every element in a PCollection resides in a window. 
No element resides in -multiple windows (two elements can be equal except for their window, but they -are not the same).</p><p>When elements are read from the outside world they arrive in the global window. +transforms, there are some special features worth highlighting.</p><h3 id=pipeline>Pipeline</h3><p>A Beam pipeline is a graph (specifically, a +<a href=https://en.wikipedia.org/wiki/Directed_acyclic_graph>directed acyclic graph</a>) +of all the data and computations in your data processing task. This includes +reading input data, transforming that data, and writing output data. A pipeline +is constructed by a user in their SDK of choice. Then, the pipeline makes its +way to the runner either through the SDK directly or through the Runner API’s +RPC interface. For example, this diagram shows a branching pipeline:</p><p><img src=/images/design-your-pipeline-multiple-pcollections.svg alt="The pipeline applies two transforms to a single input collection. Each transform produces an output collection."></p><p>In this diagram, the boxes represent the parallel computations called +<a href=#ptransform><em>PTransforms</em></a> and the arrows with the circles represent the data +(in the form of <a href=#pcollection><em>PCollections</em></a>) that flows between the +transforms. The data might be bounded, stored data sets, or the data might also +be unbounded streams of data. In Beam, most transforms apply equally to bounded +and unbounded data.</p><p>You can express almost any computation that you can think of as a graph as a +Beam pipeline.
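To make the graph idea concrete, here is a minimal plain-Python sketch of a branching pipeline. This is a conceptual model only, not the Beam SDK API; the function `apply_transform` and the sample data are illustrative.

```python
# Conceptual model of a branching pipeline: one input PCollection,
# two independent PTransforms, two output PCollections.
# This sketches the graph shape only; it is NOT the Beam SDK API.

def apply_transform(pcollection, fn):
    """A 'transform': consumes one collection, produces a new one."""
    return [fn(element) for element in pcollection]

names = ["Alice", "Andrew", "Bob"]                       # input PCollection
upper = apply_transform(names, str.upper)                # branch 1
greetings = apply_transform(names, lambda n: "Hi " + n)  # branch 2
```

Note that neither branch mutates `names`; each transform produces a new collection, mirroring how PCollections are immutable in the Beam model.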
A Beam driver program typically starts by creating a <code>Pipeline</code> +object, and then uses that object as the basis for creating the pipeline’s data +sets and its transforms.</p><p>For more information about pipelines, see the following pages:</p><ul><li><a href=/documentation/programming-guide/#overview>Beam Programming Guide: Overview</a></li><li><a href=/documentation/programming-guide/#creating-a-pipeline>Beam Programming Guide: Creating a pipeline</a></li><li><a href=/documentation/pipelines/design-your-pipeline>Design your pipeline</a></li><li><a href=/documentation/pipeline/create-your-pipeline>Create your pipeline</a></li></ul [...] +in your pipeline. A transform is usually applied to one or more input +<code>PCollection</code> objects. Transforms that read input are an exception; these +transforms might not have an input <code>PCollection</code>.</p><p>You provide transform processing logic in the form of a function object +(colloquially referred to as “user code”), and your user code is applied to each +element of the input PCollection (or more than one PCollection). Depending on +the pipeline runner and backend that you choose, many different workers across a +cluster might execute instances of your user code in parallel. The user code +that runs on each worker generates the output elements that are added to zero or +more output <code>PCollection</code> objects.</p><p>The Beam SDKs contain a number of different transforms that you can apply to +your pipeline’s PCollections. These include general-purpose core transforms, +such as <code>ParDo</code> or <code>Combine</code>. There are also pre-written composite transforms +included in the SDKs, which combine one or more of the core transforms in a +useful processing pattern, such as counting or combining elements in a +collection. 
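The per-element "user code" idea above can be sketched in plain Python. This is a conceptual model of ParDo-style processing, not the actual Beam API; `par_do` is an illustrative name.

```python
# Sketch of ParDo-style per-element processing: user code runs once per
# input element and may emit zero, one, or many output elements.
# Illustrative only; not the Beam SDK.

def par_do(pcollection, user_code):
    out = []
    for element in pcollection:
        out.extend(user_code(element))  # user code yields 0..n outputs
    return out

lines = ["the quick fox", "", "jumps"]
# str.split on an empty line emits nothing, showing the "zero outputs" case.
words = par_do(lines, lambda line: line.split())
```

In a real runner, many workers would execute this loop body in parallel over partitions of the input.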
You can also define your own more complex composite transforms to +fit your pipeline’s exact use case.</p><p>The following list has some common transform types:</p><ul><li>Source transforms such as <code>TextIO.Read</code> and <code>Create</code>. A source transform +conceptually has no input.</li><li>Processing and conversion operations such as <code>ParDo</code>, <code>GroupByKey</code>, +<code>CoGroupByKey</code>, <code>Combine</code>, and <code>Count</code>.</li><li>Outputting transforms such as <code>TextIO.Write</code>.</li><li>User-defined, application-specific composite transforms.</li></ul><p>For more information about transforms, see the following pages:</p><ul><li><a href=/documentation/programming-guide/#overview>Beam Programming Guide: Overview</a></li><li><a href=/documentation/programming-guide/#transforms>Beam Programming Guide: Transforms</a></li><li>Beam t [...] +<a href=/documentation/transforms/python/overview/>Python</a>)</li></ul><h3 id=pcollection>PCollection</h3><p>A <code>PCollection</code> is an unordered bag of elements. Each <code>PCollection</code> is a +potentially distributed, homogeneous data set or data stream, and is owned by +the specific <code>Pipeline</code> object for which it is created. Multiple pipelines +cannot share a <code>PCollection</code>. Beam pipelines process PCollections, and the +runner is responsible for storing these elements.</p><p>A <code>PCollection</code> generally contains “big data” (too much data to fit in memory on +a single machine). Sometimes a small sample of data or an intermediate result +might fit into memory on a single machine, but Beam’s computational patterns and +transforms are focused on situations where distributed data-parallel computation +is required. 
Therefore, the elements of a <code>PCollection</code> cannot be processed +individually, and are instead processed uniformly in parallel.</p><p>The following characteristics of a <code>PCollection</code> are important to know.</p><h4 id=bounded-vs-unbounded>Bounded vs unbounded</h4><p>A <code>PCollection</code> can be either bounded or unbounded.</p><ul><li>A <em>bounded</em> <code>PCollection</code> is a dataset of a known, fixed size (alternatively, +a dataset that is not growing over time). Bounded data can be processed by +batch pipelines.</li><li>An <em>unbounded</em> <code>PCollection</code> is a dataset that grows over time, and the +elements are processed as they arrive. Unbounded data must be processed by +streaming pipelines.</li></ul><p>These two categories derive from the intuitions of batch and stream processing, +but the two are unified in Beam and bounded and unbounded PCollections can +coexist in the same pipeline. If your runner can only support bounded +PCollections, you must reject pipelines that contain unbounded PCollections. If +your runner is only targeting streams, there are adapters in Beam’s support code +to convert everything to APIs that target unbounded data.</p><h4 id=timestamps>Timestamps</h4><p>Every element in a <code>PCollection</code> has a timestamp associated with it.</p><p>When you execute a primitive connector to a storage system, that connector is +responsible for providing initial timestamps. The runner must propagate and +aggregate timestamps. If the timestamp is not important, such as with certain +batch processing jobs where elements do not denote events, the timestamp will be +the minimum representable timestamp, often referred to colloquially as “negative +infinity”.</p><h4 id=watermarks>Watermarks</h4><p>Every <code>PCollection</code> must have a watermark that estimates how complete the +<code>PCollection</code> is.</p><p>The watermark is a guess that “we’ll never see an element with an earlier +timestamp”. 
Data sources are responsible for producing a watermark. The runner +must implement watermark propagation as PCollections are processed, merged, and +partitioned.</p><p>The contents of a <code>PCollection</code> are complete when a watermark advances to +“infinity”. In this manner, you can discover that an unbounded PCollection is +finite.</p><h4 id=windowed-elements>Windowed elements</h4><p>Every element in a <code>PCollection</code> resides in a window. No element resides in +multiple windows; two elements can be equal except for their window, but they +are not the same.</p><p>When elements are read from the outside world, they arrive in the global window. When they are written to the outside world, they are effectively placed back -into the global window (any writing transform that doesn’t take this -perspective probably risks data loss).</p><p>A window has a maximum timestamp, and when the watermark exceeds this plus -user-specified allowed lateness the window is expired. All data related -to an expired window may be discarded at any time.</p><h4 id=coder>Coder</h4><p>Every PCollection has a coder, a specification of the binary format of the elements.</p><p>In Beam, the user’s pipeline may be written in a language other than the +into the global window. Transforms that write data and don’t take this +perspective probably risk data loss.</p><p>A window has a maximum timestamp. When the watermark exceeds the maximum +timestamp plus the user-specified allowed lateness, the window is expired. All +data related to an expired window might be discarded at any time.</p><h4 id=coder>Coder</h4><p>Every <code>PCollection</code> has a coder, which is a specification of the binary format +of the elements.</p><p>In Beam, the user’s pipeline can be written in a language other than the language of the runner. There is no expectation that the runner can actually -deserialize user data. So the Beam model operates principally on encoded data - -“just bytes”.
Each PCollection has a declared encoding for its elements, called -a coder. A coder has a URN that identifies the encoding, and may have -additional sub-coders (for example, a coder for lists may contain a coder for -the elements of the list). Language-specific serialization techniques can, and -frequently are used, but there are a few key formats - such as key-value pairs -and timestamps - that are common so your runner can understand them.</p><h4 id=windowing-strategy>Windowing Strategy</h4><p>Every PCollection has a windowing strategy, a specification of essential -information for grouping and triggering operations.</p><p>The details will be discussed below when we discuss the -<a href=#implementing-the-window-primitive>Window</a> primitive, which sets up the -windowing strategy, and -<a href=#implementing-the-groupbykey-and-window-primitive>GroupByKey</a> primitive, -which has behavior governed by the windowing strategy.</p><h3 id=user-defined-functions-udfs>User-Defined Functions (UDFs)</h3><p>Beam has seven varieties of user-defined function (UDF). A Beam pipeline +deserialize user data. The Beam model operates principally on encoded data, +“just bytes”. Each <code>PCollection</code> has a declared encoding for its elements, +called a coder. A coder has a URN that identifies the encoding, and might have +additional sub-coders. For example, a coder for lists might contain a coder for +the elements of the list. Language-specific serialization techniques are +frequently used, but there are a few common key formats (such as key-value pairs +and timestamps) so the runner can understand them.</p><h4 id=windowing-strategy>Windowing strategy</h4><p>Every <code>PCollection</code> has a windowing strategy, which is a specification of +essential information for grouping and triggering operations. 
The <code>Window</code> +transform sets up the windowing strategy, and the <code>GroupByKey</code> transform has +behavior that is governed by the windowing strategy.</p><br><p>For more information about PCollections, see the following page:</p><ul><li><a href=/documentation/programming-guide/#pcollections>Beam Programming Guide: PCollections</a></li></ul><h3 id=user-defined-functions-udfs>User-Defined Functions (UDFs)</h3><p>Beam has seven varieties of user-defined function (UDF). A Beam pipeline may contain UDFs written in a language other than your runner, or even multiple languages in the same pipeline (see the <a href=#the-runner-api>Runner API</a>) so the definitions are language-independent (see the <a href=#the-fn-api>Fn API</a>).</p><p>The UDFs of Beam are:</p><ul><li><em>DoFn</em> - per-element processing function (used in ParDo)</li><li><em>WindowFn</em> - places elements in windows and merges windows (used in Window @@ -92,7 +131,7 @@ use code font for proper nouns in our APIs, whether or not the identifiers match across all SDKs.</p><p>The <code>run(Pipeline)</code> method should be asynchronous and results in a PipelineResult which generally will be a job descriptor for your data processing engine, providing methods for checking its status, canceling it, and -waiting for it to terminate.</p><div class=feedback><p class=update>Last updated on 2021/06/03</p><h3>Have you found everything you were looking for?</h3><p class=description>Was it all useful and clear? Is there anything that you would like to change? Let us know!</p><button class=load-button><a href="mailto:d...@beam.apache.org?subject=Beam Website Feedback">SEND FEEDBACK</a></button></div></div></div><footer class=footer><div class=footer__contained><div class=footer__cols><div class=" [...] +waiting for it to terminate.</p><div class=feedback><p class=update>Last updated on 2021/10/21</p><h3>Have you found everything you were looking for?</h3><p class=description>Was it all useful and clear? 
Is there anything that you would like to change? Let us know!</p><button class=load-button><a href="mailto:d...@beam.apache.org?subject=Beam Website Feedback">SEND FEEDBACK</a></button></div></div></div><footer class=footer><div class=footer__contained><div class=footer__cols><div class=" [...] <a href=http://www.apache.org>The Apache Software Foundation</a> | <a href=/privacy_policy>Privacy Policy</a> | <a href=/feed.xml>RSS Feed</a><br><br>Apache Beam, Apache, Beam, the Beam logo, and the Apache feather logo are either registered trademarks or trademarks of The Apache Software Foundation. All other products or name brands are trademarks of their respective holders, including The Apache Software Foundation.</div></div></div></div></footer></body></html> \ No newline at end of file diff --git a/website/generated-content/documentation/index.html b/website/generated-content/documentation/index.html index 3dcde7e..ebd68a2 100644 --- a/website/generated-content/documentation/index.html +++ b/website/generated-content/documentation/index.html @@ -18,9 +18,32 @@ function addPlaceholder(){$('input:text').attr('placeholder',"What are you looking for?");} function endSearch(){var search=document.querySelector(".searchBar");search.classList.add("disappear");var icons=document.querySelector("#iconsBar");icons.classList.remove("disappear");} function blockScroll(){$("body").toggleClass("fixedPosition");} -function openMenu(){addPlaceholder();blockScroll();}</script><div class="clearfix container-main-content"><div class="section-nav closed" data-offset-top=90 data-offset-bottom=500><span class="section-nav-back glyphicon glyphicon-menu-left"></span><nav><ul class=section-nav-list data-section-nav><li><span class=section-nav-list-main-title>Documentation</span></li><li><a href=/documentation>Using the Documentation</a></li><li class=section-nav-item--collapsible><span class=section-nav-lis [...] 
+function openMenu(){addPlaceholder();blockScroll();}</script><div class="clearfix container-main-content"><div class="section-nav closed" data-offset-top=90 data-offset-bottom=500><span class="section-nav-back glyphicon glyphicon-menu-left"></span><nav><ul class=section-nav-list data-section-nav><li><span class=section-nav-list-main-title>Documentation</span></li><li><a href=/documentation>Using the Documentation</a></li><li class=section-nav-item--collapsible><span class=section-nav-lis [...] +the Beam programming model, SDKs, and runners.</p><h2 id=concepts>Concepts</h2><p>Learn about the Beam Programming Model and the concepts common to all Beam SDKs +and Runners.</p><ul><li>Start with the <a href=/documentation/basics/>Basics of the Beam model</a> for +introductory conceptual information.</li><li>Read the <a href=/documentation/programming-guide/>Programming Guide</a>, which +has more detailed information about the Beam concepts and provides code +snippets.</li><li>Learn about Beam’s <a href=/documentation/runtime/model>execution model</a> to better +understand how pipelines execute.</li><li>Visit <a href=/documentation/resources/learning-resources>Learning Resources</a> for +some of our favorite articles and talks about Beam.</li><li>Reference the <a href=/documentation/glossary>glossary</a> to learn the terminology of the +Beam programming model.</li></ul><h2 id=pipeline-fundamentals>Pipeline Fundamentals</h2><ul><li><a href=/documentation/pipelines/design-your-pipeline/>Design Your Pipeline</a> by +planning your pipeline’s structure, choosing transforms to apply to your data, +and determining your input and output methods.</li><li><a href=/documentation/pipelines/create-your-pipeline/>Create Your Pipeline</a> using +the classes in the Beam SDKs.</li><li><a href=/documentation/pipelines/test-your-pipeline/>Test Your Pipeline</a> to minimize +debugging a pipeline’s remote execution.</li></ul><h2 id=sdks>SDKs</h2><p>Find status and reference information on all 
of the available Beam SDKs.</p><div class=sdks><div class=item-description><a href=/documentation/sdks/java/ class=font-weight-bold>Java SDK</a><p></p></div><div class=item-description><a href=/documentation/sdks/python/ class=font-weight-bold>Python SDK</a><p></p></div><div class=item-description><a href=/documentation/sdks/go/ class=font-weight-bold>Go SDK</a><p></p></ [...] +built-in transforms.</p><ul><li><a href=/documentation/transforms/java/overview/>Java transform catalog</a></li><li><a href=/documentation/transforms/python/overview/>Python transform catalog</a></li></ul><h2 id=runners>Runners</h2><p>A Beam Runner runs a Beam pipeline on a specific (often distributed) data +processing system.</p><h3 id=available-runners>Available Runners</h3><div class="documentation-list mobile-column"><div class=row><div class=column><div class=item-icon><svg width="67" height="84" viewBox="0 0 67 84" fill="none" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink"><rect x=".500977" y=".450195" width="188.235" height="83.4223" fill="url(#pattern5)"/><defs><pattern id="pattern5" patternContentUnits="objectBoundingBox" width="1" height="1"><use xlin [...] button.innerHTML="+ SHOW MORE";else -button.innerHTML="- SHOW LESS";}</script><h3 id=choosing-a-runner>Choosing a Runner</h3><p>Beam is designed to enable pipelines to be portable across different runners. However, given every runner has different capabilities, they also have different abilities to implement the core concepts in the Beam model. The <a href=/documentation/runners/capability-matrix/>Capability Matrix</a> provides a detailed comparison of runner functionality.</p><p>Once you have chosen which runner to use, se [...] +button.innerHTML="- SHOW LESS";}</script><h3 id=choosing-a-runner>Choosing a Runner</h3><p>Beam is designed to enable pipelines to be portable across different runners. 
+However, because every runner has different capabilities, they also differ in +their ability to implement the core concepts in the Beam model. The +<a href=/documentation/runners/capability-matrix/>Capability Matrix</a> provides a +detailed comparison of runner functionality.</p><p>Once you have chosen which runner to use, see that runner’s page for more +information about any initial runner-specific setup as well as any required or +optional <code>PipelineOptions</code> for configuring its execution. You might also want to +refer back to the Quickstart for <a href=/get-started/quickstart-java>Java</a>, +<a href=/get-started/quickstart-py>Python</a> or <a href=/get-started/quickstart-go>Go</a> for +instructions on executing the sample WordCount pipeline.</p><div class=feedback><p class=update>Last updated on 2021/10/21</p><h3>Have you found everything you were looking for?</h3><p class=description>Was it all useful and clear? Is there anything that you would like to change? Let us know!</p><button class=load-button><a href="mailto:d...@beam.apache.org?subject=Beam Website Feedback">SEND FEEDBACK</a></button></div></div></div><footer class=footer><div class=footer__contained><div cla [...]
<a href=http://www.apache.org>The Apache Software Foundation</a> | <a href=/privacy_policy>Privacy Policy</a> | <a href=/feed.xml>RSS Feed</a><br><br>Apache Beam, Apache, Beam, the Beam logo, and the Apache feather logo are either registered trademarks or trademarks of The Apache Software Foundation.
All other products or name brands are trademarks of their respective holders, including The Apache Software Foundation.</div></div></div></div></footer></body></html> \ No newline at end of file diff --git a/website/generated-content/documentation/index.xml b/website/generated-content/documentation/index.xml index 17ed239..48ba7f5 100644 --- a/website/generated-content/documentation/index.xml +++ b/website/generated-content/documentation/index.xml @@ -3185,11 +3185,14 @@ of operations. You want to integrate it with the Beam ecosystem to get access to other languages, great event time processing, and a library of connectors. You need to know the core vocabulary:</p> <ul> -<li><a href="#pipeline"><em>Pipeline</em></a> - A pipeline is a graph of transformations that a user constructs -that defines the data processing they want to do.</li> -<li><a href="#pcollections"><em>PCollection</em></a> - Data being processed in a pipeline is part of a PCollection.</li> -<li><a href="#ptransforms"><em>PTransforms</em></a> - The operations executed within a pipeline. These are best -thought of as operations on PCollections.</li> +<li><a href="#pipeline"><em>Pipeline</em></a> - A pipeline is a user-constructed graph of +transformations that defines the desired data processing operations.</li> +<li><a href="#pcollection"><em>PCollection</em></a> - A <code>PCollection</code> is a data set or data +stream. The data that a pipeline processes is part of a PCollection.</li> +<li><a href="#ptransform"><em>PTransform</em></a> - A <code>PTransform</code> (or <em>transform</em>) represents a +data processing operation, or a step, in your pipeline. 
A transform is +applied to zero or more <code>PCollection</code> objects, and produces zero or more +<code>PCollection</code> objects.</li> <li><em>SDK</em> - A language-specific library for pipeline authors (we often call them &ldquo;users&rdquo; even though we have many kinds of users) to build transforms, construct their pipelines and submit them to a runner</li> @@ -3201,79 +3204,146 @@ processing engine.</li> Beam&rsquo;s design is for cross-language operation and reusable libraries of transforms, there are some special features worth highlighting.</p> <h3 id="pipeline">Pipeline</h3> -<p>A pipeline in Beam is a graph of PTransforms operating on PCollections. A -pipeline is constructed by a user in their SDK of choice, and makes its way to -your runner either via the SDK directly or via the Runner API&rsquo;s -RPC interfaces.</p> -<h3 id="ptransforms">PTransforms</h3> -<p>A <code>PTransform</code> represents a data processing operation, or a step, -in your pipeline. A <code>PTransform</code> can be applied to one or more -<code>PCollection</code> objects as input which performs some processing on the elements of that -<code>PCollection</code> and produces zero or more output <code>PCollection</code> objects.</p> -<h3 id="pcollections">PCollections</h3> -<p>A PCollection is an unordered bag of elements. Your runner will be responsible -for storing these elements. There are some major aspects of a PCollection to -note:</p> -<h4 id="bounded-vs-unbounded">Bounded vs Unbounded</h4> -<p>A PCollection may be bounded or unbounded.</p> +<p>A Beam pipeline is a graph (specifically, a +<a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph">directed acyclic graph</a>) +of all the data and computations in your data processing task. This includes +reading input data, transforming that data, and writing output data. A pipeline +is constructed by a user in their SDK of choice. 
Then, the pipeline makes its +way to the runner either through the SDK directly or through the Runner API&rsquo;s +RPC interface. For example, this diagram shows a branching pipeline:</p> +<p><img src="/images/design-your-pipeline-multiple-pcollections.svg" alt="The pipeline applies two transforms to a single input collection. Each transform produces an output collection."></p> +<p>In this diagram, the boxes represent the parallel computations called +<a href="#ptransform"><em>PTransforms</em></a> and the arrows with the circles represent the data +(in the form of <a href="#pcollection"><em>PCollections</em></a>) that flows between the +transforms. The data might be bounded, stored data sets, or the data might also +be unbounded streams of data. In Beam, most transforms apply equally to bounded +and unbounded data.</p> +<p>You can express almost any computation that you can think of as a graph as a +Beam pipeline. A Beam driver program typically starts by creating a <code>Pipeline</code> +object, and then uses that object as the basis for creating the pipeline’s data +sets and its transforms.</p> +<p>For more information about pipelines, see the following pages:</p> +<ul> +<li><a href="/documentation/programming-guide/#overview">Beam Programming Guide: Overview</a></li> +<li><a href="/documentation/programming-guide/#creating-a-pipeline">Beam Programming Guide: Creating a pipeline</a></li> +<li><a href="/documentation/pipelines/design-your-pipeline">Design your pipeline</a></li> +<li><a href="/documentation/pipeline/create-your-pipeline">Create your pipeline</a></li> +</ul> +<h3 id="ptransform">PTransform</h3> +<p>A <code>PTransform</code> (or transform) represents a data processing operation, or a step, +in your pipeline. A transform is usually applied to one or more input +<code>PCollection</code> objects.
Transforms that read input are an exception; these +transforms might not have an input <code>PCollection</code>.</p> +<p>You provide transform processing logic in the form of a function object +(colloquially referred to as “user code”), and your user code is applied to each +element of the input PCollection (or more than one PCollection). Depending on +the pipeline runner and backend that you choose, many different workers across a +cluster might execute instances of your user code in parallel. The user code +that runs on each worker generates the output elements that are added to zero or +more output <code>PCollection</code> objects.</p> +<p>The Beam SDKs contain a number of different transforms that you can apply to +your pipeline’s PCollections. These include general-purpose core transforms, +such as <code>ParDo</code> or <code>Combine</code>. There are also pre-written composite transforms +included in the SDKs, which combine one or more of the core transforms in a +useful processing pattern, such as counting or combining elements in a +collection. You can also define your own more complex composite transforms to +fit your pipeline’s exact use case.</p> +<p>The following list has some common transform types:</p> +<ul> +<li>Source transforms such as <code>TextIO.Read</code> and <code>Create</code>. 
A source transform +conceptually has no input.</li> +<li>Processing and conversion operations such as <code>ParDo</code>, <code>GroupByKey</code>, +<code>CoGroupByKey</code>, <code>Combine</code>, and <code>Count</code>.</li> +<li>Outputting transforms such as <code>TextIO.Write</code>.</li> +<li>User-defined, application-specific composite transforms.</li> +</ul> +<p>For more information about transforms, see the following pages:</p> <ul> -<li><em>Bounded</em> - it is finite and you know it, as in batch use cases</li> -<li><em>Unbounded</em> - it may be never end, you don&rsquo;t know, as in streaming use cases</li> +<li><a href="/documentation/programming-guide/#overview">Beam Programming Guide: Overview</a></li> +<li><a href="/documentation/programming-guide/#transforms">Beam Programming Guide: Transforms</a></li> +<li>Beam transform catalog (<a href="/documentation/transforms/java/overview/">Java</a>, +<a href="/documentation/transforms/python/overview/">Python</a>)</li> </ul> -<p>These derive from the intuitions of batch and stream processing, but the two -are unified in Beam and bounded and unbounded PCollections can coexist in the -same pipeline. If your runner can only support bounded PCollections, you&rsquo;ll -need to reject pipelines that contain unbounded PCollections. If your -runner is only really targeting streams, there are adapters in our support code -to convert everything to APIs targeting unbounded data.</p> +<h3 id="pcollection">PCollection</h3> +<p>A <code>PCollection</code> is an unordered bag of elements. Each <code>PCollection</code> is a +potentially distributed, homogeneous data set or data stream, and is owned by +the specific <code>Pipeline</code> object for which it is created. Multiple pipelines +cannot share a <code>PCollection</code>. 
Beam pipelines process PCollections, and the +runner is responsible for storing these elements.</p> +<p>A <code>PCollection</code> generally contains &ldquo;big data&rdquo; (too much data to fit in memory on +a single machine). Sometimes a small sample of data or an intermediate result +might fit into memory on a single machine, but Beam&rsquo;s computational patterns and +transforms are focused on situations where distributed data-parallel computation +is required. Therefore, the elements of a <code>PCollection</code> cannot be processed +individually, and are instead processed uniformly in parallel.</p> +<p>The following characteristics of a <code>PCollection</code> are important to know.</p> +<h4 id="bounded-vs-unbounded">Bounded vs unbounded</h4> +<p>A <code>PCollection</code> can be either bounded or unbounded.</p> +<ul> +<li>A <em>bounded</em> <code>PCollection</code> is a dataset of a known, fixed size (alternatively, +a dataset that is not growing over time). Bounded data can be processed by +batch pipelines.</li> +<li>An <em>unbounded</em> <code>PCollection</code> is a dataset that grows over time, and the +elements are processed as they arrive. Unbounded data must be processed by +streaming pipelines.</li> +</ul> +<p>These two categories derive from the intuitions of batch and stream processing, +but the two are unified in Beam and bounded and unbounded PCollections can +coexist in the same pipeline. If your runner can only support bounded +PCollections, you must reject pipelines that contain unbounded PCollections. If +your runner is only targeting streams, there are adapters in Beam&rsquo;s support code +to convert everything to APIs that target unbounded data.</p> <h4 id="timestamps">Timestamps</h4> -<p>Every element in a PCollection has a timestamp associated with it.</p> -<p>When you execute a primitive connector to some storage system, that connector -is responsible for providing initial timestamps. 
Your runner will need to -propagate and aggregate timestamps. If the timestamp is not important, as with -certain batch processing jobs where elements do not denote events, they will be -the minimum representable timestamp, often referred to colloquially as -&ldquo;negative infinity&rdquo;.</p> +<p>Every element in a <code>PCollection</code> has a timestamp associated with it.</p> +<p>When you execute a primitive connector to a storage system, that connector is +responsible for providing initial timestamps. The runner must propagate and +aggregate timestamps. If the timestamp is not important, such as with certain +batch processing jobs where elements do not denote events, the timestamp will be +the minimum representable timestamp, often referred to colloquially as &ldquo;negative +infinity&rdquo;.</p> <h4 id="watermarks">Watermarks</h4> -<p>Every PCollection has to have a watermark that estimates how complete the -PCollection is.</p> +<p>Every <code>PCollection</code> must have a watermark that estimates how complete the +<code>PCollection</code> is.</p> <p>The watermark is a guess that &ldquo;we&rsquo;ll never see an element with an earlier -timestamp&rdquo;. Sources of data are responsible for producing a watermark. Your -runner needs to implement watermark propagation as PCollections are processed, -merged, and partitioned.</p> -<p>The contents of a PCollection are complete when a watermark advances to -&ldquo;infinity&rdquo;. In this manner, you may discover that an unbounded PCollection is +timestamp&rdquo;. Data sources are responsible for producing a watermark. The runner +must implement watermark propagation as PCollections are processed, merged, and +partitioned.</p> +<p>The contents of a <code>PCollection</code> are complete when a watermark advances to +&ldquo;infinity&rdquo;. 
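The watermark idea above can be made concrete with a small plain-Python model. This is a hedged conceptual sketch (not runner or apache_beam code; timestamps are bare integers here): the watermark is a lower bound over the timestamps of elements that have been seen but not yet processed, and it advances to "infinity" when nothing is pending:

```python
# Conceptual sketch: a watermark as an estimate of completeness.
# NOT the apache_beam API; timestamps are plain integers here.

INFINITY = float("inf")

class WatermarkTracker:
    def __init__(self):
        # Timestamps of elements seen but not yet processed.
        self.pending = []

    def receive(self, timestamp):
        self.pending.append(timestamp)

    def process_all(self):
        self.pending.clear()

    @property
    def watermark(self):
        # The guess: "we'll never see an element earlier than this".
        # With nothing pending, the source can declare completeness.
        return min(self.pending) if self.pending else INFINITY

tracker = WatermarkTracker()
tracker.receive(5)
tracker.receive(3)
print(tracker.watermark)  # 3: an element at timestamp 3 is still pending

tracker.process_all()
print(tracker.watermark)  # inf: the collection is complete
```

A real runner does considerably more, propagating watermarks through every merge and partition of the pipeline, but the invariant is the same: the watermark never claims a timestamp earlier than data that may still arrive.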
In this manner, you can discover that an unbounded PCollection is finite.</p> <h4 id="windowed-elements">Windowed elements</h4> -<p>Every element in a PCollection resides in a window. No element resides in -multiple windows (two elements can be equal except for their window, but they -are not the same).</p> -<p>When elements are read from the outside world they arrive in the global window. +<p>Every element in a <code>PCollection</code> resides in a window. No element resides in +multiple windows; two elements can be equal except for their window, but they +are not the same.</p> +<p>When elements are read from the outside world, they arrive in the global window. When they are written to the outside world, they are effectively placed back -into the global window (any writing transform that doesn&rsquo;t take this -perspective probably risks data loss).</p> -<p>A window has a maximum timestamp, and when the watermark exceeds this plus -user-specified allowed lateness the window is expired. All data related -to an expired window may be discarded at any time.</p> +into the global window. Transforms that write data and don&rsquo;t take this +perspective probably risk data loss.</p> +<p>A window has a maximum timestamp. When the watermark exceeds the maximum +timestamp plus the user-specified allowed lateness, the window is expired. All +data related to an expired window might be discarded at any time.</p> <h4 id="coder">Coder</h4> -<p>Every PCollection has a coder, a specification of the binary format of the elements.</p> -<p>In Beam, the user&rsquo;s pipeline may be written in a language other than the +<p>Every <code>PCollection</code> has a coder, which is a specification of the binary format +of the elements.</p> +<p>In Beam, the user&rsquo;s pipeline can be written in a language other than the language of the runner. There is no expectation that the runner can actually -deserialize user data.
So the Beam model operates principally on encoded data - -&ldquo;just bytes&rdquo;. Each PCollection has a declared encoding for its elements, called -a coder. A coder has a URN that identifies the encoding, and may have -additional sub-coders (for example, a coder for lists may contain a coder for -the elements of the list). Language-specific serialization techniques can, and -frequently are used, but there are a few key formats - such as key-value pairs -and timestamps - that are common so your runner can understand them.</p> -<h4 id="windowing-strategy">Windowing Strategy</h4> -<p>Every PCollection has a windowing strategy, a specification of essential -information for grouping and triggering operations.</p> -<p>The details will be discussed below when we discuss the -<a href="#implementing-the-window-primitive">Window</a> primitive, which sets up the -windowing strategy, and -<a href="#implementing-the-groupbykey-and-window-primitive">GroupByKey</a> primitive, -which has behavior governed by the windowing strategy.</p> +deserialize user data. The Beam model operates principally on encoded data, +&ldquo;just bytes&rdquo;. Each <code>PCollection</code> has a declared encoding for its elements, +called a coder. A coder has a URN that identifies the encoding, and might have +additional sub-coders. For example, a coder for lists might contain a coder for +the elements of the list. Language-specific serialization techniques are +frequently used, but there are a few common key formats (such as key-value pairs +and timestamps) so the runner can understand them.</p> +<h4 id="windowing-strategy">Windowing strategy</h4> +<p>Every <code>PCollection</code> has a windowing strategy, which is a specification of +essential information for grouping and triggering operations. 
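To make the "just bytes" idea of a coder concrete, here is a minimal sketch in plain Python. This illustrates the concept only and is not Beam's coders API: a composite key-value coder built from a string sub-coder, with a length prefix so the decoder can recover the boundary between key and value:

```python
import struct

# Conceptual sketch of a coder: a declared binary encoding for elements.
# NOT the apache_beam coders API; the length-prefix framing is illustrative.

def encode_str(s):
    data = s.encode("utf-8")
    # 4-byte big-endian length prefix, then the UTF-8 bytes.
    return struct.pack(">I", len(data)) + data

def decode_str(buf, offset=0):
    (length,) = struct.unpack_from(">I", buf, offset)
    start = offset + 4
    return buf[start:start + length].decode("utf-8"), start + length

def encode_kv(key, value):
    # A composite coder: composed of sub-coders for key and value.
    return encode_str(key) + encode_str(value)

def decode_kv(buf):
    key, offset = decode_str(buf)
    value, _ = decode_str(buf, offset)
    return key, value

encoded = encode_kv("user", "alice")
print(decode_kv(encoded))  # ('user', 'alice')
```

Because both sides agree only on the byte layout, an encoder written in one language and a decoder in another can interoperate, which is exactly why a runner can shuffle and group key-value pairs without ever deserializing the user's element types.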
The <code>Window</code> +transform sets up the windowing strategy, and the <code>GroupByKey</code> transform has +behavior that is governed by the windowing strategy.</p> +<br/> +<p>For more information about PCollections, see the following page:</p> +<ul> +<li><a href="/documentation/programming-guide/#pcollections">Beam Programming Guide: PCollections</a></li> +</ul> <h3 id="user-defined-functions-udfs">User-Defined Functions (UDFs)</h3> <p>Beam has seven varieties of user-defined function (UDF). A Beam pipeline may contain UDFs written in a language other than your runner, or even multiple diff --git a/website/generated-content/sitemap.xml b/website/generated-content/sitemap.xml index b9c3307..5fed307 100644 --- a/website/generated-content/sitemap.xml +++ b/website/generated-content/sitemap.xml @@ -1 +1 @@ -<?xml version="1.0" encoding="utf-8" standalone="yes"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml"><url><loc>/blog/beam-2.33.0/</loc><lastmod>2021-10-11T18:22:03-07:00</lastmod></url><url><loc>/categories/blog/</loc><lastmod>2021-10-11T18:22:03-07:00</lastmod></url><url><loc>/blog/</loc><lastmod>2021-10-11T18:22:03-07:00</lastmod></url><url><loc>/categories/</loc><lastmod>2021-10-11T18:22:03-07:00</lastmod></url><url><loc>/blog/b [...] \ No newline at end of file +<?xml version="1.0" encoding="utf-8" standalone="yes"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml"><url><loc>/blog/beam-2.33.0/</loc><lastmod>2021-10-11T18:22:03-07:00</lastmod></url><url><loc>/categories/blog/</loc><lastmod>2021-10-11T18:22:03-07:00</lastmod></url><url><loc>/blog/</loc><lastmod>2021-10-11T18:22:03-07:00</lastmod></url><url><loc>/categories/</loc><lastmod>2021-10-11T18:22:03-07:00</lastmod></url><url><loc>/blog/b [...] \ No newline at end of file