This is an automated email from the ASF dual-hosted git repository. git-site-role pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/beam.git
The following commit(s) were added to refs/heads/asf-site by this push: new 017fed1 Publishing website 2021/10/22 00:01:40 at commit 3d8213a 017fed1 is described below commit 017fed11ef4145b5e545b8fd1017eddfbc74fa2b Author: jenkins <bui...@apache.org> AuthorDate: Fri Oct 22 00:01:41 2021 +0000 Publishing website 2021/10/22 00:01:40 at commit 3d8213a --- .../documentation/basics/index.html | 135 +++++++++----- website/generated-content/documentation/index.html | 27 ++- website/generated-content/documentation/index.xml | 206 ++++++++++++++------- website/generated-content/sitemap.xml | 2 +- 4 files changed, 251 insertions(+), 119 deletions(-) diff --git a/website/generated-content/documentation/basics/index.html b/website/generated-content/documentation/basics/index.html index 13606d4..3de8287 100644 --- a/website/generated-content/documentation/basics/index.html +++ b/website/generated-content/documentation/basics/index.html @@ -18,62 +18,101 @@ function addPlaceholder(){$('input:text').attr('placeholder',"What are you looking for?");} function endSearch(){var search=document.querySelector(".searchBar");search.classList.add("disappear");var icons=document.querySelector("#iconsBar");icons.classList.remove("disappear");} function blockScroll(){$("body").toggleClass("fixedPosition");} -function openMenu(){addPlaceholder();blockScroll();}</script><div class="clearfix container-main-content"><div class="section-nav closed" data-offset-top=90 data-offset-bottom=500><span class="section-nav-back glyphicon glyphicon-menu-left"></span><nav><ul class=section-nav-list data-section-nav><li><span class=section-nav-list-main-title>Documentation</span></li><li><a href=/documentation>Using the Documentation</a></li><li class=section-nav-item--collapsible><span class=section-nav-lis [...] 
+function openMenu(){addPlaceholder();blockScroll();}</script><div class="clearfix container-main-content"><div class="section-nav closed" data-offset-top=90 data-offset-bottom=500><span class="section-nav-back glyphicon glyphicon-menu-left"></span><nav><ul class=section-nav-list data-section-nav><li><span class=section-nav-list-main-title>Documentation</span></li><li><a href=/documentation>Using the Documentation</a></li><li class=section-nav-item--collapsible><span class=section-nav-lis [...] of operations. You want to integrate it with the Beam ecosystem to get access to other languages, great event time processing, and a library of connectors. -You need to know the core vocabulary:</p><ul><li><a href=#pipeline><em>Pipeline</em></a> - A pipeline is a graph of transformations that a user constructs -that defines the data processing they want to do.</li><li><a href=#pcollections><em>PCollection</em></a> - Data being processed in a pipeline is part of a PCollection.</li><li><a href=#ptransforms><em>PTransforms</em></a> - The operations executed within a pipeline. These are best -thought of as operations on PCollections.</li><li><em>SDK</em> - A language-specific library for pipeline authors (we often call them +You need to know the core vocabulary:</p><ul><li><a href=#pipeline><em>Pipeline</em></a> - A pipeline is a user-constructed graph of +transformations that defines the desired data processing operations.</li><li><a href=#pcollection><em>PCollection</em></a> - A <code>PCollection</code> is a data set or data +stream. The data that a pipeline processes is part of a PCollection.</li><li><a href=#ptransform><em>PTransform</em></a> - A <code>PTransform</code> (or <em>transform</em>) represents a +data processing operation, or a step, in your pipeline. 
A transform is +applied to zero or more <code>PCollection</code> objects, and produces zero or more +<code>PCollection</code> objects.</li><li><em>SDK</em> - A language-specific library for pipeline authors (we often call them “users” even though we have many kinds of users) to build transforms, construct their pipelines and submit them to a runner</li><li><em>Runner</em> - You are going to write a piece of software called a runner that takes a Beam pipeline and executes it using the capabilities of your data processing engine.</li></ul><p>These concepts may be very similar to your processing engine’s concepts. Since Beam’s design is for cross-language operation and reusable libraries of -transforms, there are some special features worth highlighting.</p><h3 id=pipeline>Pipeline</h3><p>A pipeline in Beam is a graph of PTransforms operating on PCollections. A -pipeline is constructed by a user in their SDK of choice, and makes its way to -your runner either via the SDK directly or via the Runner API’s -RPC interfaces.</p><h3 id=ptransforms>PTransforms</h3><p>A <code>PTransform</code> represents a data processing operation, or a step, -in your pipeline. A <code>PTransform</code> can be applied to one or more -<code>PCollection</code> objects as input which performs some processing on the elements of that -<code>PCollection</code> and produces zero or more output <code>PCollection</code> objects.</p><h3 id=pcollections>PCollections</h3><p>A PCollection is an unordered bag of elements. Your runner will be responsible -for storing these elements. 
There are some major aspects of a PCollection to -note:</p><h4 id=bounded-vs-unbounded>Bounded vs Unbounded</h4><p>A PCollection may be bounded or unbounded.</p><ul><li><em>Bounded</em> - it is finite and you know it, as in batch use cases</li><li><em>Unbounded</em> - it may be never end, you don’t know, as in streaming use cases</li></ul><p>These derive from the intuitions of batch and stream processing, but the two -are unified in Beam and bounded and unbounded PCollections can coexist in the -same pipeline. If your runner can only support bounded PCollections, you’ll -need to reject pipelines that contain unbounded PCollections. If your -runner is only really targeting streams, there are adapters in our support code -to convert everything to APIs targeting unbounded data.</p><h4 id=timestamps>Timestamps</h4><p>Every element in a PCollection has a timestamp associated with it.</p><p>When you execute a primitive connector to some storage system, that connector -is responsible for providing initial timestamps. Your runner will need to -propagate and aggregate timestamps. If the timestamp is not important, as with -certain batch processing jobs where elements do not denote events, they will be -the minimum representable timestamp, often referred to colloquially as -“negative infinity”.</p><h4 id=watermarks>Watermarks</h4><p>Every PCollection has to have a watermark that estimates how complete the -PCollection is.</p><p>The watermark is a guess that “we’ll never see an element with an earlier -timestamp”. Sources of data are responsible for producing a watermark. Your -runner needs to implement watermark propagation as PCollections are processed, -merged, and partitioned.</p><p>The contents of a PCollection are complete when a watermark advances to -“infinity”. In this manner, you may discover that an unbounded PCollection is -finite.</p><h4 id=windowed-elements>Windowed elements</h4><p>Every element in a PCollection resides in a window. 
No element resides in -multiple windows (two elements can be equal except for their window, but they -are not the same).</p><p>When elements are read from the outside world they arrive in the global window. +transforms, there are some special features worth highlighting.</p><h3 id=pipeline>Pipeline</h3><p>A Beam pipeline is a graph (specifically, a +<a href=https://en.wikipedia.org/wiki/Directed_acyclic_graph>directed acyclic graph</a>) +of all the data and computations in your data processing task. This includes +reading input data, transforming that data, and writing output data. A pipeline +is constructed by a user in their SDK of choice. Then, the pipeline makes its +way to the runner either through the SDK directly or through the Runner API’s +RPC interface. For example, this diagram shows a branching pipeline:</p><p><img src=/images/design-your-pipeline-multiple-pcollections.svg alt="The pipeline applies two transforms to a single input collection. Each transform produces an output collection."></p><p>In this diagram, the boxes represent the parallel computations called +<a href=#ptransform><em>PTransforms</em></a> and the arrows with the circles represent the data +(in the form of <a href=#pcollection><em>PCollections</em></a>) that flows between the +transforms. The data might be bounded, stored data sets, or the data might also +be unbounded streams of data. In Beam, most transforms apply equally to bounded +and unbounded data.</p><p>You can express almost any computation that you can think of as a graph as a +Beam pipeline.
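To make the graph idea concrete, here is a minimal plain-Python sketch of a branching pipeline. This is a conceptual model only, not the Beam SDK API; the function `apply_transform` and the sample data are illustrative.

```python
# Conceptual model of a branching pipeline: one input PCollection,
# two independent PTransforms, two output PCollections.
# This sketches the graph shape only; it is NOT the Beam SDK API.

def apply_transform(pcollection, fn):
    """A 'transform': consumes one collection, produces a new one."""
    return [fn(element) for element in pcollection]

names = ["Alice", "Andrew", "Bob"]                       # input PCollection
upper = apply_transform(names, str.upper)                # branch 1
greetings = apply_transform(names, lambda n: "Hi " + n)  # branch 2
```

Note that neither branch mutates `names`; each transform produces a new collection, mirroring how PCollections are immutable in the Beam model.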
A Beam driver program typically starts by creating a <code>Pipeline</code> +object, and then uses that object as the basis for creating the pipeline’s data +sets and its transforms.</p><p>For more information about pipelines, see the following pages:</p><ul><li><a href=/documentation/programming-guide/#overview>Beam Programming Guide: Overview</a></li><li><a href=/documentation/programming-guide/#creating-a-pipeline>Beam Programming Guide: Creating a pipeline</a></li><li><a href=/documentation/pipelines/design-your-pipeline>Design your pipeline</a></li><li><a href=/documentation/pipeline/create-your-pipeline>Create your pipeline</a></li></ul [...] +in your pipeline. A transform is usually applied to one or more input +<code>PCollection</code> objects. Transforms that read input are an exception; these +transforms might not have an input <code>PCollection</code>.</p><p>You provide transform processing logic in the form of a function object +(colloquially referred to as “user code”), and your user code is applied to each +element of the input PCollection (or more than one PCollection). Depending on +the pipeline runner and backend that you choose, many different workers across a +cluster might execute instances of your user code in parallel. The user code +that runs on each worker generates the output elements that are added to zero or +more output <code>PCollection</code> objects.</p><p>The Beam SDKs contain a number of different transforms that you can apply to +your pipeline’s PCollections. These include general-purpose core transforms, +such as <code>ParDo</code> or <code>Combine</code>. There are also pre-written composite transforms +included in the SDKs, which combine one or more of the core transforms in a +useful processing pattern, such as counting or combining elements in a +collection. 
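The per-element "user code" idea above can be sketched in plain Python. This is a conceptual model of ParDo-style processing, not the actual Beam API; `par_do` is an illustrative name.

```python
# Sketch of ParDo-style per-element processing: user code runs once per
# input element and may emit zero, one, or many output elements.
# Illustrative only; not the Beam SDK.

def par_do(pcollection, user_code):
    out = []
    for element in pcollection:
        out.extend(user_code(element))  # user code yields 0..n outputs
    return out

lines = ["the quick fox", "", "jumps"]
# str.split on an empty line emits nothing, showing the "zero outputs" case.
words = par_do(lines, lambda line: line.split())
```

In a real runner, many workers would execute this loop body in parallel over partitions of the input.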
You can also define your own more complex composite transforms to +fit your pipeline’s exact use case.</p><p>The following list has some common transform types:</p><ul><li>Source transforms such as <code>TextIO.Read</code> and <code>Create</code>. A source transform +conceptually has no input.</li><li>Processing and conversion operations such as <code>ParDo</code>, <code>GroupByKey</code>, +<code>CoGroupByKey</code>, <code>Combine</code>, and <code>Count</code>.</li><li>Outputting transforms such as <code>TextIO.Write</code>.</li><li>User-defined, application-specific composite transforms.</li></ul><p>For more information about transforms, see the following pages:</p><ul><li><a href=/documentation/programming-guide/#overview>Beam Programming Guide: Overview</a></li><li><a href=/documentation/programming-guide/#transforms>Beam Programming Guide: Transforms</a></li><li>Beam t [...] +<a href=/documentation/transforms/python/overview/>Python</a>)</li></ul><h3 id=pcollection>PCollection</h3><p>A <code>PCollection</code> is an unordered bag of elements. Each <code>PCollection</code> is a +potentially distributed, homogeneous data set or data stream, and is owned by +the specific <code>Pipeline</code> object for which it is created. Multiple pipelines +cannot share a <code>PCollection</code>. Beam pipelines process PCollections, and the +runner is responsible for storing these elements.</p><p>A <code>PCollection</code> generally contains “big data” (too much data to fit in memory on +a single machine). Sometimes a small sample of data or an intermediate result +might fit into memory on a single machine, but Beam’s computational patterns and +transforms are focused on situations where distributed data-parallel computation +is required. 
Therefore, the elements of a <code>PCollection</code> cannot be processed +individually, and are instead processed uniformly in parallel.</p><p>The following characteristics of a <code>PCollection</code> are important to know.</p><h4 id=bounded-vs-unbounded>Bounded vs unbounded</h4><p>A <code>PCollection</code> can be either bounded or unbounded.</p><ul><li>A <em>bounded</em> <code>PCollection</code> is a dataset of a known, fixed size (alternatively, +a dataset that is not growing over time). Bounded data can be processed by +batch pipelines.</li><li>An <em>unbounded</em> <code>PCollection</code> is a dataset that grows over time, and the +elements are processed as they arrive. Unbounded data must be processed by +streaming pipelines.</li></ul><p>These two categories derive from the intuitions of batch and stream processing, +but the two are unified in Beam and bounded and unbounded PCollections can +coexist in the same pipeline. If your runner can only support bounded +PCollections, you must reject pipelines that contain unbounded PCollections. If +your runner is only targeting streams, there are adapters in Beam’s support code +to convert everything to APIs that target unbounded data.</p><h4 id=timestamps>Timestamps</h4><p>Every element in a <code>PCollection</code> has a timestamp associated with it.</p><p>When you execute a primitive connector to a storage system, that connector is +responsible for providing initial timestamps. The runner must propagate and +aggregate timestamps. If the timestamp is not important, such as with certain +batch processing jobs where elements do not denote events, the timestamp will be +the minimum representable timestamp, often referred to colloquially as “negative +infinity”.</p><h4 id=watermarks>Watermarks</h4><p>Every <code>PCollection</code> must have a watermark that estimates how complete the +<code>PCollection</code> is.</p><p>The watermark is a guess that “we’ll never see an element with an earlier +timestamp”. 
Data sources are responsible for producing a watermark. The runner +must implement watermark propagation as PCollections are processed, merged, and +partitioned.</p><p>The contents of a <code>PCollection</code> are complete when a watermark advances to +“infinity”. In this manner, you can discover that an unbounded PCollection is +finite.</p><h4 id=windowed-elements>Windowed elements</h4><p>Every element in a <code>PCollection</code> resides in a window. No element resides in +multiple windows; two elements can be equal except for their window, but they +are not the same.</p><p>When elements are read from the outside world, they arrive in the global window. When they are written to the outside world, they are effectively placed back -into the global window (any writing transform that doesn’t take this -perspective probably risks data loss).</p><p>A window has a maximum timestamp, and when the watermark exceeds this plus -user-specified allowed lateness the window is expired. All data related -to an expired window may be discarded at any time.</p><h4 id=coder>Coder</h4><p>Every PCollection has a coder, a specification of the binary format of the elements.</p><p>In Beam, the user’s pipeline may be written in a language other than the +into the global window. Transforms that write data and don’t take this +perspective probably risk data loss.</p><p>A window has a maximum timestamp. When the watermark exceeds the maximum +timestamp plus the user-specified allowed lateness, the window is expired. All +data related to an expired window might be discarded at any time.</p><h4 id=coder>Coder</h4><p>Every <code>PCollection</code> has a coder, which is a specification of the binary format +of the elements.</p><p>In Beam, the user’s pipeline can be written in a language other than the language of the runner. There is no expectation that the runner can actually -deserialize user data. So the Beam model operates principally on encoded data - -“just bytes”.
Each PCollection has a declared encoding for its elements, called -a coder. A coder has a URN that identifies the encoding, and may have -additional sub-coders (for example, a coder for lists may contain a coder for -the elements of the list). Language-specific serialization techniques can, and -frequently are used, but there are a few key formats - such as key-value pairs -and timestamps - that are common so your runner can understand them.</p><h4 id=windowing-strategy>Windowing Strategy</h4><p>Every PCollection has a windowing strategy, a specification of essential -information for grouping and triggering operations.</p><p>The details will be discussed below when we discuss the -<a href=#implementing-the-window-primitive>Window</a> primitive, which sets up the -windowing strategy, and -<a href=#implementing-the-groupbykey-and-window-primitive>GroupByKey</a> primitive, -which has behavior governed by the windowing strategy.</p><h3 id=user-defined-functions-udfs>User-Defined Functions (UDFs)</h3><p>Beam has seven varieties of user-defined function (UDF). A Beam pipeline +deserialize user data. The Beam model operates principally on encoded data, +“just bytes”. Each <code>PCollection</code> has a declared encoding for its elements, +called a coder. A coder has a URN that identifies the encoding, and might have +additional sub-coders. For example, a coder for lists might contain a coder for +the elements of the list. Language-specific serialization techniques are +frequently used, but there are a few common key formats (such as key-value pairs +and timestamps) so the runner can understand them.</p><h4 id=windowing-strategy>Windowing strategy</h4><p>Every <code>PCollection</code> has a windowing strategy, which is a specification of +essential information for grouping and triggering operations. 
The <code>Window</code> +transform sets up the windowing strategy, and the <code>GroupByKey</code> transform has +behavior that is governed by the windowing strategy.</p><br><p>For more information about PCollections, see the following page:</p><ul><li><a href=/documentation/programming-guide/#pcollections>Beam Programming Guide: PCollections</a></li></ul><h3 id=user-defined-functions-udfs>User-Defined Functions (UDFs)</h3><p>Beam has seven varieties of user-defined function (UDF). A Beam pipeline may contain UDFs written in a language other than your runner, or even multiple languages in the same pipeline (see the <a href=#the-runner-api>Runner API</a>) so the definitions are language-independent (see the <a href=#the-fn-api>Fn API</a>).</p><p>The UDFs of Beam are:</p><ul><li><em>DoFn</em> - per-element processing function (used in ParDo)</li><li><em>WindowFn</em> - places elements in windows and merges windows (used in Window @@ -92,7 +131,7 @@ use code font for proper nouns in our APIs, whether or not the identifiers match across all SDKs.</p><p>The <code>run(Pipeline)</code> method should be asynchronous and results in a PipelineResult which generally will be a job descriptor for your data processing engine, providing methods for checking its status, canceling it, and -waiting for it to terminate.</p><div class=feedback><p class=update>Last updated on 2021/06/03</p><h3>Have you found everything you were looking for?</h3><p class=description>Was it all useful and clear? Is there anything that you would like to change? Let us know!</p><button class=load-button><a href="mailto:d...@beam.apache.org?subject=Beam Website Feedback">SEND FEEDBACK</a></button></div></div></div><footer class=footer><div class=footer__contained><div class=footer__cols><div class=" [...] +waiting for it to terminate.</p><div class=feedback><p class=update>Last updated on 2021/10/21</p><h3>Have you found everything you were looking for?</h3><p class=description>Was it all useful and clear? 
Is there anything that you would like to change? Let us know!</p><button class=load-button><a href="mailto:d...@beam.apache.org?subject=Beam Website Feedback">SEND FEEDBACK</a></button></div></div></div><footer class=footer><div class=footer__contained><div class=footer__cols><div class=" [...] <a href=http://www.apache.org>The Apache Software Foundation</a> | <a href=/privacy_policy>Privacy Policy</a> | <a href=/feed.xml>RSS Feed</a><br><br>Apache Beam, Apache, Beam, the Beam logo, and the Apache feather logo are either registered trademarks or trademarks of The Apache Software Foundation. All other products or name brands are trademarks of their respective holders, including The Apache Software Foundation.</div></div></div></div></footer></body></html> \ No newline at end of file diff --git a/website/generated-content/documentation/index.html b/website/generated-content/documentation/index.html index 3dcde7e..ebd68a2 100644 --- a/website/generated-content/documentation/index.html +++ b/website/generated-content/documentation/index.html @@ -18,9 +18,32 @@ function addPlaceholder(){$('input:text').attr('placeholder',"What are you looking for?");} function endSearch(){var search=document.querySelector(".searchBar");search.classList.add("disappear");var icons=document.querySelector("#iconsBar");icons.classList.remove("disappear");} function blockScroll(){$("body").toggleClass("fixedPosition");} -function openMenu(){addPlaceholder();blockScroll();}</script><div class="clearfix container-main-content"><div class="section-nav closed" data-offset-top=90 data-offset-bottom=500><span class="section-nav-back glyphicon glyphicon-menu-left"></span><nav><ul class=section-nav-list data-section-nav><li><span class=section-nav-list-main-title>Documentation</span></li><li><a href=/documentation>Using the Documentation</a></li><li class=section-nav-item--collapsible><span class=section-nav-lis [...] 
+function openMenu(){addPlaceholder();blockScroll();}</script><div class="clearfix container-main-content"><div class="section-nav closed" data-offset-top=90 data-offset-bottom=500><span class="section-nav-back glyphicon glyphicon-menu-left"></span><nav><ul class=section-nav-list data-section-nav><li><span class=section-nav-list-main-title>Documentation</span></li><li><a href=/documentation>Using the Documentation</a></li><li class=section-nav-item--collapsible><span class=section-nav-lis [...] +the Beam programming model, SDKs, and runners.</p><h2 id=concepts>Concepts</h2><p>Learn about the Beam Programming Model and the concepts common to all Beam SDKs +and Runners.</p><ul><li>Start with the <a href=/documentation/basics/>Basics of the Beam model</a> for +introductory conceptual information.</li><li>Read the <a href=/documentation/programming-guide/>Programming Guide</a>, which +has more detailed information about the Beam concepts and provides code +snippets.</li><li>Learn about Beam’s <a href=/documentation/runtime/model>execution model</a> to better +understand how pipelines execute.</li><li>Visit <a href=/documentation/resources/learning-resources>Learning Resources</a> for +some of our favorite articles and talks about Beam.</li><li>Reference the <a href=/documentation/glossary>glossary</a> to learn the terminology of the +Beam programming model.</li></ul><h2 id=pipeline-fundamentals>Pipeline Fundamentals</h2><ul><li><a href=/documentation/pipelines/design-your-pipeline/>Design Your Pipeline</a> by +planning your pipeline’s structure, choosing transforms to apply to your data, +and determining your input and output methods.</li><li><a href=/documentation/pipelines/create-your-pipeline/>Create Your Pipeline</a> using +the classes in the Beam SDKs.</li><li><a href=/documentation/pipelines/test-your-pipeline/>Test Your Pipeline</a> to minimize +debugging a pipeline’s remote execution.</li></ul><h2 id=sdks>SDKs</h2><p>Find status and reference information on all 
of the available Beam SDKs.</p><div class=sdks><div class=item-description><a href=/documentation/sdks/java/ class=font-weight-bold>Java SDK</a><p></p></div><div class=item-description><a href=/documentation/sdks/python/ class=font-weight-bold>Python SDK</a><p></p></div><div class=item-description><a href=/documentation/sdks/go/ class=font-weight-bold>Go SDK</a><p></p></ [...] +built-in transforms.</p><ul><li><a href=/documentation/transforms/java/overview/>Java transform catalog</a></li><li><a href=/documentation/transforms/python/overview/>Python transform catalog</a></li></ul><h2 id=runners>Runners</h2><p>A Beam Runner runs a Beam pipeline on a specific (often distributed) data +processing system.</p><h3 id=available-runners>Available Runners</h3><div class="documentation-list mobile-column"><div class=row><div class=column><div class=item-icon><svg width="67" height="84" viewBox="0 0 67 84" fill="none" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink"><rect x=".500977" y=".450195" width="188.235" height="83.4223" fill="url(#pattern5)"/><defs><pattern id="pattern5" patternContentUnits="objectBoundingBox" width="1" height="1"><use xlin [...] button.innerHTML="+ SHOW MORE";else -button.innerHTML="- SHOW LESS";}</script><h3 id=choosing-a-runner>Choosing a Runner</h3><p>Beam is designed to enable pipelines to be portable across different runners. However, given every runner has different capabilities, they also have different abilities to implement the core concepts in the Beam model. The <a href=/documentation/runners/capability-matrix/>Capability Matrix</a> provides a detailed comparison of runner functionality.</p><p>Once you have chosen which runner to use, se [...] +button.innerHTML="- SHOW LESS";}</script><h3 id=choosing-a-runner>Choosing a Runner</h3><p>Beam is designed to enable pipelines to be portable across different runners. 
+However, because every runner has different capabilities, they also differ in +their ability to implement the core concepts in the Beam model. The +<a href=/documentation/runners/capability-matrix/>Capability Matrix</a> provides a +detailed comparison of runner functionality.</p><p>Once you have chosen which runner to use, see that runner’s page for more +information about any initial runner-specific setup as well as any required or +optional <code>PipelineOptions</code> for configuring its execution. You might also want to +refer back to the Quickstart for <a href=/get-started/quickstart-java>Java</a>, +<a href=/get-started/quickstart-py>Python</a> or <a href=/get-started/quickstart-go>Go</a> for +instructions on executing the sample WordCount pipeline.</p><div class=feedback><p class=update>Last updated on 2021/10/21</p><h3>Have you found everything you were looking for?</h3><p class=description>Was it all useful and clear? Is there anything that you would like to change? Let us know!</p><button class=load-button><a href="mailto:d...@beam.apache.org?subject=Beam Website Feedback">SEND FEEDBACK</a></button></div></div></div><footer class=footer><div class=footer__contained><div cla [...]
<a href=http://www.apache.org>The Apache Software Foundation</a> | <a href=/privacy_policy>Privacy Policy</a> | <a href=/feed.xml>RSS Feed</a><br><br>Apache Beam, Apache, Beam, the Beam logo, and the Apache feather logo are either registered trademarks or trademarks of The Apache Software Foundation.
All other products or name brands are trademarks of their respective holders, including The Apache Software Foundation.</div></div></div></div></footer></body></html> \ No newline at end of file diff --git a/website/generated-content/documentation/index.xml b/website/generated-content/documentation/index.xml index 17ed239..48ba7f5 100644 --- a/website/generated-content/documentation/index.xml +++ b/website/generated-content/documentation/index.xml @@ -3185,11 +3185,14 @@ of operations. You want to integrate it with the Beam ecosystem to get access to other languages, great event time processing, and a library of connectors. You need to know the core vocabulary:</p> <ul> -<li><a href="#pipeline"><em>Pipeline</em></a> - A pipeline is a graph of transformations that a user constructs -that defines the data processing they want to do.</li> -<li><a href="#pcollections"><em>PCollection</em></a> - Data being processed in a pipeline is part of a PCollection.</li> -<li><a href="#ptransforms"><em>PTransforms</em></a> - The operations executed within a pipeline. These are best -thought of as operations on PCollections.</li> +<li><a href="#pipeline"><em>Pipeline</em></a> - A pipeline is a user-constructed graph of +transformations that defines the desired data processing operations.</li> +<li><a href="#pcollection"><em>PCollection</em></a> - A <code>PCollection</code> is a data set or data +stream. The data that a pipeline processes is part of a PCollection.</li> +<li><a href="#ptransform"><em>PTransform</em></a> - A <code>PTransform</code> (or <em>transform</em>) represents a +data processing operation, or a step, in your pipeline. 
A transform is +applied to zero or more <code>PCollection</code> objects, and produces zero or more +<code>PCollection</code> objects.</li> <li><em>SDK</em> - A language-specific library for pipeline authors (we often call them &ldquo;users&rdquo; even though we have many kinds of users) to build transforms, construct their pipelines and submit them to a runner</li> @@ -3201,79 +3204,146 @@ processing engine.</li> Beam&rsquo;s design is for cross-language operation and reusable libraries of transforms, there are some special features worth highlighting.</p> <h3 id="pipeline">Pipeline</h3> -<p>A pipeline in Beam is a graph of PTransforms operating on PCollections. A -pipeline is constructed by a user in their SDK of choice, and makes its way to -your runner either via the SDK directly or via the Runner API&rsquo;s -RPC interfaces.</p> -<h3 id="ptransforms">PTransforms</h3> -<p>A <code>PTransform</code> represents a data processing operation, or a step, -in your pipeline. A <code>PTransform</code> can be applied to one or more -<code>PCollection</code> objects as input which performs some processing on the elements of that -<code>PCollection</code> and produces zero or more output <code>PCollection</code> objects.</p> -<h3 id="pcollections">PCollections</h3> -<p>A PCollection is an unordered bag of elements. Your runner will be responsible -for storing these elements. There are some major aspects of a PCollection to -note:</p> -<h4 id="bounded-vs-unbounded">Bounded vs Unbounded</h4> -<p>A PCollection may be bounded or unbounded.</p> +<p>A Beam pipeline is a graph (specifically, a +<a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph">directed acyclic graph</a>) +of all the data and computations in your data processing task. This includes +reading input data, transforming that data, and writing output data. A pipeline +is constructed by a user in their SDK of choice. 
Then, the pipeline makes its +way to the runner either through the SDK directly or through the Runner API&rsquo;s +RPC interface. For example, this diagram shows a branching pipeline:</p> +<p><img src="/images/design-your-pipeline-multiple-pcollections.svg" alt="The pipeline applies two transforms to a single input collection. Each transform produces an output collection."></p> +<p>In this diagram, the boxes represent the parallel computations called +<a href="#ptransform"><em>PTransforms</em></a> and the arrows with the circles represent the data +(in the form of <a href="#pcollection"><em>PCollections</em></a>) that flows between the +transforms. The data might be bounded, stored data sets, or the data might also +be unbounded streams of data. In Beam, most transforms apply equally to bounded +and unbounded data.</p> +<p>You can express almost any computation that you can think of as a graph as a +Beam pipeline. A Beam driver program typically starts by creating a <code>Pipeline</code> +object, and then uses that object as the basis for creating the pipeline’s data +sets and its transforms.</p> +<p>For more information about pipelines, see the following pages:</p> +<ul> +<li><a href="/documentation/programming-guide/#overview">Beam Programming Guide: Overview</a></li> +<li><a href="/documentation/programming-guide/#creating-a-pipeline">Beam Programming Guide: Creating a pipeline</a></li> +<li><a href="/documentation/pipelines/design-your-pipeline">Design your pipeline</a></li> +<li><a href="/documentation/pipeline/create-your-pipeline">Create your pipeline</a></li> +</ul> +<h3 id="ptransform">PTransform</h3> +<p>A <code>PTransform</code> (or transform) represents a data processing operation, or a step, +in your pipeline. A transform is usually applied to one or more input +<code>PCollection</code> objects.
Transforms that read input are an exception; these +transforms might not have an input <code>PCollection</code>.</p> +<p>You provide transform processing logic in the form of a function object +(colloquially referred to as “user code”), and your user code is applied to each +element of the input PCollection (or more than one PCollection). Depending on +the pipeline runner and backend that you choose, many different workers across a +cluster might execute instances of your user code in parallel. The user code +that runs on each worker generates the output elements that are added to zero or +more output <code>PCollection</code> objects.</p> +<p>The Beam SDKs contain a number of different transforms that you can apply to +your pipeline’s PCollections. These include general-purpose core transforms, +such as <code>ParDo</code> or <code>Combine</code>. There are also pre-written composite transforms +included in the SDKs, which combine one or more of the core transforms in a +useful processing pattern, such as counting or combining elements in a +collection. You can also define your own more complex composite transforms to +fit your pipeline’s exact use case.</p> +<p>The following list has some common transform types:</p> +<ul> +<li>Source transforms such as <code>TextIO.Read</code> and <code>Create</code>. 
A source transform +conceptually has no input.</li> +<li>Processing and conversion operations such as <code>ParDo</code>, <code>GroupByKey</code>, +<code>CoGroupByKey</code>, <code>Combine</code>, and <code>Count</code>.</li> +<li>Outputting transforms such as <code>TextIO.Write</code>.</li> +<li>User-defined, application-specific composite transforms.</li> +</ul> +<p>For more information about transforms, see the following pages:</p> <ul> -<li><em>Bounded</em> - it is finite and you know it, as in batch use cases</li> -<li><em>Unbounded</em> - it may be never end, you don&rsquo;t know, as in streaming use cases</li> +<li><a href="/documentation/programming-guide/#overview">Beam Programming Guide: Overview</a></li> +<li><a href="/documentation/programming-guide/#transforms">Beam Programming Guide: Transforms</a></li> +<li>Beam transform catalog (<a href="/documentation/transforms/java/overview/">Java</a>, +<a href="/documentation/transforms/python/overview/">Python</a>)</li> </ul> -<p>These derive from the intuitions of batch and stream processing, but the two -are unified in Beam and bounded and unbounded PCollections can coexist in the -same pipeline. If your runner can only support bounded PCollections, you&rsquo;ll -need to reject pipelines that contain unbounded PCollections. If your -runner is only really targeting streams, there are adapters in our support code -to convert everything to APIs targeting unbounded data.</p> +<h3 id="pcollection">PCollection</h3> +<p>A <code>PCollection</code> is an unordered bag of elements. Each <code>PCollection</code> is a +potentially distributed, homogeneous data set or data stream, and is owned by +the specific <code>Pipeline</code> object for which it is created. Multiple pipelines +cannot share a <code>PCollection</code>. 
Beam pipelines process PCollections, and the +runner is responsible for storing these elements.</p> +<p>A <code>PCollection</code> generally contains &ldquo;big data&rdquo; (too much data to fit in memory on +a single machine). Sometimes a small sample of data or an intermediate result +might fit into memory on a single machine, but Beam&rsquo;s computational patterns and +transforms are focused on situations where distributed data-parallel computation +is required. Therefore, the elements of a <code>PCollection</code> cannot be processed +individually, and are instead processed uniformly in parallel.</p> +<p>The following characteristics of a <code>PCollection</code> are important to know.</p> +<h4 id="bounded-vs-unbounded">Bounded vs unbounded</h4> +<p>A <code>PCollection</code> can be either bounded or unbounded.</p> +<ul> +<li>A <em>bounded</em> <code>PCollection</code> is a dataset of a known, fixed size (alternatively, +a dataset that is not growing over time). Bounded data can be processed by +batch pipelines.</li> +<li>An <em>unbounded</em> <code>PCollection</code> is a dataset that grows over time, and the +elements are processed as they arrive. Unbounded data must be processed by +streaming pipelines.</li> +</ul> +<p>These two categories derive from the intuitions of batch and stream processing, +but the two are unified in Beam and bounded and unbounded PCollections can +coexist in the same pipeline. If your runner can only support bounded +PCollections, you must reject pipelines that contain unbounded PCollections. If +your runner is only targeting streams, there are adapters in Beam&rsquo;s support code +to convert everything to APIs that target unbounded data.</p> <h4 id="timestamps">Timestamps</h4> -<p>Every element in a PCollection has a timestamp associated with it.</p> -<p>When you execute a primitive connector to some storage system, that connector -is responsible for providing initial timestamps. 
Your runner will need to -propagate and aggregate timestamps. If the timestamp is not important, as with -certain batch processing jobs where elements do not denote events, they will be -the minimum representable timestamp, often referred to colloquially as -&ldquo;negative infinity&rdquo;.</p> +<p>Every element in a <code>PCollection</code> has a timestamp associated with it.</p> +<p>When you execute a primitive connector to a storage system, that connector is +responsible for providing initial timestamps. The runner must propagate and +aggregate timestamps. If the timestamp is not important, such as with certain +batch processing jobs where elements do not denote events, the timestamp will be +the minimum representable timestamp, often referred to colloquially as &ldquo;negative +infinity&rdquo;.</p> <h4 id="watermarks">Watermarks</h4> -<p>Every PCollection has to have a watermark that estimates how complete the -PCollection is.</p> +<p>Every <code>PCollection</code> must have a watermark that estimates how complete the +<code>PCollection</code> is.</p> <p>The watermark is a guess that &ldquo;we&rsquo;ll never see an element with an earlier -timestamp&rdquo;. Sources of data are responsible for producing a watermark. Your -runner needs to implement watermark propagation as PCollections are processed, -merged, and partitioned.</p> -<p>The contents of a PCollection are complete when a watermark advances to -&ldquo;infinity&rdquo;. In this manner, you may discover that an unbounded PCollection is +timestamp&rdquo;. Data sources are responsible for producing a watermark. The runner +must implement watermark propagation as PCollections are processed, merged, and +partitioned.</p> +<p>The contents of a <code>PCollection</code> are complete when a watermark advances to +&ldquo;infinity&rdquo;. 
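The watermark idea above can be made concrete with a small plain-Python model. This is a hedged conceptual sketch (not runner or apache_beam code; timestamps are bare integers here): the watermark is a lower bound over the timestamps of elements that have been seen but not yet processed, and it advances to "infinity" when nothing is pending:

```python
# Conceptual sketch: a watermark as an estimate of completeness.
# NOT the apache_beam API; timestamps are plain integers here.

INFINITY = float("inf")

class WatermarkTracker:
    def __init__(self):
        # Timestamps of elements seen but not yet processed.
        self.pending = []

    def receive(self, timestamp):
        self.pending.append(timestamp)

    def process_all(self):
        self.pending.clear()

    @property
    def watermark(self):
        # The guess: "we'll never see an element earlier than this".
        # With nothing pending, the source can declare completeness.
        return min(self.pending) if self.pending else INFINITY

tracker = WatermarkTracker()
tracker.receive(5)
tracker.receive(3)
print(tracker.watermark)  # 3: an element at timestamp 3 is still pending

tracker.process_all()
print(tracker.watermark)  # inf: the collection is complete
```

A real runner does considerably more, propagating watermarks through every merge and partition of the pipeline, but the invariant is the same: the watermark never claims a timestamp earlier than data that may still arrive.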
In this manner, you can discover that an unbounded PCollection is finite.</p> <h4 id="windowed-elements">Windowed elements</h4> -<p>Every element in a PCollection resides in a window. No element resides in -multiple windows (two elements can be equal except for their window, but they -are not the same).</p> -<p>When elements are read from the outside world they arrive in the global window. +<p>Every element in a <code>PCollection</code> resides in a window. No element resides in +multiple windows; two elements can be equal except for their window, but they +are not the same.</p> +<p>When elements are read from the outside world, they arrive in the global window. When they are written to the outside world, they are effectively placed back -into the global window (any writing transform that doesn&rsquo;t take this -perspective probably risks data loss).</p> -<p>A window has a maximum timestamp, and when the watermark exceeds this plus -user-specified allowed lateness the window is expired. All data related -to an expired window may be discarded at any time.</p> +into the global window. Transforms that write data and don&rsquo;t take this +perspective probably risk data loss.</p> +<p>A window has a maximum timestamp. When the watermark exceeds the maximum +timestamp plus the user-specified allowed lateness, the window is expired. All +data related to an expired window might be discarded at any time.</p> <h4 id="coder">Coder</h4> -<p>Every PCollection has a coder, a specification of the binary format of the elements.</p> -<p>In Beam, the user&rsquo;s pipeline may be written in a language other than the +<p>Every <code>PCollection</code> has a coder, which is a specification of the binary format +of the elements.</p> +<p>In Beam, the user&rsquo;s pipeline can be written in a language other than the language of the runner. There is no expectation that the runner can actually -deserialize user data.
So the Beam model operates principally on encoded data - -&ldquo;just bytes&rdquo;. Each PCollection has a declared encoding for its elements, called -a coder. A coder has a URN that identifies the encoding, and may have -additional sub-coders (for example, a coder for lists may contain a coder for -the elements of the list). Language-specific serialization techniques can, and -frequently are used, but there are a few key formats - such as key-value pairs -and timestamps - that are common so your runner can understand them.</p> -<h4 id="windowing-strategy">Windowing Strategy</h4> -<p>Every PCollection has a windowing strategy, a specification of essential -information for grouping and triggering operations.</p> -<p>The details will be discussed below when we discuss the -<a href="#implementing-the-window-primitive">Window</a> primitive, which sets up the -windowing strategy, and -<a href="#implementing-the-groupbykey-and-window-primitive">GroupByKey</a> primitive, -which has behavior governed by the windowing strategy.</p> +deserialize user data. The Beam model operates principally on encoded data, +&ldquo;just bytes&rdquo;. Each <code>PCollection</code> has a declared encoding for its elements, +called a coder. A coder has a URN that identifies the encoding, and might have +additional sub-coders. For example, a coder for lists might contain a coder for +the elements of the list. Language-specific serialization techniques are +frequently used, but there are a few common key formats (such as key-value pairs +and timestamps) so the runner can understand them.</p> +<h4 id="windowing-strategy">Windowing strategy</h4> +<p>Every <code>PCollection</code> has a windowing strategy, which is a specification of +essential information for grouping and triggering operations. 
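To make the "just bytes" idea of a coder concrete, here is a minimal sketch in plain Python. This illustrates the concept only and is not Beam's coders API: a composite key-value coder built from a string sub-coder, with a length prefix so the decoder can recover the boundary between key and value:

```python
import struct

# Conceptual sketch of a coder: a declared binary encoding for elements.
# NOT the apache_beam coders API; the length-prefix framing is illustrative.

def encode_str(s):
    data = s.encode("utf-8")
    # 4-byte big-endian length prefix, then the UTF-8 bytes.
    return struct.pack(">I", len(data)) + data

def decode_str(buf, offset=0):
    (length,) = struct.unpack_from(">I", buf, offset)
    start = offset + 4
    return buf[start:start + length].decode("utf-8"), start + length

def encode_kv(key, value):
    # A composite coder: composed of sub-coders for key and value.
    return encode_str(key) + encode_str(value)

def decode_kv(buf):
    key, offset = decode_str(buf)
    value, _ = decode_str(buf, offset)
    return key, value

encoded = encode_kv("user", "alice")
print(decode_kv(encoded))  # ('user', 'alice')
```

Because both sides agree only on the byte layout, an encoder written in one language and a decoder in another can interoperate, which is exactly why a runner can shuffle and group key-value pairs without ever deserializing the user's element types.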
The <code>Window</code> +transform sets up the windowing strategy, and the <code>GroupByKey</code> transform has +behavior that is governed by the windowing strategy.</p> +<br/> +<p>For more information about PCollections, see the following page:</p> +<ul> +<li><a href="/documentation/programming-guide/#pcollections">Beam Programming Guide: PCollections</a></li> +</ul> <h3 id="user-defined-functions-udfs">User-Defined Functions (UDFs)</h3> <p>Beam has seven varieties of user-defined function (UDF). A Beam pipeline may contain UDFs written in a language other than your runner, or even multiple diff --git a/website/generated-content/sitemap.xml b/website/generated-content/sitemap.xml index b9c3307..5fed307 100644 --- a/website/generated-content/sitemap.xml +++ b/website/generated-content/sitemap.xml @@ -1 +1 @@ -<?xml version="1.0" encoding="utf-8" standalone="yes"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml"><url><loc>/blog/beam-2.33.0/</loc><lastmod>2021-10-11T18:22:03-07:00</lastmod></url><url><loc>/categories/blog/</loc><lastmod>2021-10-11T18:22:03-07:00</lastmod></url><url><loc>/blog/</loc><lastmod>2021-10-11T18:22:03-07:00</lastmod></url><url><loc>/categories/</loc><lastmod>2021-10-11T18:22:03-07:00</lastmod></url><url><loc>/blog/b [...] \ No newline at end of file +<?xml version="1.0" encoding="utf-8" standalone="yes"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml"><url><loc>/blog/beam-2.33.0/</loc><lastmod>2021-10-11T18:22:03-07:00</lastmod></url><url><loc>/categories/blog/</loc><lastmod>2021-10-11T18:22:03-07:00</lastmod></url><url><loc>/blog/</loc><lastmod>2021-10-11T18:22:03-07:00</lastmod></url><url><loc>/categories/</loc><lastmod>2021-10-11T18:22:03-07:00</lastmod></url><url><loc>/blog/b [...] \ No newline at end of file