Modified: samza/site/learn/documentation/latest/index.html
URL: 
http://svn.apache.org/viewvc/samza/site/learn/documentation/latest/index.html?rev=1906774&r1=1906773&r2=1906774&view=diff
==============================================================================
--- samza/site/learn/documentation/latest/index.html (original)
+++ samza/site/learn/documentation/latest/index.html Wed Jan 18 19:33:25 2023
@@ -227,6 +227,12 @@
     
       
         
+      <a class="side-navigation__group-item" data-match-active="" 
href="/releases/1.8.0">1.8.0</a>
+      
+        
+      <a class="side-navigation__group-item" data-match-active="" 
href="/releases/1.7.0">1.7.0</a>
+      
+        
       <a class="side-navigation__group-item" data-match-active="" 
href="/releases/1.6.0">1.6.0</a>
       
         
@@ -538,6 +544,14 @@
               
               
 
+              <li class="hide"><a 
href="/learn/documentation/1.8.0/">1.8.0</a></li>
+
+              
+
+              <li class="hide"><a 
href="/learn/documentation/1.7.0/">1.7.0</a></li>
+
+              
+
               <li class="hide"><a 
href="/learn/documentation/1.6.0/">1.6.0</a></li>
 
               

Modified: samza/site/learn/documentation/latest/introduction/architecture.html
URL: 
http://svn.apache.org/viewvc/samza/site/learn/documentation/latest/introduction/architecture.html?rev=1906774&r1=1906773&r2=1906774&view=diff
==============================================================================
--- samza/site/learn/documentation/latest/introduction/architecture.html 
(original)
+++ samza/site/learn/documentation/latest/introduction/architecture.html Wed 
Jan 18 19:33:25 2023
@@ -227,6 +227,12 @@
     
       
         
+      <a class="side-navigation__group-item" data-match-active="" 
href="/releases/1.8.0">1.8.0</a>
+      
+        
+      <a class="side-navigation__group-item" data-match-active="" 
href="/releases/1.7.0">1.7.0</a>
+      
+        
       <a class="side-navigation__group-item" data-match-active="" 
href="/releases/1.6.0">1.6.0</a>
       
         
@@ -538,6 +544,14 @@
               
               
 
+              <li class="hide"><a 
href="/learn/documentation/1.8.0/introduction/architecture">1.8.0</a></li>
+
+              
+
+              <li class="hide"><a 
href="/learn/documentation/1.7.0/introduction/architecture">1.7.0</a></li>
+
+              
+
               <li class="hide"><a 
href="/learn/documentation/1.6.0/introduction/architecture">1.6.0</a></li>
 
               
@@ -642,90 +656,90 @@
 <p>Samza is made up of three layers:</p>
 
 <ol>
-<li>A streaming layer.</li>
-<li>An execution layer.</li>
-<li>A processing layer.</li>
+  <li>A streaming layer.</li>
+  <li>An execution layer.</li>
+  <li>A processing layer.</li>
 </ol>
 
 <p>Samza provides out of the box support for all three layers.</p>
 
 <ol>
-<li><strong>Streaming:</strong> <a 
href="http://kafka.apache.org/">Kafka</a></li>
-<li><strong>Execution:</strong> <a 
href="http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html">YARN</a></li>
-<li><strong>Processing:</strong> <a href="../api/overview.html">Samza 
API</a></li>
+  <li><strong>Streaming:</strong> <a 
href="http://kafka.apache.org/">Kafka</a></li>
+  <li><strong>Execution:</strong> <a 
href="http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html">YARN</a></li>
+  <li><strong>Processing:</strong> <a href="../api/overview.html">Samza 
API</a></li>
 </ol>
 
 <p>These three pieces fit together to form Samza:</p>
 
-<p><img src="/img/latest/learn/documentation/introduction/samza-ecosystem.png" 
alt="diagram-medium"></p>
+<p><img src="/img/latest/learn/documentation/introduction/samza-ecosystem.png" 
alt="diagram-medium" /></p>
 
 <p>This architecture follows a similar pattern to Hadoop (which also uses YARN 
as execution layer, HDFS for storage, and MapReduce as processing API):</p>
 
-<p><img src="/img/latest/learn/documentation/introduction/samza-hadoop.png" 
alt="diagram-medium"></p>
+<p><img src="/img/latest/learn/documentation/introduction/samza-hadoop.png" 
alt="diagram-medium" /></p>
 
-<p>Before going in-depth on each of these three layers, it should be noted 
that Samza&rsquo;s support is not limited to Kafka and YARN. Both Samza&rsquo;s 
execution and streaming layer are pluggable, and allow developers to implement 
alternatives if they prefer.</p>
+<p>Before going in-depth on each of these three layers, it should be noted 
that Samza’s support is not limited to Kafka and YARN. Both Samza’s 
execution and streaming layer are pluggable, and allow developers to implement 
alternatives if they prefer.</p>
 
 <h3 id="kafka">Kafka</h3>
 
-<p><a href="http://kafka.apache.org/">Kafka</a> is a distributed pub/sub and 
message queueing system that provides at-least once messaging guarantees (i.e. 
the system guarantees that no messages are lost, but in certain fault 
scenarios, a consumer might receive the same message more than once), and 
highly available partitions (i.e. a stream&rsquo;s partitions continue to be 
available even if a machine goes down).</p>
+<p><a href="http://kafka.apache.org/">Kafka</a> is a distributed pub/sub and 
message queueing system that provides at-least once messaging guarantees (i.e. 
the system guarantees that no messages are lost, but in certain fault 
scenarios, a consumer might receive the same message more than once), and 
highly available partitions (i.e. a stream’s partitions continue to be 
available even if a machine goes down).</p>
 
 <p>In Kafka, each stream is called a <em>topic</em>. Each topic is partitioned 
and replicated across multiple machines called <em>brokers</em>. When a 
<em>producer</em> sends a message to a topic, it provides a key, which is used 
to determine which partition the message should be sent to. The Kafka brokers 
receive and store the messages that the producer sends. Kafka 
<em>consumers</em> can then read from a topic by subscribing to messages on all 
partitions of a topic.</p>
 
-<p>Kafka has some interesting properties: </p>
+<p>Kafka has some interesting properties:</p>
 
 <ul>
-<li>All messages with the same key are guaranteed to be in the same topic 
partition. This means that if you wish to read all messages for a specific user 
ID, you only have to read the messages from the partition that contains the 
user ID, not the whole topic (assuming the user ID is used as key).</li>
-<li>A topic partition is a sequence of messages in order of arrival, so you 
can reference any message in the partition using a monotonically increasing 
<em>offset</em> (like an index into an array). This means that the broker 
doesn&rsquo;t need to keep track of which messages have been seen by a 
particular consumer &mdash; the consumer can keep track itself by storing the 
offset of the last message it has processed. It then knows that every message 
with a lower offset than the current offset has already been processed; every 
message with a higher offset has not yet been processed.</li>
+  <li>All messages with the same key are guaranteed to be in the same topic 
partition. This means that if you wish to read all messages for a specific user 
ID, you only have to read the messages from the partition that contains the 
user ID, not the whole topic (assuming the user ID is used as key).</li>
+  <li>A topic partition is a sequence of messages in order of arrival, so you 
can reference any message in the partition using a monotonically increasing 
<em>offset</em> (like an index into an array). This means that the broker 
doesn’t need to keep track of which messages have been seen by a particular 
consumer — the consumer can keep track itself by storing the offset of the 
last message it has processed. It then knows that every message with a lower 
offset than the current offset has already been processed; every message with a 
higher offset has not yet been processed.</li>
 </ul>
 
-<p>For more details on Kafka, see Kafka&rsquo;s <a 
href="http://kafka.apache.org/documentation.html">documentation</a> pages.</p>
+<p>For more details on Kafka, see Kafka’s <a 
href="http://kafka.apache.org/documentation.html">documentation</a> pages.</p>
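The consumer-side offset tracking the paragraphs above describe — the broker stores no per-consumer state, and the consumer just remembers the offset of the last message it processed — can be sketched as a toy model (illustrative Python only, not Kafka's actual client API):

```python
# Toy model of a topic partition and a self-tracking consumer, as described
# above. Offsets are simply array indices into an append-only log.

class TopicPartition:
    """An append-only sequence of messages; the offset is the index."""
    def __init__(self):
        self.messages = []

    def append(self, message):
        self.messages.append(message)
        return len(self.messages) - 1  # offset of the new message

class Consumer:
    """Keeps its own offset; the 'broker' tracks nothing about it."""
    def __init__(self, partition):
        self.partition = partition
        self.offset = 0  # next message to process

    def poll(self):
        out = []
        while self.offset < len(self.partition.messages):
            out.append(self.partition.messages[self.offset])
            self.offset += 1  # everything below this offset is processed
        return out

p = TopicPartition()
p.append("view:alice")
p.append("view:bob")
c = Consumer(p)
first = c.poll()   # ["view:alice", "view:bob"]
p.append("view:alice")
second = c.poll()  # ["view:alice"] — only the message above the stored offset
```

Because the position lives entirely in the consumer, a restarted consumer that has durably saved `offset` resumes exactly where it left off.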
 
 <h3 id="yarn">YARN</h3>
 
-<p><a 
href="http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html">YARN</a>
 (Yet Another Resource Negotiator) is Hadoop&rsquo;s next-generation cluster 
scheduler. It allows you to allocate a number of <em>containers</em> 
(processes) in a cluster of machines, and execute arbitrary commands on 
them.</p>
+<p><a 
href="http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html">YARN</a>
 (Yet Another Resource Negotiator) is Hadoop’s next-generation cluster 
scheduler. It allows you to allocate a number of <em>containers</em> 
(processes) in a cluster of machines, and execute arbitrary commands on 
them.</p>
 
 <p>When an application interacts with YARN, it looks something like this:</p>
 
 <ol>
-<li><strong>Application</strong>: I want to run command X on two machines with 
512MB memory.</li>
-<li><strong>YARN</strong>: Cool, where&rsquo;s your code?</li>
-<li><strong>Application</strong>: http://path.to.host/jobs/download/my.tgz</li>
-<li><strong>YARN</strong>: I&rsquo;m running your job on node-1.grid and 
node-2.grid.</li>
+  <li><strong>Application</strong>: I want to run command X on two machines 
with 512MB memory.</li>
+  <li><strong>YARN</strong>: Cool, where’s your code?</li>
+  <li><strong>Application</strong>: 
http://path.to.host/jobs/download/my.tgz</li>
+  <li><strong>YARN</strong>: I’m running your job on node-1.grid and 
node-2.grid.</li>
 </ol>
 
 <p>Samza uses YARN to manage deployment, fault tolerance, logging, resource 
isolation, security, and locality. A brief overview of YARN is below; see <a 
href="http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/">this
 page from Hortonworks</a> for a much better overview.</p>
 
 <h4 id="yarn-architecture">YARN Architecture</h4>
 
-<p>YARN has three important pieces: a <em>ResourceManager</em>, a 
<em>NodeManager</em>, and an <em>ApplicationMaster</em>. In a YARN grid, every 
machine runs a NodeManager, which is responsible for launching processes on 
that machine. A ResourceManager talks to all of the NodeManagers to tell them 
what to run. Applications, in turn, talk to the ResourceManager when they wish 
to run something on the cluster. The third piece, the ApplicationMaster, is 
actually application-specific code that runs in the YARN cluster. It&rsquo;s 
responsible for managing the application&rsquo;s workload, asking for 
containers (usually UNIX processes), and handling notifications when one of its 
containers fails.</p>
+<p>YARN has three important pieces: a <em>ResourceManager</em>, a 
<em>NodeManager</em>, and an <em>ApplicationMaster</em>. In a YARN grid, every 
machine runs a NodeManager, which is responsible for launching processes on 
that machine. A ResourceManager talks to all of the NodeManagers to tell them 
what to run. Applications, in turn, talk to the ResourceManager when they wish 
to run something on the cluster. The third piece, the ApplicationMaster, is 
actually application-specific code that runs in the YARN cluster. It’s 
responsible for managing the application’s workload, asking for containers 
(usually UNIX processes), and handling notifications when one of its containers 
fails.</p>
 
 <h4 id="samza-and-yarn">Samza and YARN</h4>
 
 <p>Samza provides a YARN ApplicationMaster and a YARN job runner out of the 
box. The integration between Samza and YARN is outlined in the following 
diagram (different colors indicate different host machines):</p>
 
-<p><img 
src="/img/latest/learn/documentation/introduction/samza-yarn-integration.png" 
alt="diagram-small"></p>
+<p><img 
src="/img/latest/learn/documentation/introduction/samza-yarn-integration.png" 
alt="diagram-small" /></p>
 
-<p>The Samza client talks to the YARN RM when it wants to start a new Samza 
job. The YARN RM talks to a YARN NM to allocate space on the cluster for 
Samza&rsquo;s ApplicationMaster. Once the NM allocates space, it starts the 
Samza AM. After the Samza AM starts, it asks the YARN RM for one or more YARN 
containers to run <a 
href="../container/samza-container.html">SamzaContainers</a>. Again, the RM 
works with NMs to allocate space for the containers. Once the space has been 
allocated, the NMs start the Samza containers.</p>
+<p>The Samza client talks to the YARN RM when it wants to start a new Samza 
job. The YARN RM talks to a YARN NM to allocate space on the cluster for 
Samza’s ApplicationMaster. Once the NM allocates space, it starts the Samza 
AM. After the Samza AM starts, it asks the YARN RM for one or more YARN 
containers to run <a 
href="../container/samza-container.html">SamzaContainers</a>. Again, the RM 
works with NMs to allocate space for the containers. Once the space has been 
allocated, the NMs start the Samza containers.</p>
 
 <h3 id="samza">Samza</h3>
 
 <p>Samza uses YARN and Kafka to provide a framework for stage-wise stream 
processing and partitioning. Everything, put together, looks like this 
(different colors indicate different host machines):</p>
 
-<p><img 
src="/img/latest/learn/documentation/introduction/samza-yarn-kafka-integration.png"
 alt="diagram-small"></p>
+<p><img 
src="/img/latest/learn/documentation/introduction/samza-yarn-kafka-integration.png"
 alt="diagram-small" /></p>
 
 <p>The Samza client uses YARN to run a Samza job: YARN starts and supervises 
one or more <a href="../container/samza-container.html">SamzaContainers</a>, 
and your processing code (using the <a 
href="../api/overview.html">StreamTask</a> API) runs inside those containers. 
The input and output for the Samza StreamTasks come from Kafka brokers that are 
(usually) co-located on the same machines as the YARN NMs.</p>
 
 <h3 id="example">Example</h3>
 
-<p>Let&rsquo;s take a look at a real example: suppose we want to count the 
number of page views. In SQL, you would write something like:</p>
+<p>Let’s take a look at a real example: suppose we want to count the number 
of page views. In SQL, you would write something like:</p>
 
-<figure class="highlight"><pre><code class="language-sql" 
data-lang="sql"><span></span><span class="k">SELECT</span> <span 
class="n">user_id</span><span class="p">,</span> <span 
class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span 
class="p">)</span> <span class="k">FROM</span> <span 
class="n">PageViewEvent</span> <span class="k">GROUP</span> <span 
class="k">BY</span> <span class="n">user_id</span></code></pre></figure>
+<figure class="highlight"><pre><code class="language-sql" 
data-lang="sql"><span class="k">SELECT</span> <span 
class="n">user_id</span><span class="p">,</span> <span 
class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span 
class="p">)</span> <span class="k">FROM</span> <span 
class="n">PageViewEvent</span> <span class="k">GROUP</span> <span 
class="k">BY</span> <span class="n">user_id</span></code></pre></figure>
 
-<p>Although Samza doesn&rsquo;t support SQL right now, the idea is the same. 
Two jobs are required to calculate this query: one to group messages by user 
ID, and the other to do the counting.</p>
+<p>Although Samza doesn’t support SQL right now, the idea is the same. Two 
jobs are required to calculate this query: one to group messages by user ID, 
and the other to do the counting.</p>
 
-<p>In the first job, the grouping is done by sending all messages with the 
same user ID to the same partition of an intermediate topic. You can do this by 
using the user ID as key of the messages that are emitted by the first job, and 
this key is mapped to one of the intermediate topic&rsquo;s partitions (usually 
by taking a hash of the key mod the number of partitions). The second job 
consumes the intermediate topic. Each task in the second job consumes one 
partition of the intermediate topic, i.e. all the messages for a subset of user 
IDs. The task has a counter for each user ID in its partition, and the 
appropriate counter is incremented every time the task receives a message with 
a particular user ID.</p>
+<p>In the first job, the grouping is done by sending all messages with the 
same user ID to the same partition of an intermediate topic. You can do this by 
using the user ID as key of the messages that are emitted by the first job, and 
this key is mapped to one of the intermediate topic’s partitions (usually by 
taking a hash of the key mod the number of partitions). The second job consumes 
the intermediate topic. Each task in the second job consumes one partition of 
the intermediate topic, i.e. all the messages for a subset of user IDs. The 
task has a counter for each user ID in its partition, and the appropriate 
counter is incremented every time the task receives a message with a particular 
user ID.</p>
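The two-job pipeline described above can be sketched in miniature (illustrative Python; plain dictionaries and lists stand in for topics and tasks, and `partition_for` mirrors the usual hash-of-key-mod-partition-count assignment — none of these names come from the Samza API):

```python
# Sketch of the two-job GROUP BY/COUNT pipeline described above:
# job 1 repartitions page views by user ID, job 2 keeps one counter per
# user ID within each partition it consumes.

NUM_PARTITIONS = 4

def partition_for(key):
    # The usual assignment: hash of the key mod the number of partitions.
    return hash(key) % NUM_PARTITIONS

# Job 1: emit each page-view event to an intermediate topic, keyed by user ID,
# so all events for one user land in the same partition.
intermediate = {p: [] for p in range(NUM_PARTITIONS)}
page_views = ["alice", "bob", "alice", "carol", "alice", "bob"]
for user_id in page_views:
    intermediate[partition_for(user_id)].append(user_id)

# Job 2: one task per partition; each task counts only its own user IDs.
counts = {}
for partition, events in intermediate.items():
    task_counters = {}  # this task's local state
    for user_id in events:
        task_counters[user_id] = task_counters.get(user_id, 0) + 1
    counts.update(task_counters)  # merged here only for display

# counts == {"alice": 3, "bob": 2, "carol": 1}
```

Since every event for a given user ID hashes to the same partition, no two tasks ever count the same user, and the per-task counters are globally correct without coordination.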
 
-<p><img 
src="/img/latest/learn/documentation/introduction/group-by-example.png" 
alt="Repartitioning for a GROUP BY" class="diagram-large"></p>
+<p><img 
src="/img/latest/learn/documentation/introduction/group-by-example.png" 
alt="Repartitioning for a GROUP BY" class="diagram-large" /></p>
 
 <p>If you are familiar with Hadoop, you may recognize this as a Map/Reduce 
operation, where each record is associated with a particular key in the 
mappers, records with the same key are grouped together by the framework, and 
then counted in the reduce step. The difference between Hadoop and Samza is 
that Hadoop operates on a fixed input, whereas Samza works with unbounded 
streams of data.</p>
 
@@ -733,7 +747,7 @@
 
 <p>By partitioning topics, and by breaking a stream process down into jobs and 
parallel tasks that run on multiple machines, Samza scales to streams with very 
high message throughput. By using YARN and Kafka, Samza achieves 
fault-tolerance: if a process or machine fails, it is automatically restarted 
on another machine and continues processing messages from the point where it 
left off.</p>
 
-<h2 id="comparison-introduction"><a 
href="../comparisons/introduction.html">Comparison Introduction &raquo;</a></h2>
+<h2 id="comparison-introduction-"><a 
href="../comparisons/introduction.html">Comparison Introduction »</a></h2>
 
            
         </div>

Modified: samza/site/learn/documentation/latest/introduction/background.html
URL: 
http://svn.apache.org/viewvc/samza/site/learn/documentation/latest/introduction/background.html?rev=1906774&r1=1906773&r2=1906774&view=diff
==============================================================================
--- samza/site/learn/documentation/latest/introduction/background.html 
(original)
+++ samza/site/learn/documentation/latest/introduction/background.html Wed Jan 
18 19:33:25 2023
@@ -227,6 +227,12 @@
     
       
         
+      <a class="side-navigation__group-item" data-match-active="" 
href="/releases/1.8.0">1.8.0</a>
+      
+        
+      <a class="side-navigation__group-item" data-match-active="" 
href="/releases/1.7.0">1.7.0</a>
+      
+        
       <a class="side-navigation__group-item" data-match-active="" 
href="/releases/1.6.0">1.6.0</a>
       
         
@@ -538,6 +544,14 @@
               
               
 
+              <li class="hide"><a 
href="/learn/documentation/1.8.0/introduction/background">1.8.0</a></li>
+
+              
+
+              <li class="hide"><a 
href="/learn/documentation/1.7.0/introduction/background">1.7.0</a></li>
+
+              
+
               <li class="hide"><a 
href="/learn/documentation/1.6.0/introduction/background">1.6.0</a></li>
 
               
@@ -645,56 +659,56 @@
 
 <p>Messaging systems are a popular way of implementing near-realtime 
asynchronous computation. Messages can be added to a message queue (ActiveMQ, 
RabbitMQ), pub-sub system (Kestrel, Kafka), or log aggregation system (Flume, 
Scribe) when something happens. Downstream <em>consumers</em> read messages 
from these systems, and process them or take actions based on the message 
contents.</p>
 
-<p>Suppose you have a website, and every time someone loads a page, you send a 
&ldquo;user viewed page&rdquo; event to a messaging system. You might then have 
consumers which do any of the following:</p>
+<p>Suppose you have a website, and every time someone loads a page, you send a 
“user viewed page” event to a messaging system. You might then have 
consumers which do any of the following:</p>
 
 <ul>
-<li>Store the message in Hadoop for future analysis</li>
-<li>Count page views and update a dashboard</li>
-<li>Trigger an alert if a page view fails</li>
-<li>Send an email notification to another user</li>
-<li>Join the page view event with the user&rsquo;s profile, and send the 
message back to the messaging system</li>
+  <li>Store the message in Hadoop for future analysis</li>
+  <li>Count page views and update a dashboard</li>
+  <li>Trigger an alert if a page view fails</li>
+  <li>Send an email notification to another user</li>
+  <li>Join the page view event with the user’s profile, and send the message 
back to the messaging system</li>
 </ul>
 
 <p>A messaging system lets you decouple all of this work from the actual web 
page serving.</p>
 
 <h3 id="what-is-stream-processing">What is stream processing?</h3>
 
-<p>A messaging system is a fairly low-level piece of infrastructure&mdash;it 
stores messages and waits for consumers to consume them. When you start writing 
code that produces or consumes messages, you quickly find that there are a lot 
of tricky problems that have to be solved in the processing layer. Samza aims 
to help with these problems.</p>
+<p>A messaging system is a fairly low-level piece of infrastructure—it 
stores messages and waits for consumers to consume them. When you start writing 
code that produces or consumes messages, you quickly find that there are a lot 
of tricky problems that have to be solved in the processing layer. Samza aims 
to help with these problems.</p>
 
-<p>Consider the counting example, above (count page views and update a 
dashboard). What happens when the machine that your consumer is running on 
fails, and your current counter values are lost? How do you recover? Where 
should the processor be run when it restarts? What if the underlying messaging 
system sends you the same message twice, or loses a message? (Unless you are 
careful, your counts will be incorrect.) What if you want to count page views 
grouped by the page URL? How do you distribute the computation across multiple 
machines if it&rsquo;s too much for a single machine to handle?</p>
+<p>Consider the counting example, above (count page views and update a 
dashboard). What happens when the machine that your consumer is running on 
fails, and your current counter values are lost? How do you recover? Where 
should the processor be run when it restarts? What if the underlying messaging 
system sends you the same message twice, or loses a message? (Unless you are 
careful, your counts will be incorrect.) What if you want to count page views 
grouped by the page URL? How do you distribute the computation across multiple 
machines if it’s too much for a single machine to handle?</p>
 
-<p>Stream processing is a higher level of abstraction on top of messaging 
systems, and it&rsquo;s meant to address precisely this category of 
problems.</p>
+<p>Stream processing is a higher level of abstraction on top of messaging 
systems, and it’s meant to address precisely this category of problems.</p>
 
 <h3 id="samza">Samza</h3>
 
 <p>Samza is a stream processing framework with the following features:</p>
 
 <ul>
-<li><strong>Simple API:</strong> Unlike most low-level messaging system APIs, 
Samza provides a very simple callback-based &ldquo;process message&rdquo; API 
comparable to MapReduce.</li>
-<li><strong>Managed state:</strong> Samza manages snapshotting and restoration 
of a stream processor&rsquo;s state. When the processor is restarted, Samza 
restores its state to a consistent snapshot. Samza is built to handle large 
amounts of state (many gigabytes per partition).</li>
-<li><strong>Fault tolerance:</strong> Whenever a machine in the cluster fails, 
Samza works with YARN to transparently migrate your tasks to another 
machine.</li>
-<li><strong>Durability:</strong> Samza uses Kafka to guarantee that messages 
are processed in the order they were written to a partition, and that no 
messages are ever lost.</li>
-<li><strong>Scalability:</strong> Samza is partitioned and distributed at 
every level. Kafka provides ordered, partitioned, replayable, fault-tolerant 
streams. YARN provides a distributed environment for Samza containers to run 
in.</li>
-<li><strong>Pluggable:</strong> Though Samza works out of the box with Kafka 
and YARN, Samza provides a pluggable API that lets you run Samza with other 
messaging systems and execution environments.</li>
-<li><strong>Processor isolation:</strong> Samza works with Apache YARN, which 
supports Hadoop&rsquo;s security model, and resource isolation through Linux 
CGroups.</li>
+  <li><strong>Simple API:</strong> Unlike most low-level messaging system 
APIs, Samza provides a very simple callback-based “process message” API 
comparable to MapReduce.</li>
+  <li><strong>Managed state:</strong> Samza manages snapshotting and 
restoration of a stream processor’s state. When the processor is restarted, 
Samza restores its state to a consistent snapshot. Samza is built to handle 
large amounts of state (many gigabytes per partition).</li>
+  <li><strong>Fault tolerance:</strong> Whenever a machine in the cluster 
fails, Samza works with YARN to transparently migrate your tasks to another 
machine.</li>
+  <li><strong>Durability:</strong> Samza uses Kafka to guarantee that messages 
are processed in the order they were written to a partition, and that no 
messages are ever lost.</li>
+  <li><strong>Scalability:</strong> Samza is partitioned and distributed at 
every level. Kafka provides ordered, partitioned, replayable, fault-tolerant 
streams. YARN provides a distributed environment for Samza containers to run 
in.</li>
+  <li><strong>Pluggable:</strong> Though Samza works out of the box with Kafka 
and YARN, Samza provides a pluggable API that lets you run Samza with other 
messaging systems and execution environments.</li>
+  <li><strong>Processor isolation:</strong> Samza works with Apache YARN, 
which supports Hadoop’s security model, and resource isolation through Linux 
CGroups.</li>
 </ul>
 
 <h3 id="alternatives">Alternatives</h3>
 
-<p>The available open source stream processing systems are actually quite 
young, and no single system offers a complete solution. New problems in this 
area include: how a stream processor&rsquo;s state should be managed, whether 
or not a stream should be buffered remotely on disk, what to do when duplicate 
messages are received or messages are lost, and how to model underlying 
messaging systems.</p>
+<p>The available open source stream processing systems are actually quite 
young, and no single system offers a complete solution. New problems in this 
area include: how a stream processor’s state should be managed, whether or 
not a stream should be buffered remotely on disk, what to do when duplicate 
messages are received or messages are lost, and how to model underlying 
messaging systems.</p>
 
-<p>Samza&rsquo;s main differentiators are:</p>
+<p>Samza’s main differentiators are:</p>
 
 <ul>
-<li>Samza supports fault-tolerant local state. State can be thought of as 
tables that are split up and co-located with the processing tasks. State is 
itself modeled as a stream. If the local state is lost due to machine failure, 
the state stream is replayed to restore it.</li>
-<li>Streams are ordered, partitioned, replayable, and fault tolerant.</li>
-<li>YARN is used for processor isolation, security, and fault tolerance.</li>
-<li>Jobs are decoupled: if one job goes slow and builds up a backlog of 
unprocessed messages, the rest of the system is not affected.</li>
+  <li>Samza supports fault-tolerant local state. State can be thought of as 
tables that are split up and co-located with the processing tasks. State is 
itself modeled as a stream. If the local state is lost due to machine failure, 
the state stream is replayed to restore it.</li>
+  <li>Streams are ordered, partitioned, replayable, and fault tolerant.</li>
+  <li>YARN is used for processor isolation, security, and fault tolerance.</li>
+  <li>Jobs are decoupled: if one job goes slow and builds up a backlog of 
unprocessed messages, the rest of the system is not affected.</li>
 </ul>
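The first differentiator in the list above — local state modeled as a stream that is replayed to restore it after a failure — can be sketched as a toy changelog (illustrative Python only, not Samza's actual storage API):

```python
# Sketch of the changelog idea behind fault-tolerant local state: every
# write to the local table is also appended to a durable state stream, so
# a restarted task rebuilds its table by replaying that stream.

class Task:
    def __init__(self, changelog):
        self.changelog = changelog      # durable state stream (a list here)
        self.table = {}                 # fast local state
        for key, value in changelog:    # restore by replaying the stream
            self.table[key] = value

    def put(self, key, value):
        self.changelog.append((key, value))  # log first, then apply locally
        self.table[key] = value

changelog = []
t1 = Task(changelog)
t1.put("alice", 1)
t1.put("alice", 2)
t1.put("bob", 1)

# Simulate a machine failure: the local table is lost, the changelog survives,
# and a replacement task replays it to reach the same state.
t2 = Task(changelog)
assert t2.table == {"alice": 2, "bob": 1}
```

Replay applies updates in their original order, so the last write per key wins and the restored table matches the pre-failure one.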
 
-<p>For a more in-depth discussion on Samza, and how it relates to other stream 
processing systems, have a look at Samza&rsquo;s <a 
href="../comparisons/introduction.html">Comparisons</a> documentation.</p>
+<p>For a more in-depth discussion on Samza, and how it relates to other stream 
processing systems, have a look at Samza’s <a 
href="../comparisons/introduction.html">Comparisons</a> documentation.</p>
 
-<h2 id="concepts"><a href="concepts.html">Concepts &raquo;</a></h2>
+<h2 id="concepts-"><a href="concepts.html">Concepts »</a></h2>
 
            
         </div>

Modified: samza/site/learn/documentation/latest/introduction/concepts.html
URL: 
http://svn.apache.org/viewvc/samza/site/learn/documentation/latest/introduction/concepts.html?rev=1906774&r1=1906773&r2=1906774&view=diff
==============================================================================
--- samza/site/learn/documentation/latest/introduction/concepts.html (original)
+++ samza/site/learn/documentation/latest/introduction/concepts.html Wed Jan 18 
19:33:25 2023
@@ -227,6 +227,12 @@
     
       
         
+      <a class="side-navigation__group-item" data-match-active="" 
href="/releases/1.8.0">1.8.0</a>
+      
+        
+      <a class="side-navigation__group-item" data-match-active="" 
href="/releases/1.7.0">1.7.0</a>
+      
+        
       <a class="side-navigation__group-item" data-match-active="" 
href="/releases/1.6.0">1.6.0</a>
       
         
@@ -538,6 +544,14 @@
               
               
 
+              <li class="hide"><a 
href="/learn/documentation/1.8.0/introduction/concepts">1.8.0</a></li>
+
+              
+
+              <li class="hide"><a 
href="/learn/documentation/1.7.0/introduction/concepts">1.7.0</a></li>
+
+              
+
               <li class="hide"><a 
href="/learn/documentation/1.6.0/introduction/concepts">1.6.0</a></li>
 
               
@@ -643,11 +657,11 @@
 
 <h3 id="streams">Streams</h3>
 
-<p>Samza processes <em>streams</em>. A stream is composed of immutable 
<em>messages</em> of a similar type or category. For example, a stream could be 
all the clicks on a website, or all the updates to a particular database table, 
or all the logs produced by a service, or any other type of event data. 
Messages can be appended to a stream or read from a stream. A stream can have 
any number of <em>consumers</em>, and reading from a stream doesn&rsquo;t 
delete the message (so each message is effectively broadcast to all consumers). 
Messages can optionally have an associated key which is used for partitioning, 
which we&rsquo;ll talk about in a second.</p>
+<p>Samza processes <em>streams</em>. A stream is composed of immutable 
<em>messages</em> of a similar type or category. For example, a stream could be 
all the clicks on a website, or all the updates to a particular database table, 
or all the logs produced by a service, or any other type of event data. 
Messages can be appended to a stream or read from a stream. A stream can have 
any number of <em>consumers</em>, and reading from a stream doesn’t delete 
the message (so each message is effectively broadcast to all consumers). 
Messages can optionally have an associated key which is used for partitioning, 
which we’ll talk about in a second.</p>
 
 <p>Samza supports pluggable <em>systems</em> that implement the stream 
abstraction: in <a href="https://kafka.apache.org/";>Kafka</a> a stream is a 
topic, in a database we might read a stream by consuming updates from a table, 
in Hadoop we might tail a directory of files in HDFS.</p>
 
-<p><img src="/img/latest/learn/documentation/introduction/job.png" 
alt="job"></p>
+<p><img src="/img/latest/learn/documentation/introduction/job.png" alt="job" 
/></p>
 
 <h3 id="jobs">Jobs</h3>
 
@@ -661,35 +675,35 @@
 
 <p>Each message in this sequence has an identifier called the <em>offset</em>, 
which is unique per partition. The offset can be a sequential integer, byte 
offset, or string depending on the underlying system implementation.</p>
 
-<p>When a message is appended to a stream, it is appended to only one of the 
stream&rsquo;s partitions. The assignment of the message to its partition is 
done with a key chosen by the writer. For example, if the user ID is used as 
the key, that ensures that all messages related to a particular user end up in 
the same partition.</p>
+<p>When a message is appended to a stream, it is appended to only one of the 
stream’s partitions. The assignment of the message to its partition is done 
with a key chosen by the writer. For example, if the user ID is used as the 
key, that ensures that all messages related to a particular user end up in the 
same partition.</p>
 
-<p><img src="/img/latest/learn/documentation/introduction/stream.png" 
alt="stream"></p>
+<p><img src="/img/latest/learn/documentation/introduction/stream.png" 
alt="stream" /></p>
 
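The key-to-partition assignment described above can be sketched as follows. This is purely illustrative, not Samza's (or any underlying system's) actual partitioner; the function name and hash choice are assumptions. The invariant it demonstrates is the one the text relies on: equal keys always map to the same partition.

```python
import hashlib

def assign_partition(key: str, num_partitions: int) -> int:
    """Deterministically map a message key to one partition.

    Illustrative sketch only -- real systems such as Kafka use their
    own partitioner, but the guarantee is the same: messages with
    equal keys always land in the same partition.
    """
    # Use a stable hash so the mapping is consistent across processes
    # (Python's built-in hash() is randomized per process).
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

For example, every message keyed by the same user ID is appended to the same partition, which is what preserves per-user ordering within a stream.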
 <h3 id="tasks">Tasks</h3>
 
-<p>A job is scaled by breaking it into multiple <em>tasks</em>. The 
<em>task</em> is the unit of parallelism of the job, just as the partition is 
to the stream. Each task consumes data from one partition for each of the 
job&rsquo;s input streams.</p>
+<p>A job is scaled by breaking it into multiple <em>tasks</em>. The 
<em>task</em> is the unit of parallelism of the job, just as the partition is 
to the stream. Each task consumes data from one partition for each of the 
job’s input streams.</p>
 
 <p>A task processes messages from each of its input partitions sequentially, 
in the order of message offset. There is no defined ordering across partitions. 
This allows each task to operate independently. The YARN scheduler assigns each 
task to a machine, so the job as a whole can be distributed across many 
machines.</p>
 
-<p>The number of tasks in a job is determined by the number of input 
partitions (there cannot be more tasks than input partitions, or there would be 
some tasks with no input). However, you can change the computational resources 
assigned to the job (the amount of memory, number of CPU cores, etc.) to 
satisfy the job&rsquo;s needs. See notes on <em>containers</em> below.</p>
+<p>The number of tasks in a job is determined by the number of input 
partitions (there cannot be more tasks than input partitions, or there would be 
some tasks with no input). However, you can change the computational resources 
assigned to the job (the amount of memory, number of CPU cores, etc.) to 
satisfy the job’s needs. See notes on <em>containers</em> below.</p>
 
 <p>The assignment of partitions to tasks never changes: if a task is on a 
machine that fails, the task is restarted elsewhere, still consuming the same 
stream partitions.</p>
 
-<p><img src="/img/latest/learn/documentation/introduction/job_detail.png" 
alt="job-detail"></p>
+<p><img src="/img/latest/learn/documentation/introduction/job_detail.png" 
alt="job-detail" /></p>
 
 <h3 id="dataflow-graphs">Dataflow Graphs</h3>
 
 <p>We can compose multiple jobs to create a dataflow graph, where the edges 
are streams containing data, and the nodes are jobs performing transformations. 
This composition is done purely through the streams the jobs take as input and 
output. The jobs are otherwise totally decoupled: they need not be implemented 
in the same code base, and adding, removing, or restarting a downstream job 
will not impact an upstream job.</p>
 
-<p>These graphs are often acyclic&mdash;that is, data usually doesn&rsquo;t 
flow from a job, through other jobs, back to itself. However, it is possible to 
create cyclic graphs if you need to.</p>
+<p>These graphs are often acyclic—that is, data usually doesn’t flow from 
a job, through other jobs, back to itself. However, it is possible to create 
cyclic graphs if you need to.</p>
 
-<p><img src="/img/latest/learn/documentation/introduction/dag.png" width="430" 
alt="Directed acyclic job graph"></p>
+<p><img src="/img/latest/learn/documentation/introduction/dag.png" width="430" 
alt="Directed acyclic job graph" /></p>
 
 <h3 id="containers">Containers</h3>
 
-<p>Partitions and tasks are both <em>logical</em> units of 
parallelism&mdash;they don&rsquo;t correspond to any particular assignment of 
computational resources (CPU, memory, disk space, etc). Containers are the unit 
of physical parallelism, and a container is essentially a Unix process (or 
Linux <a href="http://en.wikipedia.org/wiki/Cgroups";>cgroup</a>). Each 
container runs one or more tasks. The number of tasks is determined 
automatically from the number of partitions in the input and is fixed, but the 
number of containers (and the CPU and memory resources associated with them) is 
specified by the user at run time and can be changed at any time.</p>
+<p>Partitions and tasks are both <em>logical</em> units of parallelism—they 
don’t correspond to any particular assignment of computational resources 
(CPU, memory, disk space, etc). Containers are the unit of physical 
parallelism, and a container is essentially a Unix process (or Linux <a 
href="http://en.wikipedia.org/wiki/Cgroups";>cgroup</a>). Each container runs 
one or more tasks. The number of tasks is determined automatically from the 
number of partitions in the input and is fixed, but the number of containers 
(and the CPU and memory resources associated with them) is specified by the 
user at run time and can be changed at any time.</p>
 
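To make the task/container distinction concrete, here is a hedged sketch of spreading a fixed set of tasks over a user-chosen number of containers. Round-robin is used purely for illustration; Samza's actual task-to-container grouping is not shown here. Note which quantity is fixed (tasks, from input partitions) and which is tunable (containers).

```python
def group_tasks(num_tasks: int, num_containers: int) -> list[list[int]]:
    """Assign task IDs 0..num_tasks-1 to containers round-robin.

    Illustrative sketch: the task count is fixed by the number of
    input partitions, while the container count is the knob the
    user can change at run time.
    """
    groups: list[list[int]] = [[] for _ in range(num_containers)]
    for task_id in range(num_tasks):
        groups[task_id % num_containers].append(task_id)
    return groups
```

Raising the container count redistributes the same tasks over more processes; it never changes how many tasks exist.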
-<h2 id="architecture"><a href="architecture.html">Architecture &raquo;</a></h2>
+<h2 id="architecture-"><a href="architecture.html">Architecture »</a></h2>
 
            
         </div>

Modified: samza/site/learn/documentation/latest/jobs/configuration-table.html
URL: 
http://svn.apache.org/viewvc/samza/site/learn/documentation/latest/jobs/configuration-table.html?rev=1906774&r1=1906773&r2=1906774&view=diff
==============================================================================
--- samza/site/learn/documentation/latest/jobs/configuration-table.html 
(original)
+++ samza/site/learn/documentation/latest/jobs/configuration-table.html Wed Jan 
18 19:33:25 2023
@@ -1912,6 +1912,22 @@
                 </tr>
 
                 <tr>
+                    <td class="property" 
id="stores-rocksdb-max-open-files">stores.<span 
class="store">store-name</span>.<br>rocksdb.max.open.files</td>
+                    <td class="default">-1</td>
+                    <td class="description">
+                        Limits the number of open files that RocksDB can have 
open at one time.
+                    </td>
+                </tr>
+
+                <tr>
+                    <td class="property" 
id="stores-rocksdb-max-file-opening-threads">stores.<span 
class="store">store-name</span>.<br>rocksdb.max.file.opening.threads</td>
+                    <td class="default">16</td>
+                    <td class="description">
+                        Sets the number of threads used to open RocksDB files.
+                    </td>
+                </tr>
+
+                <tr>
                     <td class="property" 
id="stores-rocksdb-metrics">stores.<span 
class="store">store-name</span>.<br>rocksdb.metrics.list</td>
                     <td class="default"></td>
                     <td class="description">

Modified: samza/site/learn/documentation/latest/jobs/configuration.html
URL: 
http://svn.apache.org/viewvc/samza/site/learn/documentation/latest/jobs/configuration.html?rev=1906774&r1=1906773&r2=1906774&view=diff
==============================================================================
--- samza/site/learn/documentation/latest/jobs/configuration.html (original)
+++ samza/site/learn/documentation/latest/jobs/configuration.html Wed Jan 18 
19:33:25 2023
@@ -227,6 +227,12 @@
     
       
         
+      <a class="side-navigation__group-item" data-match-active="" 
href="/releases/1.8.0">1.8.0</a>
+      
+        
+      <a class="side-navigation__group-item" data-match-active="" 
href="/releases/1.7.0">1.7.0</a>
+      
+        
       <a class="side-navigation__group-item" data-match-active="" 
href="/releases/1.6.0">1.6.0</a>
       
         
@@ -538,6 +544,14 @@
               
               
 
+              <li class="hide"><a 
href="/learn/documentation/1.8.0/jobs/configuration">1.8.0</a></li>
+
+              
+
+              <li class="hide"><a 
href="/learn/documentation/1.7.0/jobs/configuration">1.7.0</a></li>
+
+              
+
               <li class="hide"><a 
href="/learn/documentation/1.6.0/jobs/configuration">1.6.0</a></li>
 
               
@@ -640,47 +654,47 @@
 -->
 
 <p>All Samza applications have a <a 
href="https://en.wikipedia.org/wiki/.properties";>properties format</a> file 
that defines its configurations.
-A complete list of configuration keys can be found on the <a 
href="samza-configurations.html"><strong>Samza Configurations 
Table</strong></a> page. </p>
+A complete list of configuration keys can be found on the <a 
href="samza-configurations.html"><strong>Samza Configurations 
Table</strong></a> page.</p>
 
 <p>A very basic configuration file looks like this:</p>
 
-<figure class="highlight"><pre><code class="language-jproperties" 
data-lang="jproperties"><span></span><span class="c"># Application 
Configurations</span>
-<span class="na">job.factory.class</span><span class="o">=</span><span 
class="s">org.apache.samza.job.local.YarnJobFactory</span>
-<span class="na">app.name</span><span class="o">=</span><span 
class="s">hello-world</span>
-<span class="na">job.default.system</span><span class="o">=</span><span 
class="s">example-system</span>
-<span class="na">serializers.registry.json.class</span><span 
class="o">=</span><span 
class="s">org.apache.samza.serializers.JsonSerdeFactory</span>
-<span class="na">serializers.registry.string.class</span><span 
class="o">=</span><span 
class="s">org.apache.samza.serializers.StringSerdeFactory</span>
-
-<span class="c"># Systems &amp; Streams Configurations</span>
-<span class="na">systems.example-system.samza.factory</span><span 
class="o">=</span><span 
class="s">samza.stream.example.ExampleConsumerFactory</span>
-<span class="na">systems.example-system.samza.key.serde</span><span 
class="o">=</span><span class="s">string</span>
-<span class="na">systems.example-system.samza.msg.serde</span><span 
class="o">=</span><span class="s">json</span>
-
-<span class="c"># Checkpointing</span>
-<span class="na">task.checkpoint.factory</span><span class="o">=</span><span 
class="s">org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory</span>
-
-<span class="c"># State Storage</span>
-<span class="na">stores.example-store.factory</span><span 
class="o">=</span><span 
class="s">org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory</span>
-<span class="na">stores.example-store.key.serde</span><span 
class="o">=</span><span class="s">string</span>
-<span class="na">stores.example-store.value.serde</span><span 
class="o">=</span><span class="s">json</span>
-
-<span class="c"># Metrics</span>
-<span class="na">metrics.reporter.example-reporter.class</span><span 
class="o">=</span><span 
class="s">org.apache.samza.metrics.reporter.JmxReporterFactory</span>
-<span class="na">metrics.reporters</span><span class="o">=</span><span 
class="s">example-reporter</span></code></pre></figure>
+<figure class="highlight"><pre><code class="language-jproperties" 
data-lang="jproperties"># Application Configurations
+job.factory.class=org.apache.samza.job.local.YarnJobFactory
+app.name=hello-world
+job.default.system=example-system
+serializers.registry.json.class=org.apache.samza.serializers.JsonSerdeFactory
+serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory
+
+# Systems &amp; Streams Configurations
+systems.example-system.samza.factory=samza.stream.example.ExampleConsumerFactory
+systems.example-system.samza.key.serde=string
+systems.example-system.samza.msg.serde=json
+
+# Checkpointing
+task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
+
+# State Storage
+stores.example-store.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
+stores.example-store.key.serde=string
+stores.example-store.value.serde=json
+
+# Metrics
+metrics.reporter.example-reporter.class=org.apache.samza.metrics.reporter.JmxReporterFactory
+metrics.reporters=example-reporter</code></pre></figure>
 
 <p>There are 6 sections to a configuration file:</p>
 
 <ol>
-<li>The <a 
href="samza-configurations.html#application-configurations"><strong>Application</strong></a>
 section defines things like the name of the job, job factory (See the 
job.factory.class property in <a href="samza-configurations.html">Configuration 
Table</a>), the class name for your <a 
href="../api/overview.html">StreamTask</a> and serialization and 
deserialization of specific objects that are received and sent along different 
streams.</li>
-<li>The <a href="samza-configurations.html#systems-streams"><strong>Systems 
&amp; Streams</strong></a> section defines systems that your StreamTask can 
read from along with the types of serdes used for sending keys and messages 
from that system. You may use any of the <a 
href="../connectors/overview.html">predefined systems</a> that Samza ships 
with, although you can also specify your own self-implemented Samza-compatible 
systems. See the <a href="/startup/hello-samza/latest">hello-samza example 
project</a>&lsquo;s Wikipedia system for a good example of a self-implemented 
system.</li>
-<li>The <a 
href="samza-configurations.html#checkpointing"><strong>Checkpointing</strong></a>
 section defines how the message-processing state is saved, which provides 
fault-tolerant processing of streams (See <a 
href="../container/checkpointing.html">Checkpointing</a> for more details).</li>
-<li>The <a href="samza-configurations.html#state-storage"><strong>State 
Storage</strong></a> section defines the <a 
href="../container/state-management.html">stateful stream processing</a> 
settings for Samza.</li>
-<li>The <a 
href="samza-configurations.html#deployment"><strong>Deployment</strong></a> 
section defines how the Samza application will be deployed (To a cluster 
manager (YARN), or as a standalone library) as well as settings for each 
option. See <a href="/deployment/deployment-model.html">Deployment Models</a> 
for more details.</li>
-<li>The <a 
href="samza-configurations.html#metrics"><strong>Metrics</strong></a> section 
defines how the Samza application metrics will be monitored and collected. (See 
<a href="../operations/monitoring.html">Monitoring</a>)</li>
+  <li>The <a 
href="samza-configurations.html#application-configurations"><strong>Application</strong></a>
 section defines things like the name of the job, job factory (See the 
job.factory.class property in <a href="samza-configurations.html">Configuration 
Table</a>), the class name for your <a 
href="../api/overview.html">StreamTask</a> and serialization and 
deserialization of specific objects that are received and sent along different 
streams.</li>
+  <li>The <a href="samza-configurations.html#systems-streams"><strong>Systems 
&amp; Streams</strong></a> section defines systems that your StreamTask can 
read from along with the types of serdes used for sending keys and messages 
from that system. You may use any of the <a 
href="../connectors/overview.html">predefined systems</a> that Samza ships 
with, although you can also specify your own self-implemented Samza-compatible 
systems. See the <a href="/startup/hello-samza/latest">hello-samza example 
project</a>’s Wikipedia system for a good example of a self-implemented 
system.</li>
+  <li>The <a 
href="samza-configurations.html#checkpointing"><strong>Checkpointing</strong></a>
 section defines how the message-processing state is saved, which provides 
fault-tolerant processing of streams (See <a 
href="../container/checkpointing.html">Checkpointing</a> for more details).</li>
+  <li>The <a href="samza-configurations.html#state-storage"><strong>State 
Storage</strong></a> section defines the <a 
href="../container/state-management.html">stateful stream processing</a> 
settings for Samza.</li>
+  <li>The <a 
href="samza-configurations.html#deployment"><strong>Deployment</strong></a> 
section defines how the Samza application will be deployed (To a cluster 
manager (YARN), or as a standalone library) as well as settings for each 
option. See <a href="/deployment/deployment-model.html">Deployment Models</a> 
for more details.</li>
+  <li>The <a 
href="samza-configurations.html#metrics"><strong>Metrics</strong></a> section 
defines how the Samza application metrics will be monitored and collected. (See 
<a href="../operations/monitoring.html">Monitoring</a>)</li>
 </ol>
 
-<p>Note that configuration keys prefixed with <code>sensitive.</code> are 
treated specially, in that the values associated with such keys
-will be masked in logs and Samza&rsquo;s YARN ApplicationMaster UI.  This is 
to prevent accidental disclosure only; no
+<p>Note that configuration keys prefixed with <code class="language-plaintext 
highlighter-rouge">sensitive.</code> are treated specially, in that the values 
associated with such keys
+will be masked in logs and Samza’s YARN ApplicationMaster UI.  This is to 
prevent accidental disclosure only; no
 encryption is done.</p>
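The masking behavior for <code>sensitive.</code>-prefixed keys can be illustrated with a hypothetical sketch (this is not Samza's actual logging code; the function name and mask string are assumptions):

```python
def mask_sensitive(config: dict[str, str]) -> dict[str, str]:
    """Return a copy of a config map that is safe to log.

    Hypothetical sketch: values under keys prefixed with "sensitive."
    are replaced with a mask. This hides values in logs and UIs only;
    no encryption is involved.
    """
    return {
        key: "********" if key.startswith("sensitive.") else value
        for key, value in config.items()
    }
```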
 
            

