Modified: samza/site/learn/documentation/latest/yarn/yarn-host-affinity.html
URL: 
http://svn.apache.org/viewvc/samza/site/learn/documentation/latest/yarn/yarn-host-affinity.html?rev=1906774&r1=1906773&r2=1906774&view=diff
==============================================================================
--- samza/site/learn/documentation/latest/yarn/yarn-host-affinity.html 
(original)
+++ samza/site/learn/documentation/latest/yarn/yarn-host-affinity.html Wed Jan 
18 19:33:25 2023
@@ -227,6 +227,12 @@
     
       
         
+      <a class="side-navigation__group-item" data-match-active="" 
href="/releases/1.8.0">1.8.0</a>
+      
+        
+      <a class="side-navigation__group-item" data-match-active="" 
href="/releases/1.7.0">1.7.0</a>
+      
+        
       <a class="side-navigation__group-item" data-match-active="" 
href="/releases/1.6.0">1.6.0</a>
       
         
@@ -538,6 +544,14 @@
               
               
 
+              <li class="hide"><a 
href="/learn/documentation/1.8.0/yarn/yarn-host-affinity">1.8.0</a></li>
+
+              
+
+              <li class="hide"><a 
href="/learn/documentation/1.7.0/yarn/yarn-host-affinity">1.7.0</a></li>
+
+              
+
               <li class="hide"><a 
href="/learn/documentation/1.6.0/yarn/yarn-host-affinity">1.6.0</a></li>
 
               
@@ -644,12 +658,12 @@
 <p>We define a <em>Stateful Samza Job</em> as the Samza job that uses a 
key-value store in its implementation, along with an associated changelog 
stream. In stateful samza jobs, a task may be configured to use multiple 
stores. For each store there is a 1:1 mapping between the task instance and the 
data store. Since the allocation of containers to machines in the Yarn cluster 
is completely left to Yarn, Samza does not guarantee that a container (and 
hence, its associated task(s)) gets deployed on the same machine. Containers 
can get shuffled in any of the following cases:</p>
 
 <ol>
-<li>When a job is upgraded by pointing <code>yarn.package.path</code> to the 
new package path and re-submitted.</li>
-<li>When a job is simply restarted by Yarn or the user</li>
-<li>When a container failure or premption triggers the SamzaAppMaster to 
re-allocate on another available resource</li>
+  <li>When a job is upgraded by pointing <code>yarn.package.path</code> to the new package path and re-submitted.</li>
+  <li>When a job is simply restarted by Yarn or the user.</li>
+  <li>When a container failure or preemption triggers the SamzaAppMaster to re-allocate it on another available resource.</li>
 </ol>
 
-<p>In any of the above cases, the task&rsquo;s co-located data needs to be 
restored every time a container starts-up. Restoring data each time can be 
expensive, especially for applications that have a large data set. This 
behavior slows the start-up time for the job so much that the job is no longer 
&ldquo;near realtime&rdquo;. Furthermore, if multiple stateful samza jobs 
restart around the same time in the cluster and they all share the same 
changelog system, then it is possible to quickly saturate the changelog 
system&rsquo;s network and cause a DDoS.</p>
+<p>In any of the above cases, the task’s co-located data needs to be restored every time a container starts up. Restoring data each time can be expensive, especially for applications that have a large data set. This behavior slows the start-up time for the job so much that the job is no longer “near realtime”. Furthermore, if multiple stateful Samza jobs restart around the same time in the cluster and they all share the same changelog system, then it is possible to quickly saturate the changelog system’s network and cause a DDoS.</p>
 
 <p>For instance, consider a Samza job performing a Stream-Table join. 
Typically, such a job requires the dataset to be available on all processors 
before they begin processing the input stream. The dataset is usually large 
(order &gt; 1TB) read-only data that will be used to join or add attributes to 
incoming messages. The job may initialize this cache by populating it with data 
directly from a remote store or changelog stream. This cache initialization 
happens each time the container is restarted. This causes significant latency 
during job start-up.</p>
 
@@ -657,91 +671,97 @@
 
 <h2 id="how-does-it-work">How does it work?</h2>
 
-<p>When a stateful Samza job is deployed in Yarn, the state stores for the 
tasks are co-located in the current working directory of Yarn&rsquo;s 
application attempt.</p>
+<p>When a stateful Samza job is deployed in Yarn, the state stores for the 
tasks are co-located in the current working directory of Yarn’s application 
attempt.</p>
 
-<figure class="highlight"><pre><code class="language-bash" 
data-lang="bash"><span></span><span 
class="nv">container_working_dir</span><span class="o">=</span><span 
class="si">${</span><span class="nv">yarn</span><span 
class="p">.nodemanager.local-dirs</span><span 
class="si">}</span>/usercache/<span class="si">${</span><span 
class="nv">user</span><span class="si">}</span>/appcache/application_<span 
class="si">${</span><span class="nv">appid</span><span 
class="si">}</span>/container_<span class="si">${</span><span 
class="nv">contid</span><span class="si">}</span>/
+<figure class="highlight"><pre><code class="language-bash" 
data-lang="bash"><span class="nv">container_working_dir</span><span 
class="o">=</span><span class="k">${</span><span class="nv">yarn</span><span 
class="p">.nodemanager.local-dirs</span><span 
class="k">}</span>/usercache/<span class="k">${</span><span 
class="nv">user</span><span class="k">}</span>/appcache/application_<span 
class="k">${</span><span class="nv">appid</span><span 
class="k">}</span>/container_<span class="k">${</span><span 
class="nv">contid</span><span class="k">}</span>/
 
-<span class="c1"># Data Stores</span>
-ls <span class="si">${</span><span 
class="nv">container_working_dir</span><span class="si">}</span>/state/<span 
class="si">${</span><span class="nv">store</span><span 
class="p">-name</span><span class="si">}</span>/<span class="si">${</span><span 
class="nv">task_name</span><span class="si">}</span>/</code></pre></figure>
+<span class="c"># Data Stores</span>
+<span class="nb">ls</span> <span class="k">${</span><span 
class="nv">container_working_dir</span><span class="k">}</span>/state/<span 
class="k">${</span><span class="nv">store</span><span 
class="p">-name</span><span class="k">}</span>/<span class="k">${</span><span 
class="nv">task_name</span><span class="k">}</span>/</code></pre></figure>
 
-<p>This allows the Node Manager&rsquo;s (NM) DeletionService to clean-up the 
working directory once the application completes or fails. In order to re-use 
local state store, the state store needs to be persisted outside the scope of 
NM&rsquo;s deletion service. The cluster administrator should set this location 
as an environment variable in Yarn - <code>LOGGED_STORE_BASE_DIR</code>.</p>
+<p>This allows the Node Manager’s (NM) DeletionService to clean up the working directory once the application completes or fails. To re-use the local state store, it needs to be persisted outside the scope of the NM’s deletion service. The cluster administrator should set this location via the <code>LOGGED_STORE_BASE_DIR</code> environment variable in Yarn.</p>
 
-<p><img src="/img/latest/learn/documentation/yarn/samza-host-affinity.png" 
alt="Yarn host affinity component diagram" style="max-width: 100%; height: 
auto;" onclick="window.open(this.src)"/></p>
+<p><img src="/img/latest/learn/documentation/yarn/samza-host-affinity.png" 
alt="Yarn host affinity component diagram" style="max-width: 100%; height: 
auto;" onclick="window.open(this.src)" /></p>
 
-<p>Each time a task commits, Samza writes the last materialized offset from 
the changelog stream to the checksumed file on disk. This is also done on 
container shutdown. Thus, there is an <em>OFFSET</em> file associated with each 
state stores&rsquo; changelog partitions, that is consumed by the tasks in the 
container.</p>
+<p>Each time a task commits, Samza writes the last materialized offset from the changelog stream to a checksummed file on disk. This is also done on container shutdown. Thus, there is an <em>OFFSET</em> file associated with each state store’s changelog partition that is consumed by the tasks in the container.</p>
 
-<figure class="highlight"><pre><code class="language-bash" 
data-lang="bash"><span></span><span class="si">${</span><span 
class="nv">LOGGED_STORE_BASE_DIR</span><span class="si">}</span>/<span 
class="si">${</span><span class="nv">job</span><span 
class="p">.name</span><span class="si">}</span>-<span class="si">${</span><span 
class="nv">job</span><span class="p">.id</span><span class="si">}</span>/<span 
class="si">${</span><span class="nv">store</span><span 
class="p">.name</span><span class="si">}</span>/<span class="si">${</span><span 
class="nv">task</span><span class="p">.name</span><span 
class="si">}</span>/OFFSET</code></pre></figure>
+<figure class="highlight"><pre><code class="language-bash" 
data-lang="bash"><span class="k">${</span><span 
class="nv">LOGGED_STORE_BASE_DIR</span><span class="k">}</span>/<span 
class="k">${</span><span class="nv">job</span><span class="p">.name</span><span 
class="k">}</span>-<span class="k">${</span><span class="nv">job</span><span 
class="p">.id</span><span class="k">}</span>/<span class="k">${</span><span 
class="nv">store</span><span class="p">.name</span><span 
class="k">}</span>/<span class="k">${</span><span class="nv">task</span><span 
class="p">.name</span><span class="k">}</span>/OFFSET</code></pre></figure>
 
 <p>Now, when a container restarts on the same machine and the OFFSET file exists, the Samza container:</p>
 
 <ol>
-<li>Opens the persisted store on disk</li>
-<li>Reads the OFFSET file</li>
-<li>Restores the state store from the OFFSET value</li>
+  <li>Opens the persisted store on disk</li>
+  <li>Reads the OFFSET file</li>
+  <li>Restores the state store from the OFFSET value</li>
 </ol>
 
-<p>This significantly reduces the state restoration time on container start-up 
as we no longer consume from the beginning of the changelog stream. If the 
OFFSET file doesn&rsquo;t exist, it creates the state store and consumes from 
the oldest offset in the changelog to re-create the state. Since the OFFSET 
file is written on each commit after flushing the store, the recorded offset is 
guaranteed to correspond to the current contents of the store or some older 
point, but never newer. This gives at least once semantics for state restore. 
Therefore, the changelog entries must be idempotent.</p>
+<p>This significantly reduces the state restoration time on container start-up, as we no longer consume from the beginning of the changelog stream. If the OFFSET file doesn’t exist, the container creates the state store and consumes from the oldest offset in the changelog to re-create the state. Since the OFFSET file is written on each commit after flushing the store, the recorded offset is guaranteed to correspond to the current contents of the store or some older point, but never newer. This gives at-least-once semantics for state restore. Therefore, the changelog entries must be idempotent.</p>
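+<p>For illustration, the decision above can be spot-checked on a NodeManager host (a minimal sketch; the job, store, and task names are placeholders): if the OFFSET file below exists, the container restores from the recorded offset; otherwise it falls back to the oldest changelog offset.</p>
+<figure class="highlight"><pre><code class="language-bash" data-lang="bash"># Hypothetical spot-check on a NodeManager host; substitute real job/store/task names
+ls -l "${LOGGED_STORE_BASE_DIR}/${job_name}-${job_id}/${store_name}/${task_name}/OFFSET"</code></pre></figure>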
 
 <p>It is necessary to periodically clean-up unused or orphaned state stores on 
the machines to manage disk-space. This feature is being worked on in <a 
href="https://issues.apache.org/jira/browse/SAMZA-656";>SAMZA-656</a>.</p>
 
-<p>In order to re-use local state, Samza has to sucessfully claim the specific 
hosts from the Resource Manager (RM). To support this, the Samza containers 
write their locality information to the <a 
href="../container/coordinator-stream.html">Coordinator Stream</a> every time 
they start-up successfully. Now, the Samza Application Master (AM) can identify 
the last known host of a container via the <a 
href="../container/coordinator-stream.html">Job Coordinator</a>(JC) and the 
application is no longer agnostic of the container locality. On a container 
failure (due to any of the above cited reasons), the AM includes the hostname 
of the expected resource in the <a 
href="https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ResourceRequest.java#L239%5D";>ResourceRequest</a>.</p>
+<p>In order to re-use local state, Samza has to successfully claim the specific hosts from the Resource Manager (RM). To support this, the Samza containers write their locality information to the <a href="../container/coordinator-stream.html">Coordinator Stream</a> every time they start up successfully. The Samza Application Master (AM) can then identify the last known host of a container via the <a href="../container/coordinator-stream.html">Job Coordinator</a> (JC), so the application is no longer agnostic of container locality. On a container failure (due to any of the reasons cited above), the AM includes the hostname of the expected resource in the <a href="https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ResourceRequest.java#L239";>ResourceRequest</a>.</p>
 
-<p>Note that the Yarn cluster has to be configured to use <a 
href="https://hadoop.apache.org/docs/r2.6.0/hadoop-yarn/hadoop-yarn-site/FairScheduler.html";>Fair
 Scheduler</a> with continuous-scheduling enabled. With continuous scheduling, 
the scheduler continuously iterates through all nodes in the cluster, instead 
of relying on the nodes&rsquo; heartbeat, and schedules work based on 
previously known status for each node, before relaxing locality. Hence, the 
scheduler takes care of relaxing locality after the configured delay. This 
approach can be considered as a &ldquo;<em>best-effort stickiness</em>&rdquo; 
policy because it is possible that the requested node is not running or does 
not have sufficient resources at the time of request (even though the state in 
the data stores may be persisted). For more details on the choice of Fair 
Scheduler, please refer the <a 
href="https://issues.apache.org/jira/secure/attachment/12726945/DESIGN-SAMZA-617-2.pdf";>design
 doc</a>.</p>
+<p>Note that the Yarn cluster has to be configured to use the <a href="https://hadoop.apache.org/docs/r2.6.0/hadoop-yarn/hadoop-yarn-site/FairScheduler.html";>Fair Scheduler</a> with continuous scheduling enabled. With continuous scheduling, the scheduler continuously iterates through all nodes in the cluster, instead of relying on the nodes’ heartbeat, and schedules work based on the previously known status for each node before relaxing locality. Hence, the scheduler takes care of relaxing locality after the configured delay. This approach can be considered a “<em>best-effort stickiness</em>” policy because it is possible that the requested node is not running or does not have sufficient resources at the time of the request (even though the state in the data stores may be persisted). For more details on the choice of Fair Scheduler, please refer to the <a href="https://issues.apache.org/jira/secure/attachment/12726945/DESIGN-SAMZA-617-2.pdf";>design doc</a>.</p>
 
 <h2 id="configuring-yarn-cluster-to-support-host-affinity">Configuring YARN 
cluster to support Host Affinity</h2>
 
 <ol>
-<li>Enable local state re-use by setting the 
<code>LOGGED_STORE_BASE_DIR</code> environment variable in yarn-env.sh 
-<figure class="highlight"><pre><code class="language-bash" 
data-lang="bash"><span></span><span class="nb">export</span> <span 
class="nv">LOGGED<em>STORE</em>BASE_DIR</span><span 
class="o">=</span>&lt;path-for-state-stores&gt;</code></pre></figure>
-Without this configuration, the state stores are not persisted upon a 
container shutdown. This will effectively mean you will not re-use local state 
and hence, host-affinity becomes a moot operation.</li>
-<li><p>Configure Yarn to use Fair Scheduler and enable continuous-scheduling 
in yarn-site.xml 
-<figure class="highlight"><pre><code class="language-xml" 
data-lang="xml"><span></span><span class="nt">&lt;property&gt;</span>
-<span class="nt">&lt;name&gt;</span>yarn.resourcemanager.scheduler.class<span 
class="nt">&lt;/name&gt;</span>
-<span class="nt">&lt;description&gt;</span>The class to use as the resource 
scheduler.<span class="nt">&lt;/description&gt;</span>
-<span 
class="nt">&lt;value&gt;</span>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler<span
 class="nt">&lt;/value&gt;</span>
+  <li>Enable local state re-use by setting the <code>LOGGED_STORE_BASE_DIR</code> environment variable in yarn-env.sh</li>
+</ol>
+<figure class="highlight"><pre><code class="language-bash" 
data-lang="bash"><span class="nb">export </span><span 
class="nv">LOGGED_STORE_BASE_DIR</span><span 
class="o">=</span>&lt;path-for-state-stores&gt;</code></pre></figure>
+<p>Without this configuration, the state stores are not persisted upon container shutdown. This effectively means local state is not re-used, and host-affinity becomes moot.</p>
+<ol start="2">
+  <li>Configure Yarn to use Fair Scheduler and enable continuous-scheduling in 
yarn-site.xml</li>
+</ol>
+<figure class="highlight"><pre><code class="language-xml" 
data-lang="xml"><span class="nt">&lt;property&gt;</span>
+    <span 
class="nt">&lt;name&gt;</span>yarn.resourcemanager.scheduler.class<span 
class="nt">&lt;/name&gt;</span>
+    <span class="nt">&lt;description&gt;</span>The class to use as the 
resource scheduler.<span class="nt">&lt;/description&gt;</span>
+    <span 
class="nt">&lt;value&gt;</span>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler<span
 class="nt">&lt;/value&gt;</span>
 <span class="nt">&lt;/property&gt;</span>
 <span class="nt">&lt;property&gt;</span>
-<span 
class="nt">&lt;name&gt;</span>yarn.scheduler.fair.continuous-scheduling-enabled<span
 class="nt">&lt;/name&gt;</span>
-<span class="nt">&lt;description&gt;</span>Enable Continuous Scheduling of 
Resource Requests<span class="nt">&lt;/description&gt;</span>
-<span class="nt">&lt;value&gt;</span>true<span class="nt">&lt;/value&gt;</span>
+    <span 
class="nt">&lt;name&gt;</span>yarn.scheduler.fair.continuous-scheduling-enabled<span
 class="nt">&lt;/name&gt;</span>
+    <span class="nt">&lt;description&gt;</span>Enable Continuous Scheduling of 
Resource Requests<span class="nt">&lt;/description&gt;</span>
+    <span class="nt">&lt;value&gt;</span>true<span 
class="nt">&lt;/value&gt;</span>
 <span class="nt">&lt;/property&gt;</span>
 <span class="nt">&lt;property&gt;</span>
-<span 
class="nt">&lt;name&gt;</span>yarn.scheduler.fair.locality-delay-node-ms<span 
class="nt">&lt;/name&gt;</span>
-<span class="nt">&lt;description&gt;</span>Delay time in milliseconds before 
relaxing locality at node-level<span class="nt">&lt;/description&gt;</span>
-<span class="nt">&lt;value&gt;</span>1000<span 
class="nt">&lt;/value&gt;</span>  <span class="c">&lt;!-- Should be tuned per 
requirement --&gt;</span>
+    <span 
class="nt">&lt;name&gt;</span>yarn.scheduler.fair.locality-delay-node-ms<span 
class="nt">&lt;/name&gt;</span>
+    <span class="nt">&lt;description&gt;</span>Delay time in milliseconds 
before relaxing locality at node-level<span 
class="nt">&lt;/description&gt;</span>
+    <span class="nt">&lt;value&gt;</span>1000<span 
class="nt">&lt;/value&gt;</span>  <span class="c">&lt;!-- Should be tuned per 
requirement --&gt;</span>
 <span class="nt">&lt;/property&gt;</span>
 <span class="nt">&lt;property&gt;</span>
-<span 
class="nt">&lt;name&gt;</span>yarn.scheduler.fair.locality-delay-rack-ms<span 
class="nt">&lt;/name&gt;</span>
-<span class="nt">&lt;description&gt;</span>Delay time in milliseconds before 
relaxing locality at rack-level<span class="nt">&lt;/description&gt;</span>
-<span class="nt">&lt;value&gt;</span>1000<span 
class="nt">&lt;/value&gt;</span> <span class="c">&lt;!-- Should be tuned per 
requirement --&gt;</span>
-<span class="nt">&lt;/property&gt;</span></code></pre></figure></p></li>
-<li><p>Configure Yarn Node Manager SIGTERM to SIGKILL timeout to be reasonable 
time s.t. Node Manager will give Samza Container enough time to perform a clean 
shutdown in yarn-site.xml 
-<figure class="highlight"><pre><code class="language-xml" 
data-lang="xml"><span></span><span class="nt">&lt;property&gt;</span>
-<span 
class="nt">&lt;name&gt;</span>yarn.nodemanager.sleep-delay-before-sigkill.ms<span
 class="nt">&lt;/name&gt;</span>
-<span class="nt">&lt;description&gt;</span>No. of ms to wait between sending a 
SIGTERM and SIGKILL to a container<span class="nt">&lt;/description&gt;</span>
-<span class="nt">&lt;value&gt;</span>600000<span 
class="nt">&lt;/value&gt;</span> <span class="c">&lt;!-- Set it to 10min to 
allow enough time for clean shutdown of containers --&gt;</span>
-<span class="nt">&lt;/property&gt;</span></code></pre></figure></p></li>
-<li><p>The Yarn <a 
href="https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/RackAwareness.html";>Rack
 Awareness</a> feature is not required and does not change the behavior of 
Samza Host Affinity. However, if Rack Awareness is configured in the cluster, 
make sure the DNSToSwitchMapping implementation is robust. Any failures could 
cause container requests to fall back to the defaultRack. This will cause 
ContainerRequests to not match the preferred host, which will degrade Host 
Affinity. For details, see <a 
href="https://issues.apache.org/jira/browse/SAMZA-886";>SAMZA-866</a></p></li>
+    <span 
class="nt">&lt;name&gt;</span>yarn.scheduler.fair.locality-delay-rack-ms<span 
class="nt">&lt;/name&gt;</span>
+    <span class="nt">&lt;description&gt;</span>Delay time in milliseconds 
before relaxing locality at rack-level<span 
class="nt">&lt;/description&gt;</span>
+    <span class="nt">&lt;value&gt;</span>1000<span 
class="nt">&lt;/value&gt;</span> <span class="c">&lt;!-- Should be tuned per 
requirement --&gt;</span>
+<span class="nt">&lt;/property&gt;</span></code></pre></figure>
+
+<ol start="3">
+  <li>Configure the Yarn Node Manager SIGTERM-to-SIGKILL timeout in yarn-site.xml so that the Node Manager gives the Samza container enough time to perform a clean shutdown</li>
 </ol>
+<figure class="highlight"><pre><code class="language-xml" 
data-lang="xml"><span class="nt">&lt;property&gt;</span>
+    <span 
class="nt">&lt;name&gt;</span>yarn.nodemanager.sleep-delay-before-sigkill.ms<span
 class="nt">&lt;/name&gt;</span>
+    <span class="nt">&lt;description&gt;</span>No. of ms to wait between 
sending a SIGTERM and SIGKILL to a container<span 
class="nt">&lt;/description&gt;</span>
+    <span class="nt">&lt;value&gt;</span>600000<span 
class="nt">&lt;/value&gt;</span> <span class="c">&lt;!-- Set it to 10min to 
allow enough time for clean shutdown of containers --&gt;</span>
+<span class="nt">&lt;/property&gt;</span></code></pre></figure>
 
-<h2 id="configuring-a-samza-job-to-use-host-affinity">Configuring a Samza job 
to use Host Affinity</h2>
+<ol start="4">
+  <li>The Yarn <a href="https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/RackAwareness.html";>Rack Awareness</a> feature is not required and does not change the behavior of Samza Host Affinity. However, if Rack Awareness is configured in the cluster, make sure the DNSToSwitchMapping implementation is robust. Any failures could cause container requests to fall back to the defaultRack. This will cause ContainerRequests to not match the preferred host, which will degrade Host Affinity. For details, see <a href="https://issues.apache.org/jira/browse/SAMZA-886";>SAMZA-886</a>.</li>
+</ol>
 
+<h2 id="configuring-a-samza-job-to-use-host-affinity">Configuring a Samza job 
to use Host Affinity</h2>
 <p>Any stateful Samza job can leverage this feature to reduce the Mean Time To 
Restore (MTTR) of its state stores by setting 
<code>yarn.samza.host-affinity.enabled</code> to true.</p>
 
-<figure class="highlight"><pre><code class="language-bash" 
data-lang="bash"><span></span>yarn.samza.host-affinity.enabled<span 
class="o">=</span><span class="nb">true</span>  <span class="c1"># Default: 
false</span></code></pre></figure>
+<figure class="highlight"><pre><code class="language-bash" 
data-lang="bash">yarn.samza.host-affinity.enabled<span class="o">=</span><span 
class="nb">true</span>  <span class="c"># Default: 
false</span></code></pre></figure>
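+<p>For context, here is a minimal sketch of a job configuration that pairs a changelog-backed store with this flag; the store name and changelog stream below are placeholders, and the RocksDB factory is just one possible store implementation.</p>
+<figure class="highlight"><pre><code class="language-properties" data-lang="properties"># Sketch: host affinity pays off when the job has local, changelog-backed state
+stores.my-store.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
+stores.my-store.changelog=kafka.my-store-changelog
+yarn.samza.host-affinity.enabled=true</code></pre></figure>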
 
 <p>Enabling this feature for a stateless Samza job should not have any adverse 
effect on the job.</p>
 
 <h2 id="host-affinity-guarantees">Host-affinity Guarantees</h2>
-
 <p>As you have observed, host-affinity cannot be guaranteed all the time due to variable load distribution in the Yarn cluster. Hence, this is a best-effort policy that Samza provides. However, certain scenarios are worth calling out where these guarantees may be hard to achieve or are not applicable.</p>
 
 <ol>
-<li><em>When the number of containers and/or container-task assignment changes 
across successive application runs</em> - We may be able to re-use local state 
for a subset of partitions. Currently, there is no logic in the Job Coordinator 
to handle partitioning of tasks among containers intelligently. Handling this 
is more involved as relates to <a 
href="https://issues.apache.org/jira/browse/SAMZA-336";>auto-scaling</a> of the 
containers. However, with <a 
href="https://issues.apache.org/jira/browse/SAMZA-906";>task-container 
mapping</a>, this will work better for typical container count adjustments.</li>
-<li><em>When SystemStreamPartitionGrouper changes across successive 
application runs</em> - When the grouper logic used to distribute the 
partitions across containers changes, the data in the Coordinator Stream (for 
changelog-task partition assignment etc) and the data stores becomes invalid. 
Thus, to be safe, we should flush out all state-related data from the 
Coordinator Stream. An alternative is to overwrite the Task-ChangelogPartition 
assignment message and the Container Locality message in the Coordinator 
Stream, before starting up the job again.</li>
+  <li><em>When the number of containers and/or container-task assignment changes across successive application runs</em> - We may be able to re-use local state for a subset of partitions. Currently, there is no logic in the Job Coordinator to handle partitioning of tasks among containers intelligently. Handling this is more involved, as it relates to <a href="https://issues.apache.org/jira/browse/SAMZA-336";>auto-scaling</a> of the containers. However, with <a href="https://issues.apache.org/jira/browse/SAMZA-906";>task-container mapping</a>, this will work better for typical container count adjustments.</li>
+  <li><em>When the SystemStreamPartitionGrouper changes across successive application runs</em> - When the grouper logic used to distribute the partitions across containers changes, the data in the Coordinator Stream (for changelog-task partition assignment, etc.) and the data stores become invalid. Thus, to be safe, we should flush out all state-related data from the Coordinator Stream. An alternative is to overwrite the Task-ChangelogPartition assignment message and the Container Locality message in the Coordinator Stream before starting up the job again.</li>
 </ol>
 
-<h2 id="resource-localization"><a 
href="../yarn/yarn-resource-localization.html">Resource Localization 
&raquo;</a></h2>
+<h2 id="resource-localization-"><a 
href="../yarn/yarn-resource-localization.html">Resource Localization »</a></h2>
 
            
         </div>

Modified: 
samza/site/learn/documentation/latest/yarn/yarn-resource-localization.html
URL: 
http://svn.apache.org/viewvc/samza/site/learn/documentation/latest/yarn/yarn-resource-localization.html?rev=1906774&r1=1906773&r2=1906774&view=diff
==============================================================================
--- samza/site/learn/documentation/latest/yarn/yarn-resource-localization.html 
(original)
+++ samza/site/learn/documentation/latest/yarn/yarn-resource-localization.html 
Wed Jan 18 19:33:25 2023
@@ -227,6 +227,12 @@
     
       
         
+      <a class="side-navigation__group-item" data-match-active="" 
href="/releases/1.8.0">1.8.0</a>
+      
+        
+      <a class="side-navigation__group-item" data-match-active="" 
href="/releases/1.7.0">1.7.0</a>
+      
+        
       <a class="side-navigation__group-item" data-match-active="" 
href="/releases/1.6.0">1.6.0</a>
       
         
@@ -538,6 +544,14 @@
               
               
 
+              <li class="hide"><a 
href="/learn/documentation/1.8.0/yarn/yarn-resource-localization">1.8.0</a></li>
+
+              
+
+              <li class="hide"><a 
href="/learn/documentation/1.7.0/yarn/yarn-resource-localization">1.7.0</a></li>
+
+              
+
               <li class="hide"><a 
href="/learn/documentation/1.6.0/yarn/yarn-resource-localization">1.6.0</a></li>
 
               
@@ -638,80 +652,75 @@
    See the License for the specific language governing permissions and
    limitations under the License.
 -->
-
 <p>When running Samza jobs on YARN clusters, you may need to download some resources before startup (for example, downloading the job binaries or fetching certificate files). This step is called Resource Localization.</p>
 
 <h3 id="resource-localization-process">Resource Localization Process</h3>
 
-<p>For Samza jobs running on YARN, resource localization leverages the YARN 
node manager&rsquo;s localization service. Here is a <a 
href="https://hortonworks.com/blog/resource-localization-in-yarn-deep-dive/";>deep
 dive</a> on how localization works in YARN. </p>
+<p>For Samza jobs running on YARN, resource localization leverages the YARN 
node manager’s localization service. Here is a <a 
href="https://hortonworks.com/blog/resource-localization-in-yarn-deep-dive/";>deep
 dive</a> on how localization works in YARN.</p>
 
-<p>Depending on where and how the resource comes from, fetching the resource 
is associated with a scheme in the path (such as <code>http</code>, 
<code>https</code>, <code>hdfs</code>, <code>ftp</code>, <code>file</code>, 
etc). The scheme maps to a corresponding <code>FileSystem</code> implementation 
for handling the localization. </p>
+<p>Depending on where and how the resource comes from, fetching the resource 
is associated with a scheme in the path (such as <code 
class="language-plaintext highlighter-rouge">http</code>, <code 
class="language-plaintext highlighter-rouge">https</code>, <code 
class="language-plaintext highlighter-rouge">hdfs</code>, <code 
class="language-plaintext highlighter-rouge">ftp</code>, <code 
class="language-plaintext highlighter-rouge">file</code>, etc). The scheme maps 
to a corresponding <code class="language-plaintext 
highlighter-rouge">FileSystem</code> implementation for handling the 
localization.</p>
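+<p>For instance (a sketch only; the resource name, host, and path are hypothetical), a resource hosted on HDFS would be declared with the hdfs scheme, which maps to the DFS FileSystem implementation listed below:</p>
+<figure class="highlight"><pre><code class="language-properties" data-lang="properties"># Hypothetical resource path; the hdfs:// scheme selects the corresponding FileSystem
+yarn.resources.myResource.path=hdfs://namenode.example.com:8020/packages/myResource</code></pre></figure>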
 
-<p>There are some predefined <code>FileSystem</code> implementations in Hadoop 
and Samza, which are provided if you run Samza jobs on YARN:</p>
+<p>There are some predefined <code class="language-plaintext 
highlighter-rouge">FileSystem</code> implementations in Hadoop and Samza, which 
are provided if you run Samza jobs on YARN:</p>
 
 <ul>
-<li><code>org.apache.samza.util.hadoop.HttpFileSystem</code>: used for 
fetching resources based on http or https without client side 
authentication.</li>
-<li><code>org.apache.hadoop.hdfs.DistributedFileSystem</code>: used for 
fetching resource from DFS system on Hadoop.</li>
-<li><code>org.apache.hadoop.fs.LocalFileSystem</code>: used for copying 
resources from local file system to the job directory.</li>
-<li><code>org.apache.hadoop.fs.ftp.FTPFileSystem</code>: used for fetching 
resources based on ftp.</li>
+  <li><code class="language-plaintext 
highlighter-rouge">org.apache.samza.util.hadoop.HttpFileSystem</code>: used for 
fetching resources based on http or https without client side 
authentication.</li>
+  <li><code class="language-plaintext 
highlighter-rouge">org.apache.hadoop.hdfs.DistributedFileSystem</code>: used 
for fetching resource from DFS system on Hadoop.</li>
+  <li><code class="language-plaintext 
highlighter-rouge">org.apache.hadoop.fs.LocalFileSystem</code>: used for 
copying resources from local file system to the job directory.</li>
+  <li><code class="language-plaintext 
highlighter-rouge">org.apache.hadoop.fs.ftp.FTPFileSystem</code>: used for 
fetching resources based on ftp.</li>
 </ul>
 
-<p>You can create your own file system implementation by creating a class 
which extends from <code>org.apache.hadoop.fs.FileSystem</code>. </p>
+<p>You can create your own file system implementation by creating a class that extends <code class="language-plaintext highlighter-rouge">org.apache.hadoop.fs.FileSystem</code>.</p>
 
 <h3 id="resource-configuration">Resource Configuration</h3>
-
 <p>You can specify a resource to be localized by the following 
configuration.</p>
 
 <h4 id="required-configuration">Required Configuration</h4>
-
 <ol>
-<li><code>yarn.resources.&lt;resourceName&gt;.path</code>
-
-<ul>
-<li>The path for fetching the resource for localization, e.g. 
http://hostname.com/packages/myResource</li>
-</ul></li>
+  <li><code class="language-plaintext 
highlighter-rouge">yarn.resources.&lt;resourceName&gt;.path</code>
+    <ul>
+      <li>The path for fetching the resource for localization, e.g. 
http://hostname.com/packages/myResource</li>
+    </ul>
+  </li>
 </ol>
 
 <h4 id="optional-configuration">Optional Configuration</h4>
-
 <ol>
-<li><code>yarn.resources.&lt;resourceName&gt;.local.name</code>
-
-<ul>
-<li>The local name used for the localized resource.</li>
-<li>If it is not set, the default will be the 
<code>&lt;resourceName&gt;</code> specified in 
<code>yarn.resources.&lt;resourceName&gt;.path</code></li>
-</ul></li>
-<li><code>yarn.resources.&lt;resourceName&gt;.local.type</code>
-
-<ul>
-<li>The type of the resource with valid values from: <code>ARCHIVE</code>, 
<code>FILE</code>, <code>PATTERN</code>.
-
-<ul>
-<li>ARCHIVE: the localized resource will be an archived directory;</li>
-<li>FILE: the localized resource will be a file;</li>
-<li>PATTERN: the localized resource will be the entries extracted from the 
archive with the pattern.</li>
-</ul></li>
-<li>If it is not set, the default value is <code>FILE</code>.</li>
-</ul></li>
-<li><code>yarn.resources.&lt;resourceName&gt;.local.visibility</code>
-
-<ul>
-<li>Visibility for the resource with valid values from <code>PUBLIC</code>, 
<code>PRIVATE</code>, <code>APPLICATION</code>
-
-<ul>
-<li>PUBLIC: visible to everyone </li>
-<li>PRIVATE: visible to just the account which runs the job</li>
-<li>APPLICATION: visible only to the specific application job which has the 
resource configuration</li>
-</ul></li>
-<li>If it is not set, the default value is <code>APPLICATION</code></li>
-</ul></li>
+  <li><code class="language-plaintext 
highlighter-rouge">yarn.resources.&lt;resourceName&gt;.local.name</code>
+    <ul>
+      <li>The local name used for the localized resource.</li>
+      <li>If it is not set, the default will be the <code 
class="language-plaintext highlighter-rouge">&lt;resourceName&gt;</code> 
specified in <code class="language-plaintext 
highlighter-rouge">yarn.resources.&lt;resourceName&gt;.path</code></li>
+    </ul>
+  </li>
+  <li><code class="language-plaintext 
highlighter-rouge">yarn.resources.&lt;resourceName&gt;.local.type</code>
+    <ul>
+      <li>The type of the resource with valid values from: <code 
class="language-plaintext highlighter-rouge">ARCHIVE</code>, <code 
class="language-plaintext highlighter-rouge">FILE</code>, <code 
class="language-plaintext highlighter-rouge">PATTERN</code>.
+        <ul>
+          <li>ARCHIVE: the localized resource will be an archived 
directory;</li>
+          <li>FILE: the localized resource will be a file;</li>
+          <li>PATTERN: the localized resource will be the entries extracted 
from the archive with the pattern.</li>
+        </ul>
+      </li>
+      <li>If it is not set, the default value is <code 
class="language-plaintext highlighter-rouge">FILE</code>.</li>
+    </ul>
+  </li>
+  <li><code class="language-plaintext 
highlighter-rouge">yarn.resources.&lt;resourceName&gt;.local.visibility</code>
+    <ul>
+      <li>Visibility for the resource with valid values from <code 
class="language-plaintext highlighter-rouge">PUBLIC</code>, <code 
class="language-plaintext highlighter-rouge">PRIVATE</code>, <code 
class="language-plaintext highlighter-rouge">APPLICATION</code>
+        <ul>
+          <li>PUBLIC: visible to everyone</li>
+          <li>PRIVATE: visible to just the account which runs the job</li>
+          <li>APPLICATION: visible only to the specific application job which 
has the resource configuration</li>
+        </ul>
+      </li>
+      <li>If it is not set, the default value is <code 
class="language-plaintext highlighter-rouge">APPLICATION</code></li>
+    </ul>
+  </li>
 </ol>
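+<p>Putting the required and optional settings together, a hypothetical resource declaration could look like the following in the job properties; the resource name, URL, and local name are illustrative only.</p>
+<figure class="highlight"><pre><code class="language-properties" data-lang="properties"># Hypothetical example: fetch an archive over https and localize it under the name "certs"
+yarn.resources.myCerts.path=https://hostname.example.com/packages/certs.tar.gz
+yarn.resources.myCerts.local.name=certs
+yarn.resources.myCerts.local.type=ARCHIVE
+yarn.resources.myCerts.local.visibility=APPLICATION</code></pre></figure>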
 
 <h3 id="yarn-configuration">YARN Configuration</h3>
+<p>Make sure the scheme used in the <code class="language-plaintext 
highlighter-rouge">yarn.resources.&lt;resourceName&gt;.path</code> is 
configured with a corresponding FileSystem implementation in YARN 
core-site.xml.</p>
 
-<p>Make sure the scheme used in the 
<code>yarn.resources.&lt;resourceName&gt;.path</code> is configured with a 
corresponding FileSystem implementation in YARN core-site.xml.</p>
-
-<figure class="highlight"><pre><code class="language-xml" 
data-lang="xml"><span></span><span class="cp">&lt;?xml-stylesheet 
type=&quot;text/xsl&quot; href=&quot;configuration.xsl&quot;?&gt;</span>
+<figure class="highlight"><pre><code class="language-xml" 
data-lang="xml"><span class="cp">&lt;?xml-stylesheet type="text/xsl" 
href="configuration.xsl"?&gt;</span>
 <span class="nt">&lt;configuration&gt;</span>
     <span class="nt">&lt;property&gt;</span>
       <span class="nt">&lt;name&gt;</span>fs.http.impl<span 
class="nt">&lt;/name&gt;</span>
@@ -719,9 +728,9 @@
     <span class="nt">&lt;/property&gt;</span>
 <span class="nt">&lt;/configuration&gt;</span></code></pre></figure>
 
-<p>If you are using your own scheme (for example, 
<code>yarn.resources.myResource.path=myScheme://host.com/test</code>), you can 
link your <a 
href="https://hadoop.apache.org/docs/stable/api/index.html?org/apache/hadoop/fs/FileSystem.html";>FileSystem</a>
 implementation with it as follows.</p>
+<p>If you are using your own scheme (for example, <code 
class="language-plaintext 
highlighter-rouge">yarn.resources.myResource.path=myScheme://host.com/test</code>),
 you can link your <a 
href="https://hadoop.apache.org/docs/stable/api/index.html?org/apache/hadoop/fs/FileSystem.html";>FileSystem</a>
 implementation with it as follows.</p>
 
-<figure class="highlight"><pre><code class="language-xml" 
data-lang="xml"><span></span><span class="cp">&lt;?xml-stylesheet 
type=&quot;text/xsl&quot; href=&quot;configuration.xsl&quot;?&gt;</span>
+<figure class="highlight"><pre><code class="language-xml" 
data-lang="xml"><span class="cp">&lt;?xml-stylesheet type="text/xsl" 
href="configuration.xsl"?&gt;</span>
 <span class="nt">&lt;configuration&gt;</span>
     <span class="nt">&lt;property&gt;</span>
       <span class="nt">&lt;name&gt;</span>fs.myScheme.impl<span 
class="nt">&lt;/name&gt;</span>
@@ -729,7 +738,7 @@
     <span class="nt">&lt;/property&gt;</span>
 <span class="nt">&lt;/configuration&gt;</span></code></pre></figure>
 
-<h2 id="yarn-security"><a href="../yarn/yarn-security.html">Yarn Security 
&raquo;</a></h2>
+<h2 id="yarn-security-"><a href="../yarn/yarn-security.html">Yarn Security 
»</a></h2>
 
            
         </div>

Modified: samza/site/learn/documentation/latest/yarn/yarn-security.html
URL: 
http://svn.apache.org/viewvc/samza/site/learn/documentation/latest/yarn/yarn-security.html?rev=1906774&r1=1906773&r2=1906774&view=diff
==============================================================================
--- samza/site/learn/documentation/latest/yarn/yarn-security.html (original)
+++ samza/site/learn/documentation/latest/yarn/yarn-security.html Wed Jan 18 
19:33:25 2023
@@ -227,6 +227,12 @@
     
       
         
+      <a class="side-navigation__group-item" data-match-active="" 
href="/releases/1.8.0">1.8.0</a>
+      
+        
+      <a class="side-navigation__group-item" data-match-active="" 
href="/releases/1.7.0">1.7.0</a>
+      
+        
       <a class="side-navigation__group-item" data-match-active="" 
href="/releases/1.6.0">1.6.0</a>
       
         
@@ -538,6 +544,14 @@
               
               
 
+              <li class="hide"><a 
href="/learn/documentation/1.8.0/yarn/yarn-security">1.8.0</a></li>
+
+              
+
+              <li class="hide"><a 
href="/learn/documentation/1.7.0/yarn/yarn-security">1.7.0</a></li>
+
+              
+
               <li class="hide"><a 
href="/learn/documentation/1.6.0/yarn/yarn-security">1.6.0</a></li>
 
               
@@ -646,59 +660,86 @@
 <p>One of the challenges for long-lived application running on a secure YARN 
cluster is its token renewal strategy. Samza takes the following approach to 
manage token creation and renewal.</p>
 
 <ol>
-<li><p>Client running Samza app needs to kinit into KDC with his credentials 
and add the HDFS delegation tokens to the launcher context before submitting 
the application.</p></li>
-<li><p>Next, client prepares the local resources for the application as 
follows.
+  <li>
+    <p>The client running the Samza app needs to kinit into the KDC with its credentials and add the HDFS delegation tokens to the launcher context before submitting the application.</p>
+  </li>
+  <li>
+    <p>Next, the client prepares the local resources for the application as follows.
 2.1. First, it creates a staging directory on HDFS. This directory is only 
accessible by the running user and used to store resources required for 
Application Master (AM) and Containers.
 2.2. Client then adds the keytab as a local resource in the container launcher 
context for AM.
-2.3. Finally, it sends the corresponding principal and the path to the keytab 
file in the staging directory to the coordinator stream. Samza currently uses 
the staging directory to store both the keytab and refreshed tokens because the 
access to the directory is secured via Kerberos.</p></li>
-<li><p>Once the resource is allocated for the Application Master, the Node 
Manager will localizes app resources from HDFS using the HDFS delegation tokens 
in the launcher context. Same rule applies to Container localization too. 
</p></li>
-<li><p>When Application Master starts, it localizes the keytab file into its 
working directory and reads the principal from the coordinator stream.</p></li>
-<li><p>The Application Master periodically re-authenticate itself with the 
given principal and keytab. In each iteration, it creates new delegation tokens 
and stores them in the given job specific staging directory on HDFS.</p></li>
-<li><p>Each running container will get new delegation tokens from the 
credentials file on HDFS before the current ones expire.</p></li>
-<li><p>Application Master and Containers don&rsquo;t communicate with each 
other for that matter. Each side proceeds independently by reading or writing 
the tokens on HDFS.</p></li>
+2.3. Finally, it sends the corresponding principal and the path to the keytab 
file in the staging directory to the coordinator stream. Samza currently uses 
the staging directory to store both the keytab and refreshed tokens because the 
access to the directory is secured via Kerberos.</p>
+  </li>
+  <li>
+    <p>Once the resource is allocated for the Application Master, the Node Manager localizes the app resources from HDFS using the HDFS delegation tokens in the launcher context. The same rule applies to container localization.</p>
+  </li>
+  <li>
+    <p>When Application Master starts, it localizes the keytab file into its 
working directory and reads the principal from the coordinator stream.</p>
+  </li>
+  <li>
+    <p>The Application Master periodically re-authenticates itself with the given principal and keytab. In each iteration, it creates new delegation tokens and stores them in the given job-specific staging directory on HDFS.</p>
+  </li>
+  <li>
+    <p>Each running container will get new delegation tokens from the 
credentials file on HDFS before the current ones expire.</p>
+  </li>
+  <li>
+    <p>The Application Master and the containers don’t communicate with each other for this purpose. Each side proceeds independently by reading or writing the tokens on HDFS.</p>
+  </li>
 </ol>
 
-<p>By default, any HDFS delegation token has a maximum life of 7 days 
(configured by <code>dfs.namenode.delegation.token.max-lifetime</code> in 
hdfs-site.xml) and the token is normally renewed every 24 hours (configured by 
<code>dfs.namenode.delegation.token.renew-interval</code> in hdfs-site.xml). 
What if the Application Master dies and needs restarts after 7 days? The 
original HDFS delegation token stored in the launcher context will be invalid 
no matter what. Luckily, Samza can rely on Resource Manager to handle this 
scenario. See the Configuration section below for details.  </p>
+<p>By default, any HDFS delegation token has a maximum life of 7 days (configured by <code class="language-plaintext highlighter-rouge">dfs.namenode.delegation.token.max-lifetime</code> in hdfs-site.xml) and the token is normally renewed every 24 hours (configured by <code class="language-plaintext highlighter-rouge">dfs.namenode.delegation.token.renew-interval</code> in hdfs-site.xml). What if the Application Master dies and needs to restart after 7 days? The original HDFS delegation token stored in the launcher context will be invalid no matter what. Luckily, Samza can rely on the Resource Manager to handle this scenario. See the Configuration section below for details.</p>
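+<p>For reference, the two HDFS settings mentioned above live in hdfs-site.xml; the values shown here are simply the defaults described in the previous paragraph, expressed in milliseconds, and cluster administrators may tune them differently.</p>
+<figure class="highlight"><pre><code class="language-xml" data-lang="xml">&lt;property&gt;
+    &lt;name&gt;dfs.namenode.delegation.token.max-lifetime&lt;/name&gt;
+    &lt;value&gt;604800000&lt;/value&gt;  &lt;!-- 7 days, in milliseconds --&gt;
+&lt;/property&gt;
+&lt;property&gt;
+    &lt;name&gt;dfs.namenode.delegation.token.renew-interval&lt;/name&gt;
+    &lt;value&gt;86400000&lt;/value&gt;  &lt;!-- 24 hours, in milliseconds --&gt;
+&lt;/property&gt;</code></pre></figure>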
 
 <h3 id="components">Components</h3>
 
 <h4 id="securitymanager">SecurityManager</h4>
 
-<p>When ApplicationMaster starts, it spawns 
<code>SamzaAppMasterSecurityManager</code>, which runs on its separate thread. 
The <code>SamzaAppMasterSecurityManager</code> is responsible for periodically 
logging in through the given Kerberos keytab and regenerates the HDFS 
delegation tokens regularly. After each run, it writes new tokens on a 
pre-defined job specific directory on HDFS. The frequency of this process is 
determined by <code>yarn.token.renewal.interval.seconds</code>.</p>
+<p>When the ApplicationMaster starts, it spawns a <code class="language-plaintext highlighter-rouge">SamzaAppMasterSecurityManager</code>, which runs on its own thread. The <code class="language-plaintext highlighter-rouge">SamzaAppMasterSecurityManager</code> is responsible for periodically logging in through the given Kerberos keytab and regenerating the HDFS delegation tokens. After each run, it writes the new tokens to a pre-defined, job-specific directory on HDFS. The frequency of this process is determined by <code class="language-plaintext highlighter-rouge">yarn.token.renewal.interval.seconds</code>.</p>
 
-<p>Each container, upon start, runs a 
<code>SamzaContainerSecurityManager</code>. It reads from the credentials file 
on HDFS and refreshes its delegation tokens at the same interval.</p>
+<p>Each container, upon start, runs a <code class="language-plaintext 
highlighter-rouge">SamzaContainerSecurityManager</code>. It reads from the 
credentials file on HDFS and refreshes its delegation tokens at the same 
interval.</p>
 
 <h3 id="configuration">Configuration</h3>
 
 <ol>
-<li>For the Samza job, the following job configurations are required on a YARN 
cluster with security enabled.
-# Job
-job.security.manager.factory=org.apache.samza.job.yarn.SamzaYarnSecurityManagerFactory</li>
+  <li>For the Samza job, the following job configurations are required on a YARN cluster with security enabled.
+    <figure class="highlight"><pre><code class="language-properties" data-lang="properties"># Job
+job.security.manager.factory=org.apache.samza.job.yarn.SamzaYarnSecurityManagerFactory</code></pre></figure>
+  </li>
 </ol>
 
 <h1 id="yarn">YARN</h1>
 
-<figure class="highlight"><pre><code class="language-properties" 
data-lang="properties"><span></span><span 
class="na">yarn.kerberos.principal</span><span class="o">=</span><span 
class="s">user/localhost</span>
-<span class="na">yarn.kerberos.keytab</span><span class="o">=</span><span 
class="s">/etc/krb5.keytab.user</span>
-<span class="na">yarn.token.renewal.interval.seconds</span><span 
class="o">=</span><span class="s">86400</span></code></pre></figure>
+<figure class="highlight"><pre><code class="language-properties" 
data-lang="properties"><span class="py">yarn.kerberos.principal</span><span 
class="p">=</span><span class="s">user/localhost</span>
+<span class="py">yarn.kerberos.keytab</span><span class="p">=</span><span 
class="s">/etc/krb5.keytab.user</span>
+<span class="py">yarn.token.renewal.interval.seconds</span><span 
class="p">=</span><span class="s">86400</span></code></pre></figure>
 
 <ol>
-<li>Configure the Hadoop cluster to enable Resource Manager to recreate and 
renew the delegation token on behalf of the application user. This will address 
the following 2 scenarios.</li>
-</ol>
-<div class="highlight"><pre><code class="language-text" 
data-lang="text"><span></span>* When Application Master dies unexpectedly and 
needs a restart after 7 days (the default maximum lifespan a delegation token 
can be renewed).
+  <li>
+    <p>Configure the Hadoop cluster to enable the Resource Manager to recreate and renew the delegation token on behalf of the application user. This addresses the following two scenarios.</p>
 
-* When the Samza job terminates and log aggregation is turned on for the job. 
Node managers need to be able to upload all the local application logs to HDFS.
+    <ul>
+      <li>
+        <p>When the Application Master dies unexpectedly and needs a restart after 7 days (the default maximum lifetime for which a delegation token can be renewed).</p>
+      </li>
+      <li>
+        <p>When the Samza job terminates and log aggregation is turned on for 
the job. Node managers need to be able to upload all the local application logs 
to HDFS.</p>
+      </li>
+    </ul>
+
+    <ol>
+      <li>Enable the resource manager as a privileged user in 
yarn-site.xml.</li>
+    </ol>
+  </li>
+</ol>
 
-1. Enable the resource manager as a privileged user in yarn-site.xml.
-</code></pre></div>
-<figure class="highlight"><pre><code class="language-xml" 
data-lang="xml"><span></span>        <span class="nt">&lt;property&gt;</span>
+<figure class="highlight"><pre><code class="language-xml" data-lang="xml">     
   <span class="nt">&lt;property&gt;</span>
             <span 
class="nt">&lt;name&gt;</span>yarn.resourcemanager.proxy-user-privileges.enabled<span
 class="nt">&lt;/name&gt;</span>
             <span class="nt">&lt;value&gt;</span>true<span 
class="nt">&lt;/value&gt;</span>
         <span class="nt">&lt;/property&gt;</span>
     </code></pre></figure>
-<div class="highlight"><pre><code class="language-text" 
data-lang="text"><span></span>2. Make `yarn` as a proxy user, in core-site.xml
-</code></pre></div>
-<figure class="highlight"><pre><code class="language-xml" 
data-lang="xml"><span></span>        <span class="nt">&lt;property&gt;</span>
+
+<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code>2. Make `yarn` as a proxy user, in core-site.xml
+</code></pre></div></div>
+
+<figure class="highlight"><pre><code class="language-xml" data-lang="xml">     
   <span class="nt">&lt;property&gt;</span>
             <span 
class="nt">&lt;name&gt;</span>hadoop.proxyuser.yarn.hosts<span 
class="nt">&lt;/name&gt;</span>
             <span class="nt">&lt;value&gt;</span>*<span 
class="nt">&lt;/value&gt;</span>
         <span class="nt">&lt;/property&gt;</span>
@@ -708,6 +749,7 @@ job.security.manager.factory=org.apache.
         <span class="nt">&lt;/property&gt;</span>
     </code></pre></figure>
 
+
            
         </div>
       </div>

Modified: samza/site/learn/tutorials/latest/deploy-samza-job-from-hdfs.html
URL: 
http://svn.apache.org/viewvc/samza/site/learn/tutorials/latest/deploy-samza-job-from-hdfs.html?rev=1906774&r1=1906773&r2=1906774&view=diff
==============================================================================
--- samza/site/learn/tutorials/latest/deploy-samza-job-from-hdfs.html (original)
+++ samza/site/learn/tutorials/latest/deploy-samza-job-from-hdfs.html Wed Jan 
18 19:33:25 2023
@@ -227,6 +227,12 @@
     
       
         
+      <a class="side-navigation__group-item" data-match-active="" 
href="/releases/1.8.0">1.8.0</a>
+      
+        
+      <a class="side-navigation__group-item" data-match-active="" 
href="/releases/1.7.0">1.7.0</a>
+      
+        
       <a class="side-navigation__group-item" data-match-active="" 
href="/releases/1.6.0">1.6.0</a>
       
         
@@ -545,11 +551,11 @@
    limitations under the License.
 -->
 
-<p>This tutorial uses <a 
href="../../../startup/hello-samza/latest/">hello-samza</a> to illustrate how 
to run a Samza job if you want to publish the Samza job&rsquo;s .tar.gz package 
to HDFS.</p>
+<p>This tutorial uses <a 
href="../../../startup/hello-samza/latest/">hello-samza</a> to illustrate how 
to run a Samza job if you want to publish the Samza job’s .tar.gz package to 
HDFS.</p>
 
 <h3 id="upload-the-package">Upload the package</h3>
 
-<figure class="highlight"><pre><code class="language-bash" 
data-lang="bash"><span></span>hadoop fs -put 
./target/hello-samza-1.1.0-dist.tar.gz /path/for/tgz</code></pre></figure>
+<figure class="highlight"><pre><code class="language-bash" 
data-lang="bash">hadoop fs <span class="nt">-put</span> 
./target/hello-samza-1.1.0-dist.tar.gz /path/for/tgz</code></pre></figure>
 
 <h3 id="add-hdfs-configuration">Add HDFS configuration</h3>
 
@@ -559,7 +565,7 @@
 
 <p>Change the yarn.package.path in the properties file to your HDFS 
location.</p>
 
-<figure class="highlight"><pre><code class="language-jproperties" 
data-lang="jproperties"><span></span><span 
class="na">yarn.package.path</span><span class="o">=</span><span 
class="s">hdfs://&lt;hdfs name node ip&gt;:&lt;hdfs name node 
port&gt;/path/to/tgz</span></code></pre></figure>
+<figure class="highlight"><pre><code class="language-jproperties" 
data-lang="jproperties">yarn.package.path=hdfs://&lt;hdfs name node 
ip&gt;:&lt;hdfs name node port&gt;/path/to/tgz</code></pre></figure>
 
 <p>Then you should be able to run the Samza job as described in <a 
href="../../../startup/hello-samza/latest/">hello-samza</a>.</p>
 

Modified: samza/site/learn/tutorials/latest/deploy-samza-to-CDH.html
URL: 
http://svn.apache.org/viewvc/samza/site/learn/tutorials/latest/deploy-samza-to-CDH.html?rev=1906774&r1=1906773&r2=1906774&view=diff
==============================================================================
--- samza/site/learn/tutorials/latest/deploy-samza-to-CDH.html (original)
+++ samza/site/learn/tutorials/latest/deploy-samza-to-CDH.html Wed Jan 18 
19:33:25 2023
@@ -227,6 +227,12 @@
     
       
         
+      <a class="side-navigation__group-item" data-match-active="" 
href="/releases/1.8.0">1.8.0</a>
+      
+        
+      <a class="side-navigation__group-item" data-match-active="" 
href="/releases/1.7.0">1.7.0</a>
+      
+        
       <a class="side-navigation__group-item" data-match-active="" 
href="/releases/1.6.0">1.6.0</a>
       
         
@@ -547,39 +553,40 @@
 
 <p>The tutorial assumes you have successfully run <a 
href="../../../startup/hello-samza/latest/">hello-samza</a> and now you want to 
deploy the job to your Cloudera Data Hub (<a 
href="http://www.cloudera.com/content/cloudera/en/products-and-services/cdh.html";>CDH</a>).
 This tutorial is based on CDH 5.4.0 and uses hello-samza as the example 
job.</p>
 
-<h3 id="compile-package-for-cdh-5-4-0">Compile Package for CDH 5.4.0</h3>
+<h3 id="compile-package-for-cdh-540">Compile Package for CDH 5.4.0</h3>
 
 <p>We need to use a specific compile option to build hello-samza package for 
CDH 5.4.0</p>
 
-<figure class="highlight"><pre><code class="language-bash" 
data-lang="bash"><span></span>mvn clean package -Dhadoop.version<span 
class="o">=</span>cdh5.4.0</code></pre></figure>
+<figure class="highlight"><pre><code class="language-bash" 
data-lang="bash">mvn clean package <span 
class="nt">-Dhadoop</span>.version<span 
class="o">=</span>cdh5.4.0</code></pre></figure>
 
 <h3 id="upload-package-to-cluster">Upload Package to Cluster</h3>
 
-<p>There are a few ways of uploading the package to the cluster&rsquo;s HDFS. 
If you do not have the job package in your cluster, <strong>scp</strong> from 
you local machine to the cluster. Then run</p>
+<p>There are a few ways of uploading the package to the cluster’s HDFS. If you do not have the job package in your cluster, <strong>scp</strong> it from your local machine to the cluster. Then run</p>
 
-<figure class="highlight"><pre><code class="language-bash" 
data-lang="bash"><span></span>hadoop fs -put 
path/to/hello-samza-1.1.0-dist.tar.gz /path/for/tgz</code></pre></figure>
+<figure class="highlight"><pre><code class="language-bash" 
data-lang="bash">hadoop fs <span class="nt">-put</span> 
path/to/hello-samza-1.1.0-dist.tar.gz /path/for/tgz</code></pre></figure>
 
 <h3 id="get-deploying-scripts">Get Deploying Scripts</h3>
 
 <p>Untar the job package (assume you will run from the current directory)</p>
 
-<figure class="highlight"><pre><code class="language-bash" 
data-lang="bash"><span></span>tar -xvf 
path/to/samza-job-package-1.1.0-dist.tar.gz -C ./</code></pre></figure>
+<figure class="highlight"><pre><code class="language-bash" 
data-lang="bash"><span class="nb">tar</span> <span class="nt">-xvf</span> 
path/to/samza-job-package-1.1.0-dist.tar.gz <span class="nt">-C</span> 
./</code></pre></figure>
 
 <h3 id="add-package-path-to-properties-file">Add Package Path to Properties 
File</h3>
 
-<figure class="highlight"><pre><code class="language-bash" 
data-lang="bash"><span></span>vim 
config/wikipedia-parser.properties</code></pre></figure>
+<figure class="highlight"><pre><code class="language-bash" 
data-lang="bash">vim config/wikipedia-parser.properties</code></pre></figure>
 
 <p>Change the yarn package path:</p>
 
-<figure class="highlight"><pre><code class="language-jproperties" 
data-lang="jproperties"><span></span><span 
class="na">yarn.package.path</span><span class="o">=</span><span 
class="s">hdfs://&lt;hdfs name node ip&gt;:&lt;hdfs name node 
port&gt;/path/to/tgz</span></code></pre></figure>
+<figure class="highlight"><pre><code class="language-jproperties" 
data-lang="jproperties">yarn.package.path=hdfs://&lt;hdfs name node 
ip&gt;:&lt;hdfs name node port&gt;/path/to/tgz</code></pre></figure>
 
 <h3 id="set-yarn-environment-variable">Set Yarn Environment Variable</h3>
 
-<figure class="highlight"><pre><code class="language-bash" 
data-lang="bash"><span></span><span class="nb">export</span> <span 
class="nv">HADOOP_CONF_DIR</span><span 
class="o">=</span>/etc/hadoop/conf</code></pre></figure>
+<figure class="highlight"><pre><code class="language-bash" 
data-lang="bash"><span class="nb">export </span><span 
class="nv">HADOOP_CONF_DIR</span><span 
class="o">=</span>/etc/hadoop/conf</code></pre></figure>
 
 <h3 id="run-samza-job">Run Samza Job</h3>
 
-<figure class="highlight"><pre><code class="language-bash" 
data-lang="bash"><span></span>bin/run-app.sh --config-path<span 
class="o">=</span><span 
class="nv">$PWD</span>/config/wikipedia-parser.properties</code></pre></figure>
+<figure class="highlight"><pre><code class="language-bash" 
data-lang="bash">bin/run-app.sh <span class="nt">--config-path</span><span 
class="o">=</span><span 
class="nv">$PWD</span>/config/wikipedia-parser.properties</code></pre></figure>
+
 
            
         </div>

