Author: martinkl
Date: Fri Jun 13 22:24:05 2014
New Revision: 1602536
URL: http://svn.apache.org/r1602536
Log:
File omitted from last commit
Added:
incubator/samza/site/learn/documentation/0.7.0/jobs/reprocessing.html
Added: incubator/samza/site/learn/documentation/0.7.0/jobs/reprocessing.html
URL:
http://svn.apache.org/viewvc/incubator/samza/site/learn/documentation/0.7.0/jobs/reprocessing.html?rev=1602536&view=auto
==============================================================================
--- incubator/samza/site/learn/documentation/0.7.0/jobs/reprocessing.html (added)
+++ incubator/samza/site/learn/documentation/0.7.0/jobs/reprocessing.html Fri Jun 13 22:24:05 2014
@@ -0,0 +1,202 @@
+<!DOCTYPE html>
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+<html lang="en">
+ <head>
+ <meta charset="utf-8">
+ <title>Samza - Reprocessing previously processed data</title>
+ <link href='/css/ropa-sans.css' rel='stylesheet' type='text/css'/>
+ <link href="/css/bootstrap.min.css" rel="stylesheet"/>
+ <link href="/css/font-awesome.min.css" rel="stylesheet"/>
+ <link href="/css/main.css" rel="stylesheet"/>
+ <link rel="icon" type="image/png" href="/img/samza-icon.png">
+ </head>
+ <body>
+ <div class="wrapper">
+ <div class="wrapper-content">
+
+ <div class="masthead">
+ <div class="container">
+ <div class="masthead-logo">
+ <a href="/" class="logo">samza</a>
+ </div>
+ <div class="masthead-icons">
+ <div class="pull-right">
+ <a href="/startup/download"><i class="fa
fa-arrow-circle-o-down masthead-icon"></i></a>
+ <a
href="https://git-wip-us.apache.org/repos/asf?p=incubator-samza.git;a=tree"
target="_blank"><i class="fa fa-code masthead-icon" style="font-weight:
bold;"></i></a>
+ <a href="https://twitter.com/samzastream" target="_blank"><i
class="fa fa-twitter masthead-icon"></i></a>
+ </div>
+ </div>
+ </div><!-- /.container -->
+ </div>
+
+ <div class="container">
+ <div class="menu">
+ <h1><i class="fa fa-rocket"></i> Getting Started</h1>
+ <ul>
+ <li><a href="/startup/hello-samza/0.7.0">Hello Samza</a></li>
+ <li><a href="/startup/download">Download</a></li>
+ </ul>
+
+ <h1><i class="fa fa-book"></i> Learn</h1>
+ <ul>
+ <li><a href="/learn/documentation/0.7.0">Documentation</a></li>
+ <li><a href="/learn/tutorials/0.7.0">Tutorials</a></li>
+ <li><a href="http://wiki.apache.org/samza/FAQ">FAQ</a></li>
+ <li><a href="http://wiki.apache.org/samza">Wiki</a></li>
+ <li><a href="http://wiki.apache.org/samza/PapersAndTalks">Papers
&amp; Talks</a></li>
+ <li><a href="http://blogs.apache.org/samza">Blog</a></li>
+ </ul>
+
+ <h1><i class="fa fa-comments"></i> Community</h1>
+ <ul>
+ <li><a href="/community/mailing-lists.html">Mailing
Lists</a></li>
+ <li><a href="/community/irc.html">IRC</a></li>
+ <li><a
href="https://issues.apache.org/jira/browse/SAMZA">Bugs</a></li>
+ <li><a href="http://wiki.apache.org/samza/PoweredBy">Powered
by</a></li>
+ <li><a
href="http://wiki.apache.org/samza/Ecosystem">Ecosystem</a></li>
+ <li><a href="/community/committers.html">Committers</a></li>
+ </ul>
+
+ <h1><i class="fa fa-code"></i> Contribute</h1>
+ <ul>
+ <li><a href="/contribute/rules.html">Rules</a></li>
+ <li><a href="/contribute/coding-guide.html">Coding Guide</a></li>
+ <li><a href="/contribute/projects.html">Projects</a></li>
+ <li><a href="/contribute/seps.html">SEPs</a></li>
+ <li><a href="/contribute/code.html">Code</a></li>
+ <li><a href="https://reviews.apache.org/groups/samza">Review
Board</a></li>
+ <li><a href="https://builds.apache.org/">Unit Tests</a></li>
+ <li><a href="/contribute/disclaimer.html">Disclaimer</a></li>
+ </ul>
+ </div>
+
+ <div class="content">
+
+<h2>Reprocessing previously processed data</h2>
+
+<p>From time to time you may want to deploy a new version of your Samza job
that computes results differently. Perhaps you fixed a bug or introduced a new
feature. For example, say you have a Samza job that classifies messages as spam
or not-spam, using a machine learning model that you train offline.
Periodically you want to deploy an updated version of your Samza job which
includes the latest classification model.</p>
+
+<p>When you start up a new version of your job, a question arises: what do you
want to do with messages that were previously processed with the old version of
your job? The answer depends on the behavior you want:</p>
+
+<ol>
+<li><p><strong>No reprocessing:</strong> By default, Samza assumes that
messages processed by the old version don’t need to be processed again.
When the new version starts up, it will resume processing at the point where
the old version left off (assuming you have <a
href="../container/checkpointing.html">checkpointing</a> enabled). If this is
the behavior you want, there’s nothing special you need to do.</p></li>
+<li><p><strong>Simple rewind:</strong> Perhaps you want to go back and
re-process old messages using the new version of your job. For example, maybe
the old version of your classifier marked things as spam too aggressively, so
you now want to revisit its previous spam/not-spam decisions using an improved
classifier. You can do this by restarting the job at an older point in time in
the stream, and running through all the messages since that time. Thus your job
starts off reprocessing messages that it has already seen, but it then
seamlessly continues with new messages when the reprocessing is done.</p>
+
+<p>This approach requires an input system such as Kafka, which allows you to
jump back in time to a previous point in the stream. We discuss below how this
works in practice.</p></li>
+<li><p><strong>Parallel rewind:</strong> This approach avoids a downside of
the <em>simple rewind</em> approach. With simple rewind, any new messages that
appear while the job is reprocessing old data are queued up, and are processed
when the reprocessing is done. The queueing delay needn’t be long,
because Samza can stream through historical data very quickly, but some
latency-sensitive applications need to process messages faster.</p>
+
+<p>In the <em>parallel rewind</em> approach, you run two jobs in parallel: one
job continues to handle live updates with low latency (the <em>real-time
job</em>), while the other is started at an older point in the stream and
reprocesses historical data (the <em>reprocessing job</em>). The two jobs
consume the same input stream at different points in time, and eventually the
reprocessing job catches up with the real-time job.</p>
+
+<p>There are a few details that you need to think through before deploying
parallel rewind, which we discuss below.</p></li>
+</ol>
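+
+<p>How long the rewound job takes to reach the head of the stream depends on how much faster it reads than new messages arrive. The following back-of-the-envelope sketch assumes constant rates; the function name, its parameters and the numbers in the usage note are hypothetical and are not measurements of Samza.</p>

```python
def catch_up_seconds(backlog_messages: float,
                     reprocess_rate: float,
                     arrival_rate: float) -> float:
    """Seconds until a rewound job reaches the head of the stream.

    Assumes constant rates in messages per second; real throughput varies.
    """
    if reprocess_rate <= arrival_rate:
        raise ValueError("the job only catches up if it reads faster than messages arrive")
    # Every second the gap shrinks by (reprocess_rate - arrival_rate) messages.
    return backlog_messages / (reprocess_rate - arrival_rate)
```

+<p>For example, <code>catch_up_seconds(1000, 150, 50)</code> gives 10 seconds: the job works through 150 messages per second while 50 new ones arrive, so the backlog of 1000 shrinks by 100 per second.</p>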
+
+<h3 id="toc_0">Jumping Back in Time</h3>
+
+<p>Both the <em>simple rewind</em> and <em>parallel rewind</em> approaches
rely on a job that jumps back to an old point in time in the input streams and
consumes all messages since that time. You achieve this by working with
Samza’s checkpoints.</p>
+
+<p>Normally, when a Samza job starts up, it reads the latest checkpoint to
determine at which offset in the input streams it needs to resume processing.
If you need to rewind to an earlier time, you do that in one of two ways:</p>
+
+<ol>
+<li>You can stop the job, manipulate its last checkpoint to point to an older
offset, and start the job up again. Samza includes a command-line tool called
<a href="../container/checkpointing.html#toc_0">CheckpointTool</a> which you
can use to manipulate checkpoints.</li>
+<li>You can start a new job with a different <em>job.name</em> or
<em>job.id</em> (e.g. increment <em>job.id</em> every time you need to jump
back in time). This gives the job a new checkpoint stream, with none of the old
checkpoint information. You also need to set <a
href="../container/checkpointing.html">samza.offset.default=oldest</a>
(configured per input system, as <em>systems.*.samza.offset.default</em>), so
that when the job starts up without a checkpoint, it starts consuming at the
oldest offset available.</li>
+</ol>
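+
+<p>The two options above differ only in how the starting offset is chosen. As a minimal sketch (a simplified model, not Samza’s actual implementation), the decision for a single partition might look like this. The function name and its parameters are hypothetical; only the <em>oldest</em> and <em>upcoming</em> values come from the <em>samza.offset.default</em> setting, and resuming at the message after the checkpointed offset is an assumption of this model.</p>

```python
from typing import Optional


def starting_offset(checkpoint: Optional[int],
                    offset_default: str,
                    oldest: int,
                    upcoming: int) -> int:
    """Pick the offset at which a task starts consuming one partition."""
    if checkpoint is not None:
        # Option 1: a checkpoint exists (possibly rewritten to an older
        # offset), so resume with the message after it.
        return checkpoint + 1
    # Option 2: a new job.name or job.id means there is no checkpoint yet,
    # so the configured default decides where to start.
    if offset_default == "oldest":
        return oldest
    if offset_default == "upcoming":
        return upcoming
    raise ValueError("unknown offset default: " + offset_default)
```

+<p>In this model, rewriting a checkpoint from 9000 to 41 makes the job resume at offset 42, while a job with no checkpoint and the default set to <em>oldest</em> starts at the oldest offset the input system still retains.</p>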
+
+<p>With either of these approaches you can get Samza to reprocess the entire
history of messages in the input system. Input systems such as Kafka can retain
a large amount of history, as discussed under <a href="#toc_1">Retention of
history</a> below. To speed up
the reprocessing of historical data, you can increase the container count
(<em>yarn.container.count</em> if you’re running Samza on YARN) to boost
your job’s computational resources.</p>
+
+<p>If your job maintains any <a
href="../container/state-management.html">persistent state</a>, you need to be
careful when jumping back in time: resetting a checkpoint does not
automatically change persistent state, so you could end up reprocessing old
messages while using state from a later point in time. In most cases, a job
that jumps back in time should start with an empty state. You can reset the
state by deleting the changelog topic, or by changing the name of the changelog
topic in your job configuration.</p>
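+
+<p>Starting a new job with a fresh checkpoint stream and an empty changelog comes down to changing a few configuration properties. The sketch below derives them from an existing configuration held as a dictionary. The helper name, the numeric <em>job.id</em> scheme and the suffix appended to the changelog name are illustrative assumptions, not Samza conventions; the property names follow the pattern <em>systems.*.samza.offset.default</em> and <em>stores.*.changelog</em>.</p>

```python
from typing import Dict


def rewound_job_config(base: Dict[str, str], system: str, store: str) -> Dict[str, str]:
    """Return a copy of `base` adjusted so the job starts over from the oldest offset."""
    config = dict(base)  # leave the caller's configuration untouched
    # A new job.id gives the job a fresh checkpoint stream.
    # Assumption: job.id is numeric; a missing job.id is treated as 1 here.
    config["job.id"] = str(int(base.get("job.id", "1")) + 1)
    # With no checkpoint to read, start from the oldest available offset.
    config[f"systems.{system}.samza.offset.default"] = "oldest"
    # Point the store at a new, empty changelog so no later state leaks in.
    changelog_key = f"stores.{store}.changelog"
    if changelog_key in base:
        config[changelog_key] = f"{base[changelog_key]}-{config['job.id']}"
    return config
```

+<p>Given a base configuration with <em>job.id=1</em> and a store whose changelog is <em>kafka.spam-classifier-scores</em> (a made-up name), the helper returns <em>job.id=2</em>, sets the offset default to <em>oldest</em> and renames the changelog to <em>kafka.spam-classifier-scores-2</em>.</p>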
+
+<p>When you’re jumping back in time, you’re using Samza somewhat
like a batch processing framework (e.g. MapReduce) — with the difference
that your job doesn’t stop when it has processed all the historical data,
but instead continues running, incrementally processing the stream of new
messages as they come in. This has the advantage that you don’t need to
write and maintain separate batch and streaming versions of your job: you can
just use the same Samza API for processing both real-time and historical
data.</p>
+
+<h3 id="toc_1">Retention of history</h3>
+
+<p>Samza doesn’t maintain history itself — that is the
responsibility of the input system, such as Kafka. How far back in time you can
jump depends on the amount of history that is retained in that system.</p>
+
+<p>Kafka is designed to keep a fairly large amount of history: it is common
for Kafka brokers to keep one or two weeks of message history accessible, even
for high volume topics. The retention period is mostly determined by how much
disk space you have available. Kafka’s performance <a
href="http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines">remains
high</a> even if you have terabytes of history.</p>
+
+<p>There are two different kinds of history which require different
configuration:</p>
+
+<ul>
+<li><strong>Activity events</strong> are things like user tracking events, web
server log events and the like. This kind of stream is typically configured
with a time-based retention, e.g. a few weeks. Events older than the retention
period are deleted (or archived in an offline system such as HDFS).</li>
+<li><strong>Database changes</strong> are events that show inserts, updates
and deletes in a database. In this kind of stream, each event typically has a
primary key, and a newer event for a key overwrites any older events for the
same key. If the same key is updated many times, you’re only really
interested in the most recent value. (The <a
href="../container/state-management.html">changelog streams</a> used by
Samza’s persistent state fall in this category.)</li>
+</ul>
+
+<p>In a database change stream, when you’re reprocessing data, you
typically want to reprocess the entire database. You don’t want to miss a
value just because it was last updated more than a few weeks ago. In other
words, you don’t want change events to be deleted just because they are
older than some threshold. In this case, when you’re jumping back in
time, you need to rewind to the <em>beginning of time</em>, to the first change
ever made to the database (known in Kafka as “offset 0”).</p>
+
+<p>Fortunately this can be done efficiently, using a Kafka feature called <a
href="http://kafka.apache.org/documentation.html#compaction">log
compaction</a>. </p>
+
+<p>For example, imagine your database contains counters: every time something
happens, you increment the appropriate counters and update the database with
the new counter values. Every update is sent to the changelog, and because
there are many updates, the changelog stream will take up a lot of space. With
log compaction turned on, Kafka deduplicates the stream in the background,
keeping only the most recent counter value for each key, and deleting any old
values for the same counter. This reduces the size of the stream so much that
you can keep the most recent update for every key, even if it was last updated
long ago.</p>
+
+<p>With log compaction enabled, the stream of database changes becomes a full
copy of the entire database. By jumping back to offset 0, your Samza job can
scan over the entire database and reprocess it. This is a very powerful way of
building scalable applications.</p>
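+
+<p>The effect of log compaction on the counter example can be illustrated with a small in-memory model. This is only a sketch of the idea: Kafka’s real compaction runs in the background on older log segments and has further details not modelled here. The function names and the sample records are hypothetical.</p>

```python
from typing import Dict, Hashable, List, Tuple

Record = Tuple[int, Hashable, object]  # (offset, key, value)


def compact(log: List[Record]) -> List[Record]:
    """Keep only the newest record for each key, preserving offsets and order."""
    newest_offset: Dict[Hashable, int] = {}
    for offset, key, _value in log:
        newest_offset[key] = offset  # later records replace earlier ones
    return [record for record in log if newest_offset[record[1]] == record[0]]


def replay(log: List[Record]) -> Dict[Hashable, object]:
    """Rebuild the table by applying every change from the start of the log."""
    table: Dict[Hashable, object] = {}
    for _offset, key, value in log:
        table[key] = value
    return table


changelog = [
    (0, "page_views", 1),
    (1, "signups", 1),
    (2, "page_views", 2),
    (3, "page_views", 3),
    (4, "signups", 2),
]
compacted = compact(changelog)
```

+<p>Here <code>compacted</code> holds only the records at offsets 3 and 4, yet replaying it from the beginning yields the same table as replaying the full changelog. This is why a job that rewinds to the start of a compacted stream still sees the latest value for every key.</p>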
+
+<h3 id="toc_2">Details of Parallel Rewind</h3>
+
+<p>If you are taking the <em>parallel rewind</em> approach described above,
running two jobs in parallel, you need to configure them carefully to avoid
problems. In particular, some things to look out for:</p>
+
+<ul>
+<li>Make sure that the two jobs don’t interfere with each other. They
need different <em>job.name</em> or <em>job.id</em> configuration properties,
so that each job gets its own checkpoint stream. If the jobs maintain <a
href="../container/state-management.html">persistent state</a>, each job needs
its own changelog (two different jobs writing to the same changelog produces
undefined results).</li>
+<li>What happens to job output? If the job sends its results to an output
stream, or writes to a database, then the easiest solution is for each job to
have a separate output stream or database table. If they write to the same
output, you need to take care to ensure that newer data isn’t overwritten
with older data (due to race conditions between the two jobs).</li>
+<li>Do you need to support A/B testing between the old and the new version of
your job, e.g. to test whether the new version improves your metrics? Parallel
rewind is ideal for this: each job writes to a separate output, and clients or
consumers of the output can read from either the old or the new version’s
output, depending on whether a user is in test group A or B.</li>
+<li>Reclaiming resources: you might want to keep the old version of your job
running for a while, even when the new version has finished reprocessing
historical data (especially if the old version’s output is being used in
an A/B test). However, eventually you’ll want to shut it down, and delete
the checkpoint and changelog streams belonging to the old version.</li>
+</ul>
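+
+<p>If both jobs do write to the same output, one possible way to keep older data from overwriting newer data is to store the input offset alongside each result and to reject writes that come from an earlier offset. The sketch below uses a dictionary as a stand-in for the shared table. The function name is hypothetical, and the scheme assumes that all results for a key are derived from the same input partition, so that their offsets are comparable. It is one option among several, not something Samza provides.</p>

```python
from typing import Dict, Tuple


def write_if_newer(table: Dict[str, Tuple[int, str]],
                   key: str, offset: int, value: str) -> bool:
    """Store `value` unless the table already holds a result from a later offset."""
    existing = table.get(key)
    if existing is not None and existing[0] > offset:
        # The stored result was computed from a more recent input message.
        return False
    table[key] = (offset, value)
    return True
```

+<p>With this guard, a result that the reprocessing job derives from offset 10 does not replace one that the real-time job derived from offset 500. Once the reprocessing job has caught up, its writes carry current offsets and are accepted.</p>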
+
+<p>Samza gives you a lot of flexibility for reprocessing historical data, and
you don’t need to program against a separate batch processing API to take
advantage of it. If you’re mindful of these issues, you can build a data
system that is very robust, but still gives you lots of freedom to change your
processing logic in future.</p>
+
+<h2 id="toc_3"><a href="../yarn/application-master.html">Application Master
»</a></h2>
+
+
+ </div>
+ </div>
+
+ </div><!-- /.wrapper-content -->
+ </div><!-- /.wrapper -->
+
+ <div class="footer">
+ <div class="container">
+ <!-- nothing for now. -->
+ </div>
+ </div>
+
+ <!-- Google Analytics -->
+ <script>
+
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
+ (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new
Date();a=s.createElement(o),
+
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
+
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
+
+ ga('create', 'UA-43122768-1', 'apache.org');
+ ga('send', 'pageview');
+
+ </script>
+ </body>
+</html>