http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/documentation/versioned/container/state-management.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/container/state-management.md 
b/docs/learn/documentation/versioned/container/state-management.md
new file mode 100644
index 0000000..d71de0b
--- /dev/null
+++ b/docs/learn/documentation/versioned/container/state-management.md
@@ -0,0 +1,238 @@
+---
+layout: page
+title: State Management
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+One of the more interesting features of Samza is stateful stream processing. 
Tasks can store and query data through APIs provided by Samza. That data is 
stored on the same machine as the stream task; compared to connecting over the 
network to a remote database, Samza's local state allows you to read and write 
large amounts of data with better performance. Samza replicates this state 
across multiple machines for fault-tolerance (described in detail below).
+
+Some stream processing jobs don't require state: if you only need to transform 
one message at a time, or filter out messages based on some condition, your job 
can be simple. Every call to your task's [process method](../api/overview.html) 
handles one incoming message, and each message is independent of all the other 
messages.
+
+However, being able to maintain state opens up many possibilities for 
sophisticated stream processing jobs: joining input streams, grouping messages 
and aggregating groups of messages. By analogy to SQL, the *select* and *where* 
clauses of a query are usually stateless, but *join*, *group by* and 
aggregation functions like *sum* and *count* require state. Samza doesn't yet 
provide a higher-level SQL-like language, but it does provide lower-level 
primitives that you can use to implement streaming aggregation and joins.
+
+### Common use cases for stateful processing
+
+First, let's look at some simple examples of stateful stream processing that 
might be seen in the backend of a consumer website. Later in this page we'll 
discuss how to implement these applications using Samza's built-in key-value 
storage capabilities.
+
+#### Windowed aggregation
+
+*Example: Counting the number of page views for each user per hour*
+
+In this case, your state typically consists of a number of counters which are 
incremented when a message is processed. The aggregation is typically limited 
to a time window (e.g. 1 minute, 1 hour, 1 day) so that you can observe changes 
of activity over time. This kind of windowed processing is common for ranking 
and relevance, detecting "trending topics", as well as real-time reporting and 
monitoring.
+
+The simplest implementation keeps this state in memory (e.g. a hash map in the 
task instances), and writes it to a database or output stream at the end of 
every time window. However, you need to consider what happens when a container 
fails and your in-memory state is lost. You might be able to restore it by 
processing all the messages in the current window again, but that might take a 
long time if the window covers a long period of time. Samza can speed up this 
recovery by making the state fault-tolerant rather than trying to recompute it.
+
+#### Table-table join
+
+*Example: Join a table of user profiles to a table of user settings by 
user\_id and emit the joined stream*
+
+You might wonder: does it make sense to join two tables in a stream processing 
system? It does if your database can supply a log of all the changes in the 
database. There is a [duality between a database and a changelog 
stream](http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying):
 you can publish every data change to a stream, and if you consume the entire 
stream from beginning to end, you can reconstruct the entire contents of the 
database. Samza is designed for data processing jobs that follow this 
philosophy.
+
+If you have changelog streams for several database tables, you can write a 
stream processing job which keeps the latest state of each table in a local 
key-value store, where you can access it much faster than by making queries to 
the original database. Now, whenever data in one table changes, you can join it 
with the latest data for the same key in the other table, and output the joined 
result.
+
+There are several real-life examples of data normalization which essentially 
work in this way:
+
+* E-commerce companies like Amazon and eBay need to import feeds of 
merchandise from merchants, normalize them by product, and present products 
with all the associated merchants and pricing information.
+* Web search requires building a crawler which creates essentially a [table of 
web page contents](http://labs.yahoo.com/files/YahooWebmap.pdf) and joins in all the relevance attributes such as click-through rate or PageRank.
+* Social networks take feeds of user-entered text and need to normalize out 
entities such as companies, schools, and skills.
+
+Each of these use cases is a massively complex data normalization problem that 
can be thought of as constructing a materialized view over many input tables. 
Samza can help implement such data processing pipelines robustly.
+
+#### Stream-table join
+
+*Example: Augment a stream of page view events with the user's ZIP code 
(perhaps to allow aggregation by zip code in a later stage)*
+
+Joining side-information to a real-time feed is a classic use for stream 
processing. It's particularly common in advertising, relevance ranking, fraud 
detection and other domains. Activity events such as page views generally only 
include a small number of attributes, such as the ID of the viewer and the 
viewed items, but not detailed attributes of the viewer and the viewed items, 
such as the ZIP code of the user. If you want to aggregate the stream by 
attributes of the viewer or the viewed items, you need to join with the users 
table or the items table respectively.
+
+In data warehouse terminology, you can think of the raw event stream as rows 
in the central fact table, which needs to be joined with dimension tables so 
that you can use attributes of the dimensions in your analysis.
+
+#### Stream-stream join
+
+*Example: Join a stream of ad clicks to a stream of ad impressions (to link 
the information on when the ad was shown to the information on when it was 
clicked)*
+
+A stream join is useful for "nearly aligned" streams, where you expect to 
receive related events on several input streams, and you want to combine them 
into a single output event. You cannot rely on the events arriving at the 
stream processor at the same time, but you can set a maximum period of time 
over which you allow the events to be spread out.
+
+In order to perform a join between streams, your job needs to buffer events 
for the time window over which you want to join. For short time windows, you 
can do this in memory (at the risk of losing events if the machine fails). You 
can also use Samza's state store to buffer events, which supports buffering 
more messages than you can fit in memory.
+
+#### More
+
+There are many variations of joins and aggregations, but most are essentially combinations of the above patterns.
+
+### Approaches to managing task state
+
+So how do systems support this kind of stateful processing? We'll lead in by 
describing what we have seen in other stream processing systems, and then 
describe what Samza does.
+
+#### In-memory state with checkpointing
+
+A simple approach, common in academic stream processing systems, is to 
periodically save the task's entire in-memory data to durable storage. This 
approach works well if the in-memory state consists of only a few values. 
However, you have to store the complete task state on each checkpoint, which 
becomes increasingly expensive as task state grows. Unfortunately, many 
non-trivial use cases for joins and aggregation have large amounts of state 
&mdash; often many gigabytes. This makes full dumps of the state impractical.
+
+Some academic systems produce *diffs* in addition to full checkpoints, which 
are smaller if only some of the state has changed since the last checkpoint. 
[Storm's Trident abstraction](../comparisons/storm.html) similarly keeps an 
in-memory cache of state, and periodically writes any changes to a remote store 
such as Cassandra. However, this optimization only helps if most of the state 
remains unchanged. In some use cases, such as stream joins, it is normal to 
have a lot of churn in the state, so this technique essentially degrades to 
making a remote database request for every message (see below).
+
+#### Using an external store
+
+Another common pattern for stateful processing is to store the state in an 
external database or key-value store. Conventional database replication can be 
used to make that database fault-tolerant. The architecture looks something 
like this:
+
+![state-kv-store](/img/{{site.version}}/learn/documentation/container/stream_job_and_db.png)
+
+Samza allows this style of processing &mdash; there is nothing to stop you 
querying a remote database or service within your job. However, there are a few 
reasons why a remote database can be problematic for stateful stream processing:
+
+1. **Performance**: Making database queries over a network is slow and 
expensive. A Kafka stream can deliver hundreds of thousands or even millions of 
messages per second per CPU core to a stream processor, but if you need to make 
a remote request for every message you process, your throughput is likely to 
drop by 2-3 orders of magnitude. You can somewhat mitigate this with careful 
caching of reads and batching of writes, but then you're back to the problems 
of checkpointing, discussed above.
+2. **Isolation**: If your database or service also serves requests to users, 
it can be dangerous to use the same database with a stream processor. A 
scalable stream processing system can run with very high throughput, and easily 
generates a huge amount of load (for example when catching up on a queue 
backlog). If you're not very careful, you may cause a denial-of-service attack 
on your own database, and cause problems for interactive requests from users.
+3. **Query Capabilities**: Many scalable databases expose very limited query 
interfaces (e.g. only supporting simple key-value lookups), because the 
equivalent of a "full table scan" or rich traversal would be too expensive. 
Stream processes are often less latency-sensitive, so richer query capabilities 
would be more feasible.
+4. **Correctness**: When a stream processor fails and needs to be restarted, 
how is the database state made consistent with the processing task? For this 
purpose, some frameworks such as [Storm](../comparisons/storm.html) attach 
metadata to database entries, but it needs to be handled carefully, otherwise 
the stream process generates incorrect output.
+5. **Reprocessing**: Sometimes it can be useful to re-run a stream process on 
a large amount of historical data, e.g. after updating your processing task's 
code. However, the issues above make this impractical for jobs that make 
external queries.
+
+### Local state in Samza
+
+Samza allows tasks to maintain state in a way that is different from the 
approaches described above:
+
+* The state is stored on disk, so the job can maintain more state than would 
fit in memory.
+* It is stored on the same machine as the processing task, to avoid the 
performance problems of making database queries over the network.
+* Each job has its own datastore, to avoid the isolation problems of a shared 
database (if you make an expensive query, it affects only the current task, 
nobody else).
+* Different storage engines can be plugged in, enabling rich query 
capabilities.
+* The state is continuously replicated, enabling fault tolerance without the 
problems of checkpointing large amounts of state.
+
+Imagine you take a remote database, partition it to match the number of tasks 
in the stream processing job, and co-locate each partition with its task. The 
result looks like this:
+
+![state-local](/img/{{site.version}}/learn/documentation/container/stateful_job.png)
+
+If a machine fails, all the tasks running on that machine and their database 
partitions are lost. In order to make them highly available, all writes to the 
database partition are replicated to a durable changelog (typically Kafka). 
Now, when a machine fails, we can restart the tasks on another machine, and 
consume this changelog in order to restore the contents of the database 
partition.
+
+Note that each task only has access to its own database partition, not to any 
other task's partition. This is important: when you scale out your job by 
giving it more computing resources, Samza needs to move tasks from one machine 
to another. By giving each task its own state, tasks can be relocated without 
affecting the job's operation. If necessary, you can repartition your streams 
so that all messages for a particular database partition are routed to the same 
task instance.
+
+[Log compaction](http://kafka.apache.org/documentation.html#compaction) runs 
in the background on the changelog topic, and ensures that the changelog does 
not grow indefinitely. If you overwrite the same value in the store many times, 
log compaction keeps only the most recent value, and throws away any old values 
in the log. If you delete an item from the store, log compaction also removes 
it from the log. With the right tuning, the changelog is not much bigger than 
the database itself.
+
+With this architecture, Samza allows tasks to maintain large amounts of 
fault-tolerant state, at a performance that is almost as good as a pure 
in-memory implementation. There are just a few limitations:
+
+* If you have some data that you want to share between tasks (across partition 
boundaries), you need to go to some additional effort to repartition and 
distribute the data. Each task will need its own copy of the data, so this may 
use more space overall.
+* When a container is restarted, it can take some time to restore the data in 
all of its state partitions. The time depends on the amount of data, the 
storage engine, your access patterns, and other factors. As a rule of thumb, 
50&nbsp;MB/sec is a reasonable restore rate to expect.
+
+Nothing prevents you from using an external database if you want to, but for 
many use cases, Samza's local state is a powerful tool for enabling stateful 
stream processing.
+
+### Key-value storage
+
+Any storage engine can be plugged into Samza, as described below. Out of the 
box, Samza ships with a key-value store implementation that is built on 
[LevelDB](https://code.google.com/p/leveldb) using a [JNI 
API](https://github.com/fusesource/leveldbjni).
+
+LevelDB has several nice properties. Its memory allocation is outside of the 
Java heap, which makes it more memory-efficient and less prone to garbage 
collection pauses than a Java-based storage engine. It is very fast for small 
datasets that fit in memory; datasets larger than memory are slower but still 
possible. It is 
[log-structured](http://www.igvita.com/2012/02/06/sstable-and-log-structured-storage-leveldb/),
 allowing very fast writes. It also includes support for block compression, 
which helps to reduce I/O and memory usage.
+
+Samza includes an additional in-memory caching layer in front of LevelDB, 
which avoids the cost of deserialization for frequently-accessed objects and 
batches writes. If the same key is updated multiple times in quick succession, 
the batching coalesces those updates into a single write. The writes are 
flushed to the changelog when a task [commits](checkpointing.html).
+
+To use a key-value store in your job, add the following to your job config:
+
+{% highlight jproperties %}
+# Use the key-value store implementation for a store called "my-store"
+stores.my-store.factory=org.apache.samza.storage.kv.KeyValueStorageEngineFactory
+
+# Use the Kafka topic "my-store-changelog" as the changelog stream for this 
store.
+# This enables automatic recovery of the store after a failure. If you don't
+# configure this, no changelog stream will be generated.
+stores.my-store.changelog=kafka.my-store-changelog
+
+# Encode keys and values in the store as UTF-8 strings.
+serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory
+stores.my-store.key.serde=string
+stores.my-store.msg.serde=string
+{% endhighlight %}
+
+See the [serialization section](serialization.html) for more information on 
the *serde* options.
+
+Here is a simple example that writes every incoming message to the store:
+
+{% highlight java %}
+public class MyStatefulTask implements StreamTask, InitableTask {
+  private KeyValueStore<String, String> store;
+
+  public void init(Config config, TaskContext context) {
+    this.store = (KeyValueStore<String, String>) context.getStore("my-store");
+  }
+
+  public void process(IncomingMessageEnvelope envelope,
+                      MessageCollector collector,
+                      TaskCoordinator coordinator) {
+    store.put((String) envelope.getKey(), (String) envelope.getMessage());
+  }
+}
+{% endhighlight %}
+
+Here is the complete key-value store API:
+
+{% highlight java %}
+public interface KeyValueStore<K, V> {
+  V get(K key);
+  void put(K key, V value);
+  void putAll(List<Entry<K,V>> entries);
+  void delete(K key);
+  KeyValueIterator<K,V> range(K from, K to);
+  KeyValueIterator<K,V> all();
+}
+{% endhighlight %}
+
+Additional configuration properties for the key-value store are documented in 
the [configuration reference](../jobs/configuration-table.html#keyvalue).
+
+### Implementing common use cases with the key-value store
+
+Earlier in this section we discussed some example use cases for stateful 
stream processing. Let's look at how each of these could be implemented using a 
key-value storage engine such as Samza's LevelDB.
+
+#### Windowed aggregation
+
+*Example: Counting the number of page views for each user per hour*
+
+Implementation: You need two processing stages.
+
+1. The first one re-partitions the input data by user ID, so that all the 
events for a particular user are routed to the same stream task. If the input 
stream is already partitioned by user ID, you can skip this.
+2. The second stage does the counting, using a key-value store that maps a 
user ID to the running count. For each new event, the job reads the current 
count for the appropriate user from the store, increments it, and writes it 
back. When the window is complete (e.g. at the end of an hour), the job 
iterates over the contents of the store and emits the aggregates to an output 
stream.
+
+Note that this job effectively pauses at the hour mark to output its results. 
This is totally fine for Samza, as scanning over the contents of the key-value 
store is quite fast. The input stream is buffered while the job is doing this 
hourly work.
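+
+For illustration, here is a rough sketch of the second (counting) stage, written in the style of the key-value store example above (imports are omitted, as in that example). It reuses the "my-store" store configured earlier, so counts are kept as strings to match the string serde; the output topic name and the assumption that the input key is the user ID are made up for this sketch:
+
+{% highlight java %}
+public class PageViewCounterTask implements StreamTask, WindowableTask, InitableTask {
+  // Assumed output topic for the per-window aggregates.
+  private static final SystemStream OUTPUT_STREAM =
+    new SystemStream("kafka", "page-views-per-hour");
+
+  private KeyValueStore<String, String> counts;
+
+  public void init(Config config, TaskContext context) {
+    this.counts = (KeyValueStore<String, String>) context.getStore("my-store");
+  }
+
+  public void process(IncomingMessageEnvelope envelope,
+                      MessageCollector collector,
+                      TaskCoordinator coordinator) {
+    // The input is assumed to be partitioned by user ID, with the ID as the key.
+    String userId = (String) envelope.getKey();
+    String current = counts.get(userId);
+    long count = (current == null) ? 1 : Long.parseLong(current) + 1;
+    counts.put(userId, Long.toString(count));
+  }
+
+  public void window(MessageCollector collector,
+                     TaskCoordinator coordinator) {
+    // Emit every counter, then reset the store for the next window.
+    List<String> emitted = new ArrayList<String>();
+    KeyValueIterator<String, String> entries = counts.all();
+    while (entries.hasNext()) {
+      Entry<String, String> entry = entries.next();
+      collector.send(new OutgoingMessageEnvelope(OUTPUT_STREAM, entry.getKey(), entry.getValue()));
+      emitted.add(entry.getKey());
+    }
+    entries.close();
+    for (String userId : emitted) {
+      counts.delete(userId);
+    }
+  }
+}
+{% endhighlight %}
+
+Setting task.window.ms (see [windowing](windowing.html)) to one hour would cause Samza to call window() roughly once per hour; note that these intervals are not aligned to wall-clock hour boundaries.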
+
+#### Table-table join
+
+*Example: Join a table of user profiles to a table of user settings by 
user\_id and emit the joined stream*
+
+Implementation: The job subscribes to the change streams for the user profiles 
database and the user settings database, both partitioned by user\_id. The job 
keeps a key-value store keyed by user\_id, which contains the latest profile 
record and the latest settings record for each user\_id. When a new event comes 
in from either stream, the job looks up the current value in its store, updates 
the appropriate fields (depending on whether it was a profile update or a 
settings update), and writes back the new joined record to the store. The 
changelog of the store doubles as the output stream of the task.
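+
+A rough sketch of such a join task is shown below (imports omitted, in the style of the earlier examples). The changelog stream names, the store name, and the JoinedRecord class with its serde are assumptions for illustration; the real types depend on how your database changelogs are encoded:
+
+{% highlight java %}
+public class ProfileSettingsJoinTask implements StreamTask, InitableTask {
+  private KeyValueStore<String, JoinedRecord> joined;
+
+  public void init(Config config, TaskContext context) {
+    this.joined = (KeyValueStore<String, JoinedRecord>) context.getStore("profile-settings-join");
+  }
+
+  public void process(IncomingMessageEnvelope envelope,
+                      MessageCollector collector,
+                      TaskCoordinator coordinator) {
+    // Both input streams are partitioned by user_id, which is the message key.
+    String userId = (String) envelope.getKey();
+    JoinedRecord record = joined.get(userId);
+    if (record == null) {
+      record = new JoinedRecord(userId);
+    }
+
+    // Update whichever side of the join this message came from.
+    String stream = envelope.getSystemStreamPartition().getStream();
+    if ("user-profile-changes".equals(stream)) {
+      record.setProfile(envelope.getMessage());
+    } else if ("user-settings-changes".equals(stream)) {
+      record.setSettings(envelope.getMessage());
+    }
+
+    // Writing the joined record back to the store also writes it to the store's
+    // changelog, which doubles as the output stream of this task.
+    joined.put(userId, record);
+  }
+}
+{% endhighlight %}
+
+Note that no explicit collector.send() call is needed in this sketch: downstream jobs can consume the joined records from the store's changelog topic.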
+
+#### Stream-table join
+
+*Example: Augment a stream of page view events with the user's ZIP code 
(perhaps to allow aggregation by zip code in a later stage)*
+
+Implementation: The job subscribes to the stream of user profile updates and 
the stream of page view events. Both streams must be partitioned by user\_id. 
The job maintains a key-value store where the key is the user\_id and the value 
is the user's ZIP code. Every time the job receives a profile update, it 
extracts the user's new ZIP code from the profile update and writes it to the 
store. Every time it receives a page view event, it reads the ZIP code for that 
user from the store, and emits the page view event with an added ZIP code field.
+
+If the next stage needs to aggregate by ZIP code, the ZIP code can be used as 
the partitioning key of the job's output stream. That ensures that all the 
events for the same ZIP code are sent to the same stream partition.
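+
+Sketched in code, the task might look roughly like this (imports omitted). The stream names, the store name and the two helper methods are assumptions made for this example; how you extract the ZIP code and add it to the event depends on your message schema and serdes:
+
+{% highlight java %}
+public class PageViewZipCodeJoinTask implements StreamTask, InitableTask {
+  // Assumed output topic, keyed by ZIP code so the next stage can aggregate by it.
+  private static final SystemStream OUTPUT_STREAM =
+    new SystemStream("kafka", "page-views-with-zip");
+
+  private KeyValueStore<String, String> zipCodes;  // user_id -> ZIP code
+
+  public void init(Config config, TaskContext context) {
+    this.zipCodes = (KeyValueStore<String, String>) context.getStore("zip-codes");
+  }
+
+  public void process(IncomingMessageEnvelope envelope,
+                      MessageCollector collector,
+                      TaskCoordinator coordinator) {
+    String userId = (String) envelope.getKey();
+    String stream = envelope.getSystemStreamPartition().getStream();
+
+    if ("user-profile-updates".equals(stream)) {
+      // Remember the latest ZIP code for this user.
+      zipCodes.put(userId, extractZipCode(envelope.getMessage()));
+    } else {
+      // Page view event: look up the ZIP code and emit the augmented event.
+      String zip = zipCodes.get(userId);
+      collector.send(new OutgoingMessageEnvelope(OUTPUT_STREAM, zip,
+          addZipCodeField(envelope.getMessage(), zip)));
+    }
+  }
+
+  // Placeholders: the implementations depend on your message schema.
+  private String extractZipCode(Object profileUpdate) { /* ... */ return null; }
+  private Object addZipCodeField(Object pageView, String zip) { /* ... */ return pageView; }
+}
+{% endhighlight %}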
+
+#### Stream-stream join
+
+*Example: Join a stream of ad clicks to a stream of ad impressions (to link 
the information on when the ad was shown to the information on when it was 
clicked)*
+
+In this example we assume that each impression of an ad has a unique 
identifier, e.g. a UUID, and that the same identifier is included in both the 
impression and the click events. This identifier is used as the join key.
+
+Implementation: Partition the ad click and ad impression streams by the 
impression ID or user ID (assuming that two events with the same impression ID 
always have the same user ID). The task keeps two stores, one containing click 
events and one containing impression events, using the impression ID as key for 
both stores. When the job receives a click event, it looks for the 
corresponding impression in the impression store, and vice versa. If a match is 
found, the joined pair is emitted and the entry is deleted. If no match is 
found, the event is written to the appropriate store. Periodically the job 
scans over both stores and deletes any old events that were not matched within 
the time window of the join.
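+
+Here is a rough sketch of that logic (imports omitted). The topic and store names are assumptions, the events are treated as opaque strings, and the joined output format is schema-dependent; expiring old, unmatched events would be done in a window() method as described above:
+
+{% highlight java %}
+public class AdClickImpressionJoinTask implements StreamTask, InitableTask {
+  private static final SystemStream OUTPUT_STREAM =
+    new SystemStream("kafka", "ad-click-impression-joins");
+
+  private KeyValueStore<String, String> impressions;  // impression ID -> impression event
+  private KeyValueStore<String, String> clicks;       // impression ID -> click event
+
+  public void init(Config config, TaskContext context) {
+    this.impressions = (KeyValueStore<String, String>) context.getStore("impressions");
+    this.clicks = (KeyValueStore<String, String>) context.getStore("clicks");
+  }
+
+  public void process(IncomingMessageEnvelope envelope,
+                      MessageCollector collector,
+                      TaskCoordinator coordinator) {
+    String impressionId = (String) envelope.getKey();
+    String event = (String) envelope.getMessage();
+    boolean isClick = "ad-clicks".equals(envelope.getSystemStreamPartition().getStream());
+
+    // Look in the opposite store for a matching event.
+    KeyValueStore<String, String> matchStore = isClick ? impressions : clicks;
+    KeyValueStore<String, String> bufferStore = isClick ? clicks : impressions;
+
+    String match = matchStore.get(impressionId);
+    if (match != null) {
+      // The counterpart has already been seen: emit the joined pair and clean up.
+      String impression = isClick ? match : event;
+      String click = isClick ? event : match;
+      collector.send(new OutgoingMessageEnvelope(OUTPUT_STREAM, impressionId,
+          joinEvents(impression, click)));
+      matchStore.delete(impressionId);
+    } else {
+      // No match yet: buffer the event until its counterpart arrives
+      // (or until it expires out of the join window).
+      bufferStore.put(impressionId, event);
+    }
+  }
+
+  // Placeholder: the joined output format depends on your schema.
+  private String joinEvents(String impression, String click) { return impression + click; }
+}
+{% endhighlight %}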
+
+### Other storage engines
+
+Samza's fault-tolerance mechanism (sending a local store's writes to a 
replicated changelog) is completely decoupled from the storage engine's data 
structures and query APIs. While a key-value storage engine is good for 
general-purpose processing, you can easily add your own storage engines for 
other types of queries by implementing the 
[StorageEngine](../api/javadocs/org/apache/samza/storage/StorageEngine.html) 
interface. Samza's model is especially amenable to embedded storage engines, 
which run as a library in the same process as the stream task. 
+
+Some ideas for other storage engines that could be useful: a persistent heap 
(for running top-N queries), [approximate 
algorithms](http://infolab.stanford.edu/~ullman/mmds/ch4.pdf) such as [bloom 
filters](http://en.wikipedia.org/wiki/Bloom_filter) and 
[hyperloglog](http://research.google.com/pubs/pub40671.html), or full-text 
indexes such as [Lucene](http://lucene.apache.org). (Patches accepted!)
+
+### Fault tolerance semantics with state
+
+As discussed in the section on [checkpointing](checkpointing.html), Samza 
currently only supports at-least-once delivery guarantees in the presence of 
failure (this is sometimes referred to as "guaranteed delivery"). This means 
that if a task fails, no messages are lost, but some messages may be 
redelivered.
+
+For many of the stateful processing use cases discussed above, this is not a 
problem: if the effect of a message on state is idempotent, it is safe for the 
same message to be processed more than once. For example, if the store contains 
the ZIP code for each user, then processing the same profile update twice has 
no effect, because the duplicate update does not change the ZIP code.
+
+However, for non-idempotent operations such as counting, at-least-once 
delivery guarantees can give incorrect results. If a Samza task fails and is 
restarted, it may double-count some messages that were processed shortly before 
the failure. We are planning to address this limitation in a future release of 
Samza.
+
+## [Metrics &raquo;](metrics.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/documentation/versioned/container/streams.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/container/streams.md 
b/docs/learn/documentation/versioned/container/streams.md
new file mode 100644
index 0000000..59e0855
--- /dev/null
+++ b/docs/learn/documentation/versioned/container/streams.md
@@ -0,0 +1,139 @@
+---
+layout: page
+title: Streams
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+The [samza container](samza-container.html) reads and writes messages using 
the 
[SystemConsumer](../api/javadocs/org/apache/samza/system/SystemConsumer.html) 
and 
[SystemProducer](../api/javadocs/org/apache/samza/system/SystemProducer.html) 
interfaces. You can integrate any message broker with Samza by implementing 
these two interfaces.
+
+{% highlight java %}
+public interface SystemConsumer {
+  void start();
+
+  void stop();
+
+  void register(
+      SystemStreamPartition systemStreamPartition,
+      String lastReadOffset);
+
+  List<IncomingMessageEnvelope> poll(
+      Map<SystemStreamPartition, Integer> systemStreamPartitions,
+      long timeout)
+    throws InterruptedException;
+}
+
+public class IncomingMessageEnvelope {
+  public Object getMessage() { ... }
+
+  public Object getKey() { ... }
+
+  public SystemStreamPartition getSystemStreamPartition() { ... }
+}
+
+public interface SystemProducer {
+  void start();
+
+  void stop();
+
+  void register(String source);
+
+  void send(String source, OutgoingMessageEnvelope envelope);
+
+  void flush(String source);
+}
+
+public class OutgoingMessageEnvelope {
+  ...
+  public Object getKey() { ... }
+
+  public Object getMessage() { ... }
+}
+{% endhighlight %}
+
+Out of the box, Samza supports Kafka (KafkaSystemConsumer and 
KafkaSystemProducer). However, any message bus system can be plugged in, as 
long as it can provide the semantics required by Samza, as described in the 
[javadoc](../api/javadocs/org/apache/samza/system/SystemConsumer.html).
+
+SystemConsumers and SystemProducers may read and write messages of any data 
type. It's ok if they only support byte arrays &mdash; Samza has a separate 
[serialization layer](serialization.html) which converts to and from objects 
that application code can use. Samza does not prescribe any particular data 
model or serialization format.
+
+The job configuration file can include properties that are specific to a 
particular consumer and producer implementation. For example, the configuration 
would typically indicate the hostname and port of the message broker to use, 
and perhaps connection options.
+
+### How streams are processed
+
+If a job is consuming messages from more than one input stream, and all input 
streams have messages available, messages are processed in a round robin 
fashion by default. For example, if a job is consuming AdImpressionEvent and 
AdClickEvent, the task instance's process() method is called with a message 
from AdImpressionEvent, then a message from AdClickEvent, then another message 
from AdImpressionEvent, ... and continues to alternate between the two.
+
+If one of the input streams has no new messages available (the most recent 
message has already been consumed), that stream is skipped, and the job 
continues to consume from the other inputs. It continues to check for new 
messages becoming available.
+
+#### MessageChooser
+
+When a Samza container has several incoming messages on different stream 
partitions, how does it decide which to process first? The behavior is 
determined by a 
[MessageChooser](../api/javadocs/org/apache/samza/system/chooser/MessageChooser.html).
 The default chooser is RoundRobinChooser, but you can override it by 
implementing a custom chooser.
+
+To plug in your own message chooser, you need to implement the 
[MessageChooserFactory](../api/javadocs/org/apache/samza/system/chooser/MessageChooserFactory.html)
 interface, and set the "task.chooser.class" configuration to the 
fully-qualified class name of your implementation:
+
+{% highlight jproperties %}
+task.chooser.class=com.example.samza.YourMessageChooserFactory
+{% endhighlight %}
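+
+As a rough illustration (not an authoritative implementation), the sketch below shows a chooser that always prefers messages from one particular stream when several are buffered. It assumes that MessageChooser and MessageChooserFactory expose register/update/choose methods roughly as described in the javadocs linked above &mdash; check those javadocs for the exact signatures &mdash; and omits imports as in the other examples. For the simple "prefer one stream" case, the built-in priority configuration described in the next section is usually sufficient; a custom chooser is mainly useful for more specialized policies:
+
+{% highlight java %}
+public class PreferredStreamChooserFactory implements MessageChooserFactory {
+  public MessageChooser getChooser(Config config, MetricsRegistry registry) {
+    // The preferred stream name is an assumption for this example; you would
+    // normally read it from the config.
+    return new PreferredStreamChooser("my-real-time-stream");
+  }
+}
+
+public class PreferredStreamChooser implements MessageChooser {
+  private final String preferredStream;
+  private final Queue<IncomingMessageEnvelope> preferred = new ArrayDeque<IncomingMessageEnvelope>();
+  private final Queue<IncomingMessageEnvelope> others = new ArrayDeque<IncomingMessageEnvelope>();
+
+  public PreferredStreamChooser(String preferredStream) {
+    this.preferredStream = preferredStream;
+  }
+
+  public void start() {}
+  public void stop() {}
+  public void register(SystemStreamPartition systemStreamPartition, String offset) {}
+
+  // The container hands buffered messages to the chooser via update(), and
+  // asks for the next message to process via choose().
+  public void update(IncomingMessageEnvelope envelope) {
+    if (preferredStream.equals(envelope.getSystemStreamPartition().getStream())) {
+      preferred.add(envelope);
+    } else {
+      others.add(envelope);
+    }
+  }
+
+  public IncomingMessageEnvelope choose() {
+    if (!preferred.isEmpty()) {
+      return preferred.poll();
+    }
+    return others.poll();  // may be null if nothing is currently buffered
+  }
+}
+{% endhighlight %}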
+
+#### Prioritizing input streams
+
+There are certain times when messages from one stream should be processed with 
higher priority than messages from another stream. For example, some Samza jobs 
consume two streams: one stream is fed by a real-time system and the other 
stream is fed by a batch system. In this case, it's useful to prioritize the 
real-time stream over the batch stream, so that the real-time processing 
doesn't slow down if there is a sudden burst of data on the batch stream.
+
+Samza provides a mechanism to prioritize one stream over another by setting 
this configuration parameter: 
systems.&lt;system&gt;.streams.&lt;stream&gt;.samza.priority=&lt;number&gt;. 
For example:
+
+{% highlight jproperties %}
+systems.kafka.streams.my-real-time-stream.samza.priority=2
+systems.kafka.streams.my-batch-stream.samza.priority=1
+{% endhighlight %}
+
+This declares that my-real-time-stream's messages should be processed with 
higher priority than my-batch-stream's messages. If my-real-time-stream has any 
messages available, they are processed first. Only if there are no messages 
currently waiting on my-real-time-stream does the Samza job continue processing my-batch-stream.
+
+Each priority level gets its own MessageChooser. It is valid to define two 
streams with the same priority. If messages are available from two streams at 
the same priority level, it's up to the MessageChooser for that priority level 
to decide which message should be processed first.
+
+It's also valid to only define priorities for some streams. All 
non-prioritized streams are treated as the lowest priority, and share a 
MessageChooser.
+
+#### Bootstrapping
+
+Sometimes, a Samza job needs to fully consume a stream (from offset 0 up to 
the most recent message) before it processes messages from any other stream. 
This is useful in situations where the stream contains some prerequisite data 
that the job needs, and it doesn't make sense to process messages from other 
streams until the job has loaded that prerequisite data. Samza supports this 
use case with *bootstrap streams*.
+
+A bootstrap stream seems similar to a stream with a high priority, but is 
subtly different. Before allowing any other stream to be processed, a bootstrap 
stream waits for the consumer to explicitly confirm that the stream has been 
fully consumed. Until then, the bootstrap stream is the exclusive input to the 
job: even if a network issue or some other factor causes the bootstrap stream 
consumer to slow down, other inputs can't sneak their messages in.
+
+Another difference between a bootstrap stream and a high-priority stream is 
that the bootstrap stream's special treatment is temporary: when it has been 
fully consumed (we say it has "caught up"), its priority drops to be the same 
as all the other input streams.
+
+To configure a stream called "my-bootstrap-stream" to be a fully-consumed 
bootstrap stream, use the following settings:
+
+{% highlight jproperties %}
+systems.kafka.streams.my-bootstrap-stream.samza.bootstrap=true
+systems.kafka.streams.my-bootstrap-stream.samza.reset.offset=true
+systems.kafka.streams.my-bootstrap-stream.samza.offset.default=oldest
+{% endhighlight %}
+
+The bootstrap=true parameter enables the bootstrap behavior (prioritization 
over other streams). The combination of reset.offset=true and 
offset.default=oldest tells Samza to always start reading the stream from the 
oldest offset, every time a container starts up (rather than starting to read 
from the most recent checkpoint).
+
+It is valid to define multiple bootstrap streams. In this case, the order in 
which they are bootstrapped is determined by the priority.
+
+#### Batching
+
+In some cases, you can improve performance by consuming several messages from 
the same stream partition in sequence. Samza supports this mode of operation, 
called *batching*.
+
+For example, if you want to read 100 messages in a row from each stream 
partition (regardless of the MessageChooser), you can use this configuration 
parameter:
+
+{% highlight jproperties %}
+task.consumer.batch.size=100
+{% endhighlight %}
+
+With this setting, Samza tries to read a message from the most recently used 
[SystemStreamPartition](../api/javadocs/org/apache/samza/system/SystemStreamPartition.html).
 This behavior continues either until no more messages are available for that 
SystemStreamPartition, or until the batch size has been reached. When that 
happens, Samza defers to the MessageChooser to determine the next message to 
process. It then continues consuming from the chosen message's SystemStreamPartition until the batch size is reached again.
+
+## [Serialization &raquo;](serialization.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/documentation/versioned/container/windowing.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/container/windowing.md 
b/docs/learn/documentation/versioned/container/windowing.md
new file mode 100644
index 0000000..b10e5d4
--- /dev/null
+++ b/docs/learn/documentation/versioned/container/windowing.md
@@ -0,0 +1,61 @@
+---
+layout: page
+title: Windowing
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+Sometimes a stream processing job needs to do something in regular time 
intervals, regardless of how many incoming messages the job is processing. For 
example, say you want to report the number of page views per minute. To do 
this, you increment a counter every time you see a page view event. Once per 
minute, you send the current counter value to an output stream and reset the 
counter to zero.
+
+Samza's *windowing* feature provides a way for tasks to do something in 
regular time intervals, for example once per minute. To enable windowing, you 
just need to set one property in your job configuration:
+
+{% highlight jproperties %}
+# Call the window() method every 60 seconds
+task.window.ms=60000
+{% endhighlight %}
+
+Next, your stream task needs to implement the 
[WindowableTask](../api/javadocs/org/apache/samza/task/WindowableTask.html) 
interface. This interface defines a window() method which is called by Samza in 
the regular interval that you configured.
+
+For example, this is how you would implement a basic per-minute event counter:
+
+{% highlight java %}
+public class EventCounterTask implements StreamTask, WindowableTask {
+
+  public static final SystemStream OUTPUT_STREAM =
+    new SystemStream("kafka", "events-per-minute");
+
+  private int eventsSeen = 0;
+
+  public void process(IncomingMessageEnvelope envelope,
+                      MessageCollector collector,
+                      TaskCoordinator coordinator) {
+    eventsSeen++;
+  }
+
+  public void window(MessageCollector collector,
+                     TaskCoordinator coordinator) {
+    collector.send(new OutgoingMessageEnvelope(OUTPUT_STREAM, eventsSeen));
+    eventsSeen = 0;
+  }
+}
+{% endhighlight %}
+
+If you need to send messages to output streams, you can use the 
[MessageCollector](../api/javadocs/org/apache/samza/task/MessageCollector.html) 
object passed to the window() method. Please only use that MessageCollector 
object for sending messages, and don't use it outside of the call to window().
+
+Note that Samza uses [single-threaded execution](event-loop.html), so the 
window() call can never happen concurrently with a process() call. This has the 
advantage that you don't need to worry about thread safety in your code (no 
need to synchronize anything), but the downside that the window() call may be 
delayed if your process() method takes a long time to return.
+
+## [Event Loop &raquo;](event-loop.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/documentation/versioned/index.html
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/index.html 
b/docs/learn/documentation/versioned/index.html
new file mode 100644
index 0000000..626631b
--- /dev/null
+++ b/docs/learn/documentation/versioned/index.html
@@ -0,0 +1,92 @@
+---
+layout: page
+title: Documentation
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+<h4>Introduction</h4>
+
+<ul class="documentation-list">
+  <li><a href="introduction/background.html">Background</a></li>
+  <li><a href="introduction/concepts.html">Concepts</a></li>
+  <li><a href="introduction/architecture.html">Architecture</a></li>
+</ul>
+
+<h4>Comparisons</h4>
+
+<ul class="documentation-list">
+  <li><a href="comparisons/introduction.html">Introduction</a></li>
+  <li><a href="comparisons/mupd8.html">MUPD8</a></li>
+  <li><a href="comparisons/storm.html">Storm</a></li>
+  <li><a href="comparisons/spark-streaming.html">Spark Streaming</a></li>
+<!-- TODO comparisons pages
+  <li><a href="comparisons/aurora.html">Aurora</a></li>
+  <li><a href="comparisons/jms.html">JMS</a></li>
+  <li><a href="comparisons/s4.html">S4</a></li>
+-->
+</ul>
+
+<h4>API</h4>
+
+<ul class="documentation-list">
+  <li><a href="api/overview.html">Overview</a></li>
+  <li><a href="api/javadocs">Javadocs</a></li>
+</ul>
+
+<h4>Container</h4>
+
+<ul class="documentation-list">
+  <li><a href="container/samza-container.html">SamzaContainer</a></li>
+  <li><a href="container/streams.html">Streams</a></li>
+  <li><a href="container/serialization.html">Serialization</a></li>
+  <li><a href="container/checkpointing.html">Checkpointing</a></li>
+  <li><a href="container/state-management.html">State Management</a></li>
+  <li><a href="container/metrics.html">Metrics</a></li>
+  <li><a href="container/windowing.html">Windowing</a></li>
+  <li><a href="container/event-loop.html">Event Loop</a></li>
+  <li><a href="container/jmx.html">JMX</a></li>
+</ul>
+
+<h4>Jobs</h4>
+
+<ul class="documentation-list">
+  <li><a href="jobs/job-runner.html">JobRunner</a></li>
+  <li><a href="jobs/configuration.html">Configuration</a></li>
+  <li><a href="jobs/packaging.html">Packaging</a></li>
+  <li><a href="jobs/yarn-jobs.html">YARN Jobs</a></li>
+  <li><a href="jobs/logging.html">Logging</a></li>
+  <li><a href="jobs/reprocessing.html">Reprocessing</a></li>
+</ul>
+
+<h4>YARN</h4>
+
+<ul class="documentation-list">
+  <li><a href="yarn/application-master.html">Application Master</a></li>
+  <li><a href="yarn/isolation.html">Isolation</a></li>
+<!-- TODO write yarn pages
+  <li><a href="">Fault Tolerance</a></li>
+  <li><a href="">Security</a></li>
+-->
+</ul>
+
+<h4>Operations</h4>
+
+<ul class="documentation-list">
+  <li><a href="operations/security.html">Security</a></li>
+  <li><a href="operations/kafka.html">Kafka</a></li>
+</ul>

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/documentation/versioned/introduction/architecture.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/introduction/architecture.md 
b/docs/learn/documentation/versioned/introduction/architecture.md
new file mode 100644
index 0000000..1a160dd
--- /dev/null
+++ b/docs/learn/documentation/versioned/introduction/architecture.md
@@ -0,0 +1,110 @@
+---
+layout: page
+title: Architecture
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+Samza is made up of three layers:
+
+1. A streaming layer.
+2. An execution layer.
+3. A processing layer.
+
+Samza provides out of the box support for all three layers.
+
+1. **Streaming:** [Kafka](http://kafka.apache.org/)
+2. **Execution:** 
[YARN](http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html)
+3. **Processing:** [Samza API](../api/overview.html)
+
+These three pieces fit together to form Samza:
+
+![diagram-medium](/img/{{site.version}}/learn/documentation/introduction/samza-ecosystem.png)
+
+This architecture follows a similar pattern to Hadoop (which also uses YARN as its execution layer, HDFS for storage, and MapReduce as its processing API):
+
+![diagram-medium](/img/{{site.version}}/learn/documentation/introduction/samza-hadoop.png)
+
+Before going in-depth on each of these three layers, it should be noted that 
Samza's support is not limited to Kafka and YARN. Both Samza's execution and 
streaming layers are pluggable, and allow developers to implement alternatives 
if they prefer.
+
+### Kafka
+
+[Kafka](http://kafka.apache.org/) is a distributed pub/sub and message 
queueing system that provides at-least-once messaging guarantees (i.e. the 
system guarantees that no messages are lost, but in certain fault scenarios, a 
consumer might receive the same message more than once), and highly available 
partitions (i.e. a stream's partitions continue to be available even if a 
machine goes down).
+
+In Kafka, each stream is called a *topic*. Each topic is partitioned and 
replicated across multiple machines called *brokers*. When a *producer* sends a 
message to a topic, it provides a key, which is used to determine which 
partition the message should be sent to. The Kafka brokers receive and store 
the messages that the producer sends. Kafka *consumers* can then read from a 
topic by subscribing to messages on all partitions of a topic.
+
+Kafka has some interesting properties: 
+
+* All messages with the same key are guaranteed to be in the same topic 
partition. This means that if you wish to read all messages for a specific user 
ID, you only have to read the messages from the partition that contains the 
user ID, not the whole topic (assuming the user ID is used as key).
+* A topic partition is a sequence of messages in order of arrival, so you can 
reference any message in the partition using a monotonically increasing 
*offset* (like an index into an array). This means that the broker doesn't need 
to keep track of which messages have been seen by a particular consumer &mdash; 
the consumer can keep track itself by storing the offset of the last message it 
has processed. It then knows that every message with a lower offset than the 
current offset has already been processed; every message with a higher offset 
has not yet been processed.
+
+For more details on Kafka, see Kafka's 
[documentation](http://kafka.apache.org/documentation.html) pages.
+
+### YARN
+
+[YARN](http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html)
 (Yet Another Resource Negotiator) is Hadoop's next-generation cluster 
scheduler. It allows you to allocate a number of *containers* (processes) in a 
cluster of machines, and execute arbitrary commands on them.
+
+When an application interacts with YARN, it looks something like this:
+
+1. **Application**: I want to run command X on two machines with 512MB memory.
+2. **YARN**: Cool, where's your code?
+3. **Application**: http://path.to.host/jobs/download/my.tgz
+4. **YARN**: I'm running your job on node-1.grid and node-2.grid.
+
+Samza uses YARN to manage deployment, fault tolerance, logging, resource 
isolation, security, and locality. A brief overview of YARN is below; see [this 
page from 
Hortonworks](http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/)
 for a much better overview.
+
+#### YARN Architecture
+
+YARN has three important pieces: a *ResourceManager*, a *NodeManager*, and an 
*ApplicationMaster*. In a YARN grid, every machine runs a NodeManager, which is 
responsible for launching processes on that machine. A ResourceManager talks to 
all of the NodeManagers to tell them what to run. Applications, in turn, talk 
to the ResourceManager when they wish to run something on the cluster. The 
third piece, the ApplicationMaster, is actually application-specific code that 
runs in the YARN cluster. It's responsible for managing the application's 
workload, asking for containers (usually UNIX processes), and handling 
notifications when one of its containers fails.
+
+#### Samza and YARN
+
+Samza provides a YARN ApplicationMaster and a YARN job runner out of the box. 
The integration between Samza and YARN is outlined in the following diagram 
(different colors indicate different host machines):
+
+![diagram-small](/img/{{site.version}}/learn/documentation/introduction/samza-yarn-integration.png)
+
+The Samza client talks to the YARN RM when it wants to start a new Samza job. 
The YARN RM talks to a YARN NM to allocate space on the cluster for Samza's 
ApplicationMaster. Once the NM allocates space, it starts the Samza AM. After 
the Samza AM starts, it asks the YARN RM for one or more YARN containers to run 
[SamzaContainers](../container/samza-container.html). Again, the RM works with 
NMs to allocate space for the containers. Once the space has been allocated, 
the NMs start the Samza containers.
+
+### Samza
+
+Samza uses YARN and Kafka to provide a framework for stage-wise stream 
processing and partitioning. Everything, put together, looks like this 
(different colors indicate different host machines):
+
+![diagram-small](/img/{{site.version}}/learn/documentation/introduction/samza-yarn-kafka-integration.png)
+
+The Samza client uses YARN to run a Samza job: YARN starts and supervises one 
or more [SamzaContainers](../container/samza-container.html), and your 
processing code (using the [StreamTask](../api/overview.html) API) runs inside 
those containers. The input and output for the Samza StreamTasks come from 
Kafka brokers that are (usually) co-located on the same machines as the YARN 
NMs.
+
+### Example
+
+Let's take a look at a real example: suppose we want to count the number of 
page views. In SQL, you would write something like:
+
+{% highlight sql %}
+SELECT user_id, COUNT(*) FROM PageViewEvent GROUP BY user_id
+{% endhighlight %}
+
+Although Samza doesn't support SQL right now, the idea is the same. Two jobs 
are required to calculate this query: one to group messages by user ID, and the 
other to do the counting.
+
+In the first job, the grouping is done by sending all messages with the same 
user ID to the same partition of an intermediate topic. You can do this by 
using the user ID as key of the messages that are emitted by the first job, and 
this key is mapped to one of the intermediate topic's partitions (usually by 
taking a hash of the key mod the number of partitions). The second job consumes 
the intermediate topic. Each task in the second job consumes one partition of 
the intermediate topic, i.e. all the messages for a subset of user IDs. The 
task has a counter for each user ID in its partition, and the appropriate 
counter is incremented every time the task receives a message with a particular 
user ID.
+
+<img 
src="/img/{{site.version}}/learn/documentation/introduction/group-by-example.png"
 alt="Repartitioning for a GROUP BY" class="diagram-large">
+
+If you are familiar with Hadoop, you may recognize this as a Map/Reduce 
operation, where each record is associated with a particular key in the 
mappers, records with the same key are grouped together by the framework, and 
then counted in the reduce step. The difference between Hadoop and Samza is 
that Hadoop operates on a fixed input, whereas Samza works with unbounded 
streams of data.
+
+Kafka takes the messages emitted by the first job and buffers them on disk, 
distributed across multiple machines. This helps make the system 
fault-tolerant: if one machine fails, no messages are lost, because they have 
been replicated to other machines. And if the second job goes slow or stops 
consuming messages for any reason, the first job is unaffected: the disk buffer 
can absorb the backlog of messages from the first job until the second job 
catches up again.
+
+By partitioning topics, and by breaking a stream process down into jobs and 
parallel tasks that run on multiple machines, Samza scales to streams with very 
high message throughput. By using YARN and Kafka, Samza achieves 
fault-tolerance: if a process or machine fails, it is automatically restarted 
on another machine and continues processing messages from the point where it 
left off.
+
+## [Comparison Introduction &raquo;](../comparisons/introduction.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/documentation/versioned/introduction/background.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/introduction/background.md 
b/docs/learn/documentation/versioned/introduction/background.md
new file mode 100644
index 0000000..e09497b
--- /dev/null
+++ b/docs/learn/documentation/versioned/introduction/background.md
@@ -0,0 +1,71 @@
+---
+layout: page
+title: Background
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+This page provides some background about stream processing, describes what 
Samza is, and why it was built.
+
+### What is messaging?
+
+Messaging systems are a popular way of implementing near-realtime asynchronous 
computation. Messages can be added to a message queue (ActiveMQ, RabbitMQ), 
pub-sub system (Kestrel, Kafka), or log aggregation system (Flume, Scribe) when 
something happens. Downstream *consumers* read messages from these systems, and 
process them or take actions based on the message contents.
+
+Suppose you have a website, and every time someone loads a page, you send a 
"user viewed page" event to a messaging system. You might then have consumers 
which do any of the following:
+
+* Store the message in Hadoop for future analysis
+* Count page views and update a dashboard
+* Trigger an alert if a page view fails
+* Send an email notification to another user
+* Join the page view event with the user's profile, and send the message back 
to the messaging system
+
+A messaging system lets you decouple all of this work from the actual web page 
serving.
+
+### What is stream processing?
+
+A messaging system is a fairly low-level piece of infrastructure&mdash;it 
stores messages and waits for consumers to consume them. When you start writing 
code that produces or consumes messages, you quickly find that there are a lot 
of tricky problems that have to be solved in the processing layer. Samza aims 
to help with these problems.
+
+Consider the counting example, above (count page views and update a 
dashboard). What happens when the machine that your consumer is running on 
fails, and your current counter values are lost? How do you recover? Where 
should the processor be run when it restarts? What if the underlying messaging 
system sends you the same message twice, or loses a message? (Unless you are 
careful, your counts will be incorrect.) What if you want to count page views 
grouped by the page URL? How do you distribute the computation across multiple 
machines if it's too much for a single machine to handle?
+
+Stream processing is a higher level of abstraction on top of messaging 
systems, and it's meant to address precisely this category of problems.
+
+### Samza
+
+Samza is a stream processing framework with the following features:
+
+* **Simple API:** Unlike most low-level messaging system APIs, Samza provides 
a very simple callback-based "process message" API comparable to MapReduce.
+* **Managed state:** Samza manages snapshotting and restoration of a stream 
processor's state. When the processor is restarted, Samza restores its state to 
a consistent snapshot. Samza is built to handle large amounts of state (many 
gigabytes per partition).
+* **Fault tolerance:** Whenever a machine in the cluster fails, Samza works 
with YARN to transparently migrate your tasks to another machine.
+* **Durability:** Samza uses Kafka to guarantee that messages are processed in 
the order they were written to a partition, and that no messages are ever lost.
+* **Scalability:** Samza is partitioned and distributed at every level. Kafka 
provides ordered, partitioned, replayable, fault-tolerant streams. YARN 
provides a distributed environment for Samza containers to run in.
+* **Pluggable:** Though Samza works out of the box with Kafka and YARN, Samza 
provides a pluggable API that lets you run Samza with other messaging systems 
and execution environments.
+* **Processor isolation:** Samza works with Apache YARN, which supports 
Hadoop's security model, and resource isolation through Linux CGroups.
+
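+To make the "Simple API" bullet above concrete, here is a minimal sketch of the callback-based API. The `StreamTask` interface and the envelope/collector classes are Samza's; the class name, the output stream name, and the assumption that messages are plain strings are made up for this example.
+
+{% highlight java %}
+import org.apache.samza.system.IncomingMessageEnvelope;
+import org.apache.samza.system.OutgoingMessageEnvelope;
+import org.apache.samza.system.SystemStream;
+import org.apache.samza.task.MessageCollector;
+import org.apache.samza.task.StreamTask;
+import org.apache.samza.task.TaskCoordinator;
+
+public class FilterPageViewsTask implements StreamTask {
+  // Hypothetical output stream: system "kafka", stream "filtered-page-views".
+  private static final SystemStream OUTPUT = new SystemStream("kafka", "filtered-page-views");
+
+  public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
+    // Samza invokes this callback once for each incoming message.
+    String pageView = (String) envelope.getMessage();
+    // Drop internal traffic, forward everything else to the output stream.
+    if (!pageView.contains("/internal/")) {
+      collector.send(new OutgoingMessageEnvelope(OUTPUT, pageView));
+    }
+  }
+}
+{% endhighlight %}
+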
+### Alternatives
+
+The available open source stream processing systems are actually quite young, 
and no single system offers a complete solution. New problems in this area 
include: how a stream processor's state should be managed, whether or not a 
stream should be buffered remotely on disk, what to do when duplicate messages 
are received or messages are lost, and how to model underlying messaging 
systems.
+
+Samza's main differentiators are:
+
+* Samza supports fault-tolerant local state. State can be thought of as tables 
that are split up and co-located with the processing tasks. State is itself 
modeled as a stream. If the local state is lost due to machine failure, the 
state stream is replayed to restore it.
+* Streams are ordered, partitioned, replayable, and fault tolerant.
+* YARN is used for processor isolation, security, and fault tolerance.
+* Jobs are decoupled: if one job goes slow and builds up a backlog of 
unprocessed messages, the rest of the system is not affected.
+
+For a more in-depth discussion on Samza, and how it relates to other stream 
processing systems, have a look at Samza's 
[Comparisons](../comparisons/introduction.html) documentation.
+
+## [Concepts &raquo;](concepts.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/documentation/versioned/introduction/concepts.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/introduction/concepts.md 
b/docs/learn/documentation/versioned/introduction/concepts.md
new file mode 100644
index 0000000..25ef5ee
--- /dev/null
+++ b/docs/learn/documentation/versioned/introduction/concepts.md
@@ -0,0 +1,72 @@
+---
+layout: page
+title: Concepts
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+This page gives an introduction to the high-level concepts in Samza.
+
+### Streams
+
+Samza processes *streams*. A stream is composed of immutable *messages* of a 
similar type or category. For example, a stream could be all the clicks on a 
website, or all the updates to a particular database table, or all the logs 
produced by a service, or any other type of event data. Messages can be 
appended to a stream or read from a stream. A stream can have any number of 
*consumers*, and reading from a stream doesn't delete the message (so each 
message is effectively broadcast to all consumers). Messages can optionally 
have an associated key which is used for partitioning, which we'll talk about 
in a second.
+
+Samza supports pluggable *systems* that implement the stream abstraction: in 
[Kafka](https://kafka.apache.org/) a stream is a topic, in a database we might 
read a stream by consuming updates from a table, in Hadoop we might tail a 
directory of files in HDFS.
+
+![job](/img/{{site.version}}/learn/documentation/introduction/job.png)
+
+### Jobs
+
+A Samza *job* is code that performs a logical transformation on a set of input 
streams to append output messages to a set of output streams.
+
+If scalability were not a concern, streams and jobs would be all we need. 
However, in order to scale the throughput of the stream processor, we chop 
streams and jobs up into smaller units of parallelism: *partitions* and *tasks*.
+
+### Partitions
+
+Each stream is broken into one or more partitions. Each partition in the 
stream is a totally ordered sequence of messages.
+
+Each message in this sequence has an identifier called the *offset*, which is 
unique per partition. The offset can be a sequential integer, byte offset, or 
string depending on the underlying system implementation.
+
+When a message is appended to a stream, it is appended to only one of the 
stream's partitions. The assignment of the message to its partition is done 
with a key chosen by the writer. For example, if the user ID is used as the 
key, that ensures that all messages related to a particular user end up in the 
same partition.
+
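+The following sketch shows the idea behind key-based partition assignment: hash the key and take it modulo the number of partitions, so every message with the same key is appended to the same partition. This is an illustration of the concept only, not the actual partitioner of Samza or of any particular messaging system.
+
+{% highlight java %}
+public class KeyPartitioner {
+  /** Illustration only: map a message key to one of numPartitions partitions. */
+  public static int partitionFor(Object key, int numPartitions) {
+    // Mask the sign bit so the result is always in the range [0, numPartitions).
+    return (key.hashCode() & 0x7fffffff) % numPartitions;
+  }
+
+  public static void main(String[] args) {
+    // Every message keyed by the hypothetical user ID "user-42" lands in the
+    // same partition of a hypothetical 8-partition stream.
+    System.out.println(partitionFor("user-42", 8));
+  }
+}
+{% endhighlight %}
+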
+![stream](/img/{{site.version}}/learn/documentation/introduction/stream.png)
+
+### Tasks
+
+A job is scaled by breaking it into multiple *tasks*. The *task* is the unit 
of parallelism of the job, just as the partition is to the stream. Each task 
consumes data from one partition for each of the job's input streams.
+
+A task processes messages from each of its input partitions sequentially, in 
the order of message offset. There is no defined ordering across partitions. 
This allows each task to operate independently. The YARN scheduler assigns each 
task to a machine, so the job as a whole can be distributed across many 
machines.
+
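+As a small illustration of per-partition ordering, a task can inspect which partition each message came from and its offset within that partition. The class name below is hypothetical; the envelope accessors are Samza's.
+
+{% highlight java %}
+import org.apache.samza.system.IncomingMessageEnvelope;
+import org.apache.samza.task.MessageCollector;
+import org.apache.samza.task.StreamTask;
+import org.apache.samza.task.TaskCoordinator;
+
+public class LogOffsetsTask implements StreamTask {
+  public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
+    String stream = envelope.getSystemStreamPartition().getStream();
+    int partition = envelope.getSystemStreamPartition().getPartition().getPartitionId();
+    // Offsets increase monotonically within a single partition; there is no
+    // ordering relationship between offsets of different partitions.
+    System.out.println(stream + " partition " + partition + " offset " + envelope.getOffset());
+  }
+}
+{% endhighlight %}
+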
+The number of tasks in a job is determined by the number of input partitions 
(there cannot be more tasks than input partitions, or there would be some tasks 
with no input). However, you can change the computational resources assigned to 
the job (the amount of memory, number of CPU cores, etc.) to satisfy the job's 
needs. See notes on *containers* below.
+
+The assignment of partitions to tasks never changes: if a task is on a machine 
that fails, the task is restarted elsewhere, still consuming the same stream 
partitions.
+
+![job-detail](/img/{{site.version}}/learn/documentation/introduction/job_detail.png)
+
+### Dataflow Graphs
+
+We can compose multiple jobs to create a dataflow graph, where the nodes are 
streams containing data, and the edges are jobs performing transformations. 
This composition is done purely through the streams the jobs take as input and 
output. The jobs are otherwise totally decoupled: they need not be implemented 
in the same code base, and adding, removing, or restarting a downstream job 
will not impact an upstream job.
+
+These graphs are often acyclic&mdash;that is, data usually doesn't flow from a 
job, through other jobs, back to itself. However, it is possible to create 
cyclic graphs if you need to.
+
+<img src="/img/{{site.version}}/learn/documentation/introduction/dag.png" 
width="430" alt="Directed acyclic job graph">
+
+### Containers
+
+Partitions and tasks are both *logical* units of parallelism&mdash;they don't 
correspond to any particular assignment of computational resources (CPU, 
memory, disk space, etc). Containers are the unit of physical parallelism, and 
a container is essentially a Unix process (or Linux 
[cgroup](http://en.wikipedia.org/wiki/Cgroups)). Each container runs one or 
more tasks. The number of tasks is determined automatically from the number of 
partitions in the input and is fixed, but the number of containers (and the CPU 
and memory resources associated with them) is specified by the user at run time 
and can be changed at any time.
+
+## [Architecture &raquo;](architecture.html)
