http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/documentation/0.7.0/container/state-management.md ---------------------------------------------------------------------- diff --git a/docs/learn/documentation/0.7.0/container/state-management.md b/docs/learn/documentation/0.7.0/container/state-management.md deleted file mode 100644 index e54739c..0000000 --- a/docs/learn/documentation/0.7.0/container/state-management.md +++ /dev/null @@ -1,238 +0,0 @@ ---- -layout: page -title: State Management ---- -<!-- - Licensed to the Apache Software Foundation (ASF) under one or more - contributor license agreements. See the NOTICE file distributed with - this work for additional information regarding copyright ownership. - The ASF licenses this file to You under the Apache License, Version 2.0 - (the "License"); you may not use this file except in compliance with - the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. ---> - -One of the more interesting features of Samza is stateful stream processing. Tasks can store and query data through APIs provided by Samza. That data is stored on the same machine as the stream task; compared to connecting over the network to a remote database, Samza's local state allows you to read and write large amounts of data with better performance. Samza replicates this state across multiple machines for fault-tolerance (described in detail below). - -Some stream processing jobs don't require state: if you only need to transform one message at a time, or filter out messages based on some condition, your job can be simple. Every call to your task's [process method](../api/overview.html) handles one incoming message, and each message is independent of all the other messages. - -However, being able to maintain state opens up many possibilities for sophisticated stream processing jobs: joining input streams, grouping messages and aggregating groups of messages. By analogy to SQL, the *select* and *where* clauses of a query are usually stateless, but *join*, *group by* and aggregation functions like *sum* and *count* require state. Samza doesn't yet provide a higher-level SQL-like language, but it does provide lower-level primitives that you can use to implement streaming aggregation and joins. - -### Common use cases for stateful processing - -First, let's look at some simple examples of stateful stream processing that might be seen in the backend of a consumer website. Later in this page we'll discuss how to implement these applications using Samza's built-in key-value storage capabilities. - -#### Windowed aggregation - -*Example: Counting the number of page views for each user per hour* - -In this case, your state typically consists of a number of counters which are incremented when a message is processed. The aggregation is typically limited to a time window (e.g. 1 minute, 1 hour, 1 day) so that you can observe changes of activity over time. This kind of windowed processing is common for ranking and relevance, detecting "trending topics", as well as real-time reporting and monitoring. - -The simplest implementation keeps this state in memory (e.g. a hash map in the task instances), and writes it to a database or output stream at the end of every time window. However, you need to consider what happens when a container fails and your in-memory state is lost. You might be able to restore it by processing all the messages in the current window again, but that might take a long time if the window covers a long period of time. Samza can speed up this recovery by making the state fault-tolerant rather than trying to recompute it. - -#### Table-table join - -*Example: Join a table of user profiles to a table of user settings by user\_id and emit the joined stream* - -You might wonder: does it make sense to join two tables in a stream processing system? It does if your database can supply a log of all the changes in the database. There is a [duality between a database and a changelog stream](http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying): you can publish every data change to a stream, and if you consume the entire stream from beginning to end, you can reconstruct the entire contents of the database. Samza is designed for data processing jobs that follow this philosophy. - -If you have changelog streams for several database tables, you can write a stream processing job which keeps the latest state of each table in a local key-value store, where you can access it much faster than by making queries to the original database. Now, whenever data in one table changes, you can join it with the latest data for the same key in the other table, and output the joined result. - -There are several real-life examples of data normalization which essentially work in this way: - -* E-commerce companies like Amazon and EBay need to import feeds of merchandise from merchants, normalize them by product, and present products with all the associated merchants and pricing information. -* Web search requires building a crawler which creates essentially a [table of web page contents](http://labs.yahoo.com/files/YahooWebmap.pdf) and joins on all the relevance attributes such as click-through ratio or pagerank. -* Social networks take feeds of user-entered text and need to normalize out entities such as companies, schools, and skills. - -Each of these use cases is a massively complex data normalization problem that can be thought of as constructing a materialized view over many input tables. Samza can help implement such data processing pipelines robustly. - -#### Stream-table join - -*Example: Augment a stream of page view events with the user's ZIP code (perhaps to allow aggregation by zip code in a later stage)* - -Joining side-information to a real-time feed is a classic use for stream processing. It's particularly common in advertising, relevance ranking, fraud detection and other domains. Activity events such as page views generally only include a small number of attributes, such as the ID of the viewer and the viewed items, but not detailed attributes of the viewer and the viewed items, such as the ZIP code of the user. If you want to aggregate the stream by attributes of the viewer or the viewed items, you need to join with the users table or the items table respectively. - -In data warehouse terminology, you can think of the raw event stream as rows in the central fact table, which needs to be joined with dimension tables so that you can use attributes of the dimensions in your analysis. - -#### Stream-stream join - -*Example: Join a stream of ad clicks to a stream of ad impressions (to link the information on when the ad was shown to the information on when it was clicked)* - -A stream join is useful for "nearly aligned" streams, where you expect to receive related events on several input streams, and you want to combine them into a single output event. You cannot rely on the events arriving at the stream processor at the same time, but you can set a maximum period of time over which you allow the events to be spread out. - -In order to perform a join between streams, your job needs to buffer events for the time window over which you want to join. For short time windows, you can do this in memory (at the risk of losing events if the machine fails). You can also use Samza's state store to buffer events, which supports buffering more messages than you can fit in memory. - -#### More - -There are many variations of joins and aggregations, but most are essentially variations and combinations of the above patterns. - -### Approaches to managing task state - -So how do systems support this kind of stateful processing? We'll lead in by describing what we have seen in other stream processing systems, and then describe what Samza does. - -#### In-memory state with checkpointing - -A simple approach, common in academic stream processing systems, is to periodically save the task's entire in-memory data to durable storage. This approach works well if the in-memory state consists of only a few values. However, you have to store the complete task state on each checkpoint, which becomes increasingly expensive as task state grows. Unfortunately, many non-trivial use cases for joins and aggregation have large amounts of state — often many gigabytes. This makes full dumps of the state impractical. - -Some academic systems produce *diffs* in addition to full checkpoints, which are smaller if only some of the state has changed since the last checkpoint. [Storm's Trident abstraction](../comparisons/storm.html) similarly keeps an in-memory cache of state, and periodically writes any changes to a remote store such as Cassandra. However, this optimization only helps if most of the state remains unchanged. In some use cases, such as stream joins, it is normal to have a lot of churn in the state, so this technique essentially degrades to making a remote database request for every message (see below). - -#### Using an external store - -Another common pattern for stateful processing is to store the state in an external database or key-value store. Conventional database replication can be used to make that database fault-tolerant. The architecture looks something like this: - - - -Samza allows this style of processing — there is nothing to stop you querying a remote database or service within your job. However, there are a few reasons why a remote database can be problematic for stateful stream processing: - -1. **Performance**: Making database queries over a network is slow and expensive. A Kafka stream can deliver hundreds of thousands or even millions of messages per second per CPU core to a stream processor, but if you need to make a remote request for every message you process, your throughput is likely to drop by 2-3 orders of magnitude. You can somewhat mitigate this with careful caching of reads and batching of writes, but then you're back to the problems of checkpointing, discussed above. -2. **Isolation**: If your database or service also serves requests to users, it can be dangerous to use the same database with a stream processor. A scalable stream processing system can run with very high throughput, and easily generates a huge amount of load (for example when catching up on a queue backlog). If you're not very careful, you may cause a denial-of-service attack on your own database, and cause problems for interactive requests from users. -3. **Query Capabilities**: Many scalable databases expose very limited query interfaces (e.g. only supporting simple key-value lookups), because the equivalent of a "full table scan" or rich traversal would be too expensive. Stream processes are often less latency-sensitive, so richer query capabilities would be more feasible. -4. **Correctness**: When a stream processor fails and needs to be restarted, how is the database state made consistent with the processing task? For this purpose, some frameworks such as [Storm](../comparisons/storm.html) attach metadata to database entries, but it needs to be handled carefully, otherwise the stream process generates incorrect output. -5. **Reprocessing**: Sometimes it can be useful to re-run a stream process on a large amount of historical data, e.g. after updating your processing task's code. However, the issues above make this impractical for jobs that make external queries. - -### Local state in Samza - -Samza allows tasks to maintain state in a way that is different from the approaches described above: - -* The state is stored on disk, so the job can maintain more state than would fit in memory. -* It is stored on the same machine as the processing task, to avoid the performance problems of making database queries over the network. -* Each job has its own datastore, to avoid the isolation problems of a shared database (if you make an expensive query, it affects only the current task, nobody else). -* Different storage engines can be plugged in, enabling rich query capabilities. -* The state is continuously replicated, enabling fault tolerance without the problems of checkpointing large amounts of state. - -Imagine you take a remote database, partition it to match the number of tasks in the stream processing job, and co-locate each partition with its task. The result looks like this: - - - -If a machine fails, all the tasks running on that machine and their database partitions are lost. In order to make them highly available, all writes to the database partition are replicated to a durable changelog (typically Kafka). Now, when a machine fails, we can restart the tasks on another machine, and consume this changelog in order to restore the contents of the database partition. - -Note that each task only has access to its own database partition, not to any other task's partition. This is important: when you scale out your job by giving it more computing resources, Samza needs to move tasks from one machine to another. By giving each task its own state, tasks can be relocated without affecting the job's operation. If necessary, you can repartition your streams so that all messages for a particular database partition are routed to the same task instance. - -[Log compaction](http://kafka.apache.org/documentation.html#compaction) runs in the background on the changelog topic, and ensures that the changelog does not grow indefinitely. If you overwrite the same value in the store many times, log compaction keeps only the most recent value, and throws away any old values in the log. If you delete an item from the store, log compaction also removes it from the log. With the right tuning, the changelog is not much bigger than the database itself. - -With this architecture, Samza allows tasks to maintain large amounts of fault-tolerant state, at a performance that is almost as good as a pure in-memory implementation. There are just a few limitations: - -* If you have some data that you want to share between tasks (across partition boundaries), you need to go to some additional effort to repartition and distribute the data. Each task will need its own copy of the data, so this may use more space overall. -* When a container is restarted, it can take some time to restore the data in all of its state partitions. The time depends on the amount of data, the storage engine, your access patterns, and other factors. As a rule of thumb, 50 MB/sec is a reasonable restore time to expect. - -Nothing prevents you from using an external database if you want to, but for many use cases, Samza's local state is a powerful tool for enabling stateful stream processing. - -### Key-value storage - -Any storage engine can be plugged into Samza, as described below. Out of the box, Samza ships with a key-value store implementation that is built on [LevelDB](https://code.google.com/p/leveldb) using a [JNI API](https://github.com/fusesource/leveldbjni). - -LevelDB has several nice properties. Its memory allocation is outside of the Java heap, which makes it more memory-efficient and less prone to garbage collection pauses than a Java-based storage engine. It is very fast for small datasets that fit in memory; datasets larger than memory are slower but still possible. It is [log-structured](http://www.igvita.com/2012/02/06/sstable-and-log-structured-storage-leveldb/), allowing very fast writes. It also includes support for block compression, which helps to reduce I/O and memory usage. - -Samza includes an additional in-memory caching layer in front of LevelDB, which avoids the cost of deserialization for frequently-accessed objects and batches writes. If the same key is updated multiple times in quick succession, the batching coalesces those updates into a single write. The writes are flushed to the changelog when a task [commits](checkpointing.html). - -To use a key-value store in your job, add the following to your job config: - -{% highlight jproperties %} -# Use the key-value store implementation for a store called "my-store" -stores.my-store.factory=org.apache.samza.storage.kv.KeyValueStorageEngineFactory - -# Use the Kafka topic "my-store-changelog" as the changelog stream for this store. -# This enables automatic recovery of the store after a failure. If you don't -# configure this, no changelog stream will be generated. -stores.my-store.changelog=kafka.my-store-changelog - -# Encode keys and values in the store as UTF-8 strings. -serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory -stores.my-store.key.serde=string -stores.my-store.msg.serde=string -{% endhighlight %} - -See the [serialization section](serialization.html) for more information on the *serde* options. - -Here is a simple example that writes every incoming message to the store: - -{% highlight java %} -public class MyStatefulTask implements StreamTask, InitableTask { - private KeyValueStore<String, String> store; - - public void init(Config config, TaskContext context) { - this.store = (KeyValueStore<String, String>) context.getStore("my-store"); - } - - public void process(IncomingMessageEnvelope envelope, - MessageCollector collector, - TaskCoordinator coordinator) { - store.put((String) envelope.getKey(), (String) envelope.getMessage()); - } -} -{% endhighlight %} - -Here is the complete key-value store API: - -{% highlight java %} -public interface KeyValueStore<K, V> { - V get(K key); - void put(K key, V value); - void putAll(List<Entry<K,V>> entries); - void delete(K key); - KeyValueIterator<K,V> range(K from, K to); - KeyValueIterator<K,V> all(); -} -{% endhighlight %} - -Additional configuration properties for the key-value store are documented in the [configuration reference](../jobs/configuration-table.html#keyvalue). - -### Implementing common use cases with the key-value store - -Earlier in this section we discussed some example use cases for stateful stream processing. Let's look at how each of these could be implemented using a key-value storage engine such as Samza's LevelDB. - -#### Windowed aggregation - -*Example: Counting the number of page views for each user per hour* - -Implementation: You need two processing stages. - -1. The first one re-partitions the input data by user ID, so that all the events for a particular user are routed to the same stream task. If the input stream is already partitioned by user ID, you can skip this. -2. The second stage does the counting, using a key-value store that maps a user ID to the running count. For each new event, the job reads the current count for the appropriate user from the store, increments it, and writes it back. When the window is complete (e.g. at the end of an hour), the job iterates over the contents of the store and emits the aggregates to an output stream. - -Note that this job effectively pauses at the hour mark to output its results. This is totally fine for Samza, as scanning over the contents of the key-value store is quite fast. The input stream is buffered while the job is doing this hourly work. - -#### Table-table join - -*Example: Join a table of user profiles to a table of user settings by user\_id and emit the joined stream* - -Implementation: The job subscribes to the change streams for the user profiles database and the user settings database, both partitioned by user\_id. The job keeps a key-value store keyed by user\_id, which contains the latest profile record and the latest settings record for each user\_id. When a new event comes in from either stream, the job looks up the current value in its store, updates the appropriate fields (depending on whether it was a profile update or a settings update), and writes back the new joined record to the store. The changelog of the store doubles as the output stream of the task. - -#### Table-stream join - -*Example: Augment a stream of page view events with the user's ZIP code (perhaps to allow aggregation by zip code in a later stage)* - -Implementation: The job subscribes to the stream of user profile updates and the stream of page view events. Both streams must be partitioned by user\_id. The job maintains a key-value store where the key is the user\_id and the value is the user's ZIP code. Every time the job receives a profile update, it extracts the user's new ZIP code from the profile update and writes it to the store. Every time it receives a page view event, it reads the zip code for that user from the store, and emits the page view event with an added ZIP code field. - -If the next stage needs to aggregate by ZIP code, the ZIP code can be used as the partitioning key of the job's output stream. That ensures that all the events for the same ZIP code are sent to the same stream partition. - -#### Stream-stream join - -*Example: Join a stream of ad clicks to a stream of ad impressions (to link the information on when the ad was shown to the information on when it was clicked)* - -In this example we assume that each impression of an ad has a unique identifier, e.g. a UUID, and that the same identifier is included in both the impression and the click events. This identifier is used as the join key. - -Implementation: Partition the ad click and ad impression streams by the impression ID or user ID (assuming that two events with the same impression ID always have the same user ID). The task keeps two stores, one containing click events and one containing impression events, using the impression ID as key for both stores. When the job receives a click event, it looks for the corresponding impression in the impression store, and vice versa. If a match is found, the joined pair is emitted and the entry is deleted. If no match is found, the event is written to the appropriate store. Periodically the job scans over both stores and deletes any old events that were not matched within the time window of the join. - -### Other storage engines - -Samza's fault-tolerance mechanism (sending a local store's writes to a replicated changelog) is completely decoupled from the storage engine's data structures and query APIs. While a key-value storage engine is good for general-purpose processing, you can easily add your own storage engines for other types of queries by implementing the [StorageEngine](../api/javadocs/org/apache/samza/storage/StorageEngine.html) interface. Samza's model is especially amenable to embedded storage engines, which run as a library in the same process as the stream task. - -Some ideas for other storage engines that could be useful: a persistent heap (for running top-N queries), [approximate algorithms](http://infolab.stanford.edu/~ullman/mmds/ch4.pdf) such as [bloom filters](http://en.wikipedia.org/wiki/Bloom_filter) and [hyperloglog](http://research.google.com/pubs/pub40671.html), or full-text indexes such as [Lucene](http://lucene.apache.org). (Patches accepted!) - -### Fault tolerance semantics with state - -As discussed in the section on [checkpointing](checkpointing.html), Samza currently only supports at-least-once delivery guarantees in the presence of failure (this is sometimes referred to as "guaranteed delivery"). This means that if a task fails, no messages are lost, but some messages may be redelivered. - -For many of the stateful processing use cases discussed above, this is not a problem: if the effect of a message on state is idempotent, it is safe for the same message to be processed more than once. For example, if the store contains the ZIP code for each user, then processing the same profile update twice has no effect, because the duplicate update does not change the ZIP code. - -However, for non-idempotent operations such as counting, at-least-once delivery guarantees can give incorrect results. If a Samza task fails and is restarted, it may double-count some messages that were processed shortly before the failure. We are planning to address this limitation in a future release of Samza. - -## [Metrics »](metrics.html)
http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/documentation/0.7.0/container/streams.md ---------------------------------------------------------------------- diff --git a/docs/learn/documentation/0.7.0/container/streams.md b/docs/learn/documentation/0.7.0/container/streams.md deleted file mode 100644 index 59e0855..0000000 --- a/docs/learn/documentation/0.7.0/container/streams.md +++ /dev/null @@ -1,139 +0,0 @@ ---- -layout: page -title: Streams ---- -<!-- - Licensed to the Apache Software Foundation (ASF) under one or more - contributor license agreements. See the NOTICE file distributed with - this work for additional information regarding copyright ownership. - The ASF licenses this file to You under the Apache License, Version 2.0 - (the "License"); you may not use this file except in compliance with - the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. ---> - -The [samza container](samza-container.html) reads and writes messages using the [SystemConsumer](../api/javadocs/org/apache/samza/system/SystemConsumer.html) and [SystemProducer](../api/javadocs/org/apache/samza/system/SystemProducer.html) interfaces. You can integrate any message broker with Samza by implementing these two interfaces. - -{% highlight java %} -public interface SystemConsumer { - void start(); - - void stop(); - - void register( - SystemStreamPartition systemStreamPartition, - String lastReadOffset); - - List<IncomingMessageEnvelope> poll( - Map<SystemStreamPartition, Integer> systemStreamPartitions, - long timeout) - throws InterruptedException; -} - -public class IncomingMessageEnvelope { - public Object getMessage() { ... } - - public Object getKey() { ... } - - public SystemStreamPartition getSystemStreamPartition() { ... } -} - -public interface SystemProducer { - void start(); - - void stop(); - - void register(String source); - - void send(String source, OutgoingMessageEnvelope envelope); - - void flush(String source); -} - -public class OutgoingMessageEnvelope { - ... - public Object getKey() { ... } - - public Object getMessage() { ... } -} -{% endhighlight %} - -Out of the box, Samza supports Kafka (KafkaSystemConsumer and KafkaSystemProducer). However, any message bus system can be plugged in, as long as it can provide the semantics required by Samza, as described in the [javadoc](../api/javadocs/org/apache/samza/system/SystemConsumer.html). - -SystemConsumers and SystemProducers may read and write messages of any data type. It's ok if they only support byte arrays — Samza has a separate [serialization layer](serialization.html) which converts to and from objects that application code can use. Samza does not prescribe any particular data model or serialization format. - -The job configuration file can include properties that are specific to a particular consumer and producer implementation. For example, the configuration would typically indicate the hostname and port of the message broker to use, and perhaps connection options. - -### How streams are processed - -If a job is consuming messages from more than one input stream, and all input streams have messages available, messages are processed in a round robin fashion by default. For example, if a job is consuming AdImpressionEvent and AdClickEvent, the task instance's process() method is called with a message from AdImpressionEvent, then a message from AdClickEvent, then another message from AdImpressionEvent, ... and continues to alternate between the two. - -If one of the input streams has no new messages available (the most recent message has already been consumed), that stream is skipped, and the job continues to consume from the other inputs. It continues to check for new messages becoming available. - -#### MessageChooser - -When a Samza container has several incoming messages on different stream partitions, how does it decide which to process first? The behavior is determined by a [MessageChooser](../api/javadocs/org/apache/samza/system/chooser/MessageChooser.html). The default chooser is RoundRobinChooser, but you can override it by implementing a custom chooser. - -To plug in your own message chooser, you need to implement the [MessageChooserFactory](../api/javadocs/org/apache/samza/system/chooser/MessageChooserFactory.html) interface, and set the "task.chooser.class" configuration to the fully-qualified class name of your implementation: - -{% highlight jproperties %} -task.chooser.class=com.example.samza.YourMessageChooserFactory -{% endhighlight %} - -#### Prioritizing input streams - -There are certain times when messages from one stream should be processed with higher priority than messages from another stream. For example, some Samza jobs consume two streams: one stream is fed by a real-time system and the other stream is fed by a batch system. In this case, it's useful to prioritize the real-time stream over the batch stream, so that the real-time processing doesn't slow down if there is a sudden burst of data on the batch stream. - -Samza provides a mechanism to prioritize one stream over another by setting this configuration parameter: systems.<system>.streams.<stream>.samza.priority=<number>. For example: - -{% highlight jproperties %} -systems.kafka.streams.my-real-time-stream.samza.priority=2 -systems.kafka.streams.my-batch-stream.samza.priority=1 -{% endhighlight %} - -This declares that my-real-time-stream's messages should be processed with higher priority than my-batch-stream's messages. If my-real-time-stream has any messages available, they are processed first. Only if there are no messages currently waiting on my-real-time-stream, the Samza job continues processing my-batch-stream. - -Each priority level gets its own MessageChooser. It is valid to define two streams with the same priority. If messages are available from two streams at the same priority level, it's up to the MessageChooser for that priority level to decide which message should be processed first. - -It's also valid to only define priorities for some streams. All non-prioritized streams are treated as the lowest priority, and share a MessageChooser. - -#### Bootstrapping - -Sometimes, a Samza job needs to fully consume a stream (from offset 0 up to the most recent message) before it processes messages from any other stream. This is useful in situations where the stream contains some prerequisite data that the job needs, and it doesn't make sense to process messages from other streams until the job has loaded that prerequisite data. Samza supports this use case with *bootstrap streams*. - -A bootstrap stream seems similar to a stream with a high priority, but is subtly different. Before allowing any other stream to be processed, a bootstrap stream waits for the consumer to explicitly confirm that the stream has been fully consumed. Until then, the bootstrap stream is the exclusive input to the job: even if a network issue or some other factor causes the bootstrap stream consumer to slow down, other inputs can't sneak their messages in. - -Another difference between a bootstrap stream and a high-priority stream is that the bootstrap stream's special treatment is temporary: when it has been fully consumed (we say it has "caught up"), its priority drops to be the same as all the other input streams. - -To configure a stream called "my-bootstrap-stream" to be a fully-consumed bootstrap stream, use the following settings: - -{% highlight jproperties %} -systems.kafka.streams.my-bootstrap-stream.samza.bootstrap=true -systems.kafka.streams.my-bootstrap-stream.samza.reset.offset=true -systems.kafka.streams.my-bootstrap-stream.samza.offset.default=oldest -{% endhighlight %} - -The bootstrap=true parameter enables the bootstrap behavior (prioritization over other streams). The combination of reset.offset=true and offset.default=oldest tells Samza to always start reading the stream from the oldest offset, every time a container starts up (rather than starting to read from the most recent checkpoint). - -It is valid to define multiple bootstrap streams. In this case, the order in which they are bootstrapped is determined by the priority. - -#### Batching - -In some cases, you can improve performance by consuming several messages from the same stream partition in sequence. Samza supports this mode of operation, called *batching*. - -For example, if you want to read 100 messages in a row from each stream partition (regardless of the MessageChooser), you can use this configuration parameter: - -{% highlight jproperties %} -task.consumer.batch.size=100 -{% endhighlight %} - -With this setting, Samza tries to read a message from the most recently used [SystemStreamPartition](../api/javadocs/org/apache/samza/system/SystemStreamPartition.html). This behavior continues either until no more messages are available for that SystemStreamPartition, or until the batch size has been reached. When that happens, Samza defers to the MessageChooser to determine the next message to process. It then again tries to continue consume from the chosen message's SystemStreamPartition until the batch size is reached. - -## [Serialization »](serialization.html) http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/documentation/0.7.0/container/windowing.md ---------------------------------------------------------------------- diff --git a/docs/learn/documentation/0.7.0/container/windowing.md b/docs/learn/documentation/0.7.0/container/windowing.md deleted file mode 100644 index b10e5d4..0000000 --- a/docs/learn/documentation/0.7.0/container/windowing.md +++ /dev/null @@ -1,61 +0,0 @@ ---- -layout: page -title: Windowing ---- -<!-- - Licensed to the Apache Software Foundation (ASF) under one or more - contributor license agreements. See the NOTICE file distributed with - this work for additional information regarding copyright ownership. - The ASF licenses this file to You under the Apache License, Version 2.0 - (the "License"); you may not use this file except in compliance with - the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. ---> - -Sometimes a stream processing job needs to do something in regular time intervals, regardless of how many incoming messages the job is processing. For example, say you want to report the number of page views per minute. To do this, you increment a counter every time you see a page view event. Once per minute, you send the current counter value to an output stream and reset the counter to zero. - -Samza's *windowing* feature provides a way for tasks to do something in regular time intervals, for example once per minute. To enable windowing, you just need to set one property in your job configuration: - -{% highlight jproperties %} -# Call the window() method every 60 seconds -task.window.ms=60000 -{% endhighlight %} - -Next, your stream task needs to implement the [WindowableTask](../api/javadocs/org/apache/samza/task/WindowableTask.html) interface. This interface defines a window() method which is called by Samza in the regular interval that you configured. - -For example, this is how you would implement a basic per-minute event counter: - -{% highlight java %} -public class EventCounterTask implements StreamTask, WindowableTask { - - public static final SystemStream OUTPUT_STREAM = - new SystemStream("kafka", "events-per-minute"); - - private int eventsSeen = 0; - - public void process(IncomingMessageEnvelope envelope, - MessageCollector collector, - TaskCoordinator coordinator) { - eventsSeen++; - } - - public void window(MessageCollector collector, - TaskCoordinator coordinator) { - collector.send(new OutgoingMessageEnvelope(OUTPUT_STREAM, eventsSeen)); - eventsSeen = 0; - } -} -{% endhighlight %} - -If you need to send messages to output streams, you can use the [MessageCollector](../api/javadocs/org/apache/samza/task/MessageCollector.html) object passed to the window() method. Please only use that MessageCollector object for sending messages, and don't use it outside of the call to window(). - -Note that Samza uses [single-threaded execution](event-loop.html), so the window() call can never happen concurrently with a process() call. This has the advantage that you don't need to worry about thread safety in your code (no need to synchronize anything), but the downside that the window() call may be delayed if your process() method takes a long time to return. - -## [Event Loop »](event-loop.html) http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/documentation/0.7.0/index.html ---------------------------------------------------------------------- diff --git a/docs/learn/documentation/0.7.0/index.html b/docs/learn/documentation/0.7.0/index.html deleted file mode 100644 index 626631b..0000000 --- a/docs/learn/documentation/0.7.0/index.html +++ /dev/null @@ -1,92 +0,0 @@ ---- -layout: page -title: Documentation ---- -<!-- - Licensed to the Apache Software Foundation (ASF) under one or more - contributor license agreements. See the NOTICE file distributed with - this work for additional information regarding copyright ownership. - The ASF licenses this file to You under the Apache License, Version 2.0 - (the "License"); you may not use this file except in compliance with - the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. ---> - -<h4>Introduction</h4> - -<ul class="documentation-list"> - <li><a href="introduction/background.html">Background</a></li> - <li><a href="introduction/concepts.html">Concepts</a></li> - <li><a href="introduction/architecture.html">Architecture</a></li> -</ul> - -<h4>Comparisons</h4> - -<ul class="documentation-list"> - <li><a href="comparisons/introduction.html">Introduction</a></li> - <li><a href="comparisons/mupd8.html">MUPD8</a></li> - <li><a href="comparisons/storm.html">Storm</a></li> - <li><a href="comparisons/spark-streaming.html">Spark Streaming</a></li> -<!-- TODO comparisons pages - <li><a href="comparisons/aurora.html">Aurora</a></li> - <li><a href="comparisons/jms.html">JMS</a></li> - <li><a href="comparisons/s4.html">S4</a></li> ---> -</ul> - -<h4>API</h4> - -<ul class="documentation-list"> - <li><a href="api/overview.html">Overview</a></li> - <li><a href="api/javadocs">Javadocs</a></li> -</ul> - -<h4>Container</h4> - -<ul class="documentation-list"> - <li><a href="container/samza-container.html">SamzaContainer</a></li> - <li><a href="container/streams.html">Streams</a></li> - <li><a href="container/serialization.html">Serialization</a></li> - <li><a href="container/checkpointing.html">Checkpointing</a></li> - <li><a href="container/state-management.html">State Management</a></li> - <li><a href="container/metrics.html">Metrics</a></li> - <li><a href="container/windowing.html">Windowing</a></li> - <li><a href="container/event-loop.html">Event Loop</a></li> - <li><a href="container/jmx.html">JMX</a></li> -</ul> - -<h4>Jobs</h4> - -<ul class="documentation-list"> - <li><a href="jobs/job-runner.html">JobRunner</a></li> - <li><a href="jobs/configuration.html">Configuration</a></li> - <li><a href="jobs/packaging.html">Packaging</a></li> - <li><a href="jobs/yarn-jobs.html">YARN Jobs</a></li> - <li><a href="jobs/logging.html">Logging</a></li> - <li><a href="jobs/reprocessing.html">Reprocessing</a></li> -</ul> - -<h4>YARN</h4> - -<ul class="documentation-list"> - <li><a href="yarn/application-master.html">Application Master</a></li> - <li><a href="yarn/isolation.html">Isolation</a></li> -<!-- TODO write yarn pages - <li><a href="">Fault Tolerance</a></li> - <li><a href="">Security</a></li> ---> -</ul> - -<h4>Operations</h4> - -<ul class="documentation-list"> - <li><a href="operations/security.html">Security</a></li> - <li><a href="operations/kafka.html">Kafka</a></li> -</div> http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/documentation/0.7.0/introduction/architecture.md ---------------------------------------------------------------------- diff --git a/docs/learn/documentation/0.7.0/introduction/architecture.md b/docs/learn/documentation/0.7.0/introduction/architecture.md deleted file mode 100644 index 46987e5..0000000 --- a/docs/learn/documentation/0.7.0/introduction/architecture.md +++ /dev/null @@ -1,110 +0,0 @@ ---- -layout: page -title: Architecture ---- -<!-- - Licensed to the Apache Software Foundation (ASF) under one or more - contributor license agreements. See the NOTICE file distributed with - this work for additional information regarding copyright ownership. - The ASF licenses this file to You under the Apache License, Version 2.0 - (the "License"); you may not use this file except in compliance with - the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. ---> - -Samza is made up of three layers: - -1. A streaming layer. -2. An execution layer. -3. A processing layer. - -Samza provides out of the box support for all three layers. - -1. **Streaming:** [Kafka](http://kafka.apache.org/) -2. **Execution:** [YARN](http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html) -3. **Processing:** [Samza API](../api/overview.html) - -These three pieces fit together to form Samza: - - - -This architecture follows a similar pattern to Hadoop (which also uses YARN as execution layer, HDFS for storage, and MapReduce as processing API): - - - -Before going in-depth on each of these three layers, it should be noted that Samza's support is not limited to Kafka and YARN. Both Samza's execution and streaming layer are pluggable, and allow developers to implement alternatives if they prefer. - -### Kafka - -[Kafka](http://kafka.apache.org/) is a distributed pub/sub and message queueing system that provides at-least once messaging guarantees (i.e. the system guarantees that no messages are lost, but in certain fault scenarios, a consumer might receive the same message more than once), and highly available partitions (i.e. a stream's partitions continue to be available even if a machine goes down). - -In Kafka, each stream is called a *topic*. Each topic is partitioned and replicated across multiple machines called *brokers*. When a *producer* sends a message to a topic, it provides a key, which is used to determine which partition the message should be sent to. The Kafka brokers receive and store the messages that the producer sends. Kafka *consumers* can then read from a topic by subscribing to messages on all partitions of a topic. - -Kafka has some interesting properties: - -* All messages with the same key are guaranteed to be in the same topic partition. This means that if you wish to read all messages for a specific user ID, you only have to read the messages from the partition that contains the user ID, not the whole topic (assuming the user ID is used as key). -* A topic partition is a sequence of messages in order of arrival, so you can reference any message in the partition using a monotonically increasing *offset* (like an index into an array). This means that the broker doesn't need to keep track of which messages have been seen by a particular consumer — the consumer can keep track itself by storing the offset of the last message it has processed. It then knows that every message with a lower offset than the current offset has already been processed; every message with a higher offset has not yet been processed. - -For more details on Kafka, see Kafka's [documentation](http://kafka.apache.org/documentation.html) pages. - -### YARN - -[YARN](http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html) (Yet Another Resource Negotiator) is Hadoop's next-generation cluster scheduler. It allows you to allocate a number of *containers* (processes) in a cluster of machines, and execute arbitrary commands on them. - -When an application interacts with YARN, it looks something like this: - -1. **Application**: I want to run command X on two machines with 512MB memory. -2. **YARN**: Cool, where's your code? -3. **Application**: http://path.to.host/jobs/download/my.tgz -4. **YARN**: I'm running your job on node-1.grid and node-2.grid. - -Samza uses YARN to manage deployment, fault tolerance, logging, resource isolation, security, and locality. A brief overview of YARN is below; see [this page from Hortonworks](http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/) for a much better overview. - -#### YARN Architecture - -YARN has three important pieces: a *ResourceManager*, a *NodeManager*, and an *ApplicationMaster*. In a YARN grid, every machine runs a NodeManager, which is responsible for launching processes on that machine. A ResourceManager talks to all of the NodeManagers to tell them what to run. Applications, in turn, talk to the ResourceManager when they wish to run something on the cluster. The third piece, the ApplicationMaster, is actually application-specific code that runs in the YARN cluster. It's responsible for managing the application's workload, asking for containers (usually UNIX processes), and handling notifications when one of its containers fails. - -#### Samza and YARN - -Samza provides a YARN ApplicationMaster and a YARN job runner out of the box. The integration between Samza and YARN is outlined in the following diagram (different colors indicate different host machines): - - - -The Samza client talks to the YARN RM when it wants to start a new Samza job. The YARN RM talks to a YARN NM to allocate space on the cluster for Samza's ApplicationMaster. Once the NM allocates space, it starts the Samza AM. After the Samza AM starts, it asks the YARN RM for one or more YARN containers to run [SamzaContainers](../container/samza-container.html). Again, the RM works with NMs to allocate space for the containers. Once the space has been allocated, the NMs start the Samza containers. - -### Samza - -Samza uses YARN and Kafka to provide a framework for stage-wise stream processing and partitioning. Everything, put together, looks like this (different colors indicate different host machines): - - - -The Samza client uses YARN to run a Samza job: YARN starts and supervises one or more [SamzaContainers](../container/samza-container.html), and your processing code (using the [StreamTask](../api/overview.html) API) runs inside those containers. The input and output for the Samza StreamTasks come from Kafka brokers that are (usually) co-located on the same machines as the YARN NMs. - -### Example - -Let's take a look at a real example: suppose we want to count the number of page views. In SQL, you would write something like: - -{% highlight sql %} -SELECT user_id, COUNT(*) FROM PageViewEvent GROUP BY user_id -{% endhighlight %} - -Although Samza doesn't support SQL right now, the idea is the same. Two jobs are required to calculate this query: one to group messages by user ID, and the other to do the counting. - -In the first job, the grouping is done by sending all messages with the same user ID to the same partition of an intermediate topic. You can do this by using the user ID as key of the messages that are emitted by the first job, and this key is mapped to one of the intermediate topic's partitions (usually by taking a hash of the key mod the number of partitions). The second job consumes the intermediate topic. Each task in the second job consumes one partition of the intermediate topic, i.e. all the messages for a subset of user IDs. The task has a counter for each user ID in its partition, and the appropriate counter is incremented every time the task receives a message with a particular user ID. - -<img src="/img/0.7.0/learn/documentation/introduction/group-by-example.png" alt="Repartitioning for a GROUP BY" class="diagram-large"> - -If you are familiar with Hadoop, you may recognize this as a Map/Reduce operation, where each record is associated with a particular key in the mappers, records with the same key are grouped together by the framework, and then counted in the reduce step. The difference between Hadoop and Samza is that Hadoop operates on a fixed input, whereas Samza works with unbounded streams of data. - -Kafka takes the messages emitted by the first job and buffers them on disk, distributed across multiple machines. This helps make the system fault-tolerant: if one machine fails, no messages are lost, because they have been replicated to other machines. And if the second job goes slow or stops consuming messages for any reason, the first job is unaffected: the disk buffer can absorb the backlog of messages from the first job until the second job catches up again. - -By partitioning topics, and by breaking a stream process down into jobs and parallel tasks that run on multiple machines, Samza scales to streams with very high message throughput. By using YARN and Kafka, Samza achieves fault-tolerance: if a process or machine fails, it is automatically restarted on another machine and continues processing messages from the point where it left off. - -## [Comparison Introduction »](../comparisons/introduction.html) http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/documentation/0.7.0/introduction/background.md ---------------------------------------------------------------------- diff --git a/docs/learn/documentation/0.7.0/introduction/background.md b/docs/learn/documentation/0.7.0/introduction/background.md deleted file mode 100644 index e09497b..0000000 --- a/docs/learn/documentation/0.7.0/introduction/background.md +++ /dev/null @@ -1,71 +0,0 @@ ---- -layout: page -title: Background ---- -<!-- - Licensed to the Apache Software Foundation (ASF) under one or more - contributor license agreements. See the NOTICE file distributed with - this work for additional information regarding copyright ownership. - The ASF licenses this file to You under the Apache License, Version 2.0 - (the "License"); you may not use this file except in compliance with - the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. ---> - -This page provides some background about stream processing, describes what Samza is, and why it was built. - -### What is messaging? - -Messaging systems are a popular way of implementing near-realtime asynchronous computation. Messages can be added to a message queue (ActiveMQ, RabbitMQ), pub-sub system (Kestrel, Kafka), or log aggregation system (Flume, Scribe) when something happens. Downstream *consumers* read messages from these systems, and process them or take actions based on the message contents. - -Suppose you have a website, and every time someone loads a page, you send a "user viewed page" event to a messaging system. You might then have consumers which do any of the following: - -* Store the message in Hadoop for future analysis -* Count page views and update a dashboard -* Trigger an alert if a page view fails -* Send an email notification to another user -* Join the page view event with the user's profile, and send the message back to the messaging system - -A messaging system lets you decouple all of this work from the actual web page serving. - -### What is stream processing? - -A messaging system is a fairly low-level piece of infrastructure—it stores messages and waits for consumers to consume them. When you start writing code that produces or consumes messages, you quickly find that there are a lot of tricky problems that have to be solved in the processing layer. Samza aims to help with these problems. - -Consider the counting example, above (count page views and update a dashboard). What happens when the machine that your consumer is running on fails, and your current counter values are lost? How do you recover? Where should the processor be run when it restarts? What if the underlying messaging system sends you the same message twice, or loses a message? (Unless you are careful, your counts will be incorrect.) What if you want to count page views grouped by the page URL? How do you distribute the computation across multiple machines if it's too much for a single machine to handle? - -Stream processing is a higher level of abstraction on top of messaging systems, and it's meant to address precisely this category of problems. - -### Samza - -Samza is a stream processing framework with the following features: - -* **Simple API:** Unlike most low-level messaging system APIs, Samza provides a very simple callback-based "process message" API comparable to MapReduce. -* **Managed state:** Samza manages snapshotting and restoration of a stream processor's state. When the processor is restarted, Samza restores its state to a consistent snapshot. Samza is built to handle large amounts of state (many gigabytes per partition). -* **Fault tolerance:** Whenever a machine in the cluster fails, Samza works with YARN to transparently migrate your tasks to another machine. -* **Durability:** Samza uses Kafka to guarantee that messages are processed in the order they were written to a partition, and that no messages are ever lost. -* **Scalability:** Samza is partitioned and distributed at every level. Kafka provides ordered, partitioned, replayable, fault-tolerant streams. YARN provides a distributed environment for Samza containers to run in. -* **Pluggable:** Though Samza works out of the box with Kafka and YARN, Samza provides a pluggable API that lets you run Samza with other messaging systems and execution environments. -* **Processor isolation:** Samza works with Apache YARN, which supports Hadoop's security model, and resource isolation through Linux CGroups. - -### Alternatives - -The available open source stream processing systems are actually quite young, and no single system offers a complete solution. New problems in this area include: how a stream processor's state should be managed, whether or not a stream should be buffered remotely on disk, what to do when duplicate messages are received or messages are lost, and how to model underlying messaging systems. - -Samza's main differentiators are: - -* Samza supports fault-tolerant local state. State can be thought of as tables that are split up and co-located with the processing tasks. State is itself modeled as a stream. If the local state is lost due to machine failure, the state stream is replayed to restore it. -* Streams are ordered, partitioned, replayable, and fault tolerant. -* YARN is used for processor isolation, security, and fault tolerance. -* Jobs are decoupled: if one job goes slow and builds up a backlog of unprocessed messages, the rest of the system is not affected. - -For a more in-depth discussion on Samza, and how it relates to other stream processing systems, have a look at Samza's [Comparisons](../comparisons/introduction.html) documentation. - -## [Concepts »](concepts.html) http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/documentation/0.7.0/introduction/concepts.md ---------------------------------------------------------------------- diff --git a/docs/learn/documentation/0.7.0/introduction/concepts.md b/docs/learn/documentation/0.7.0/introduction/concepts.md deleted file mode 100644 index dacc6af..0000000 --- a/docs/learn/documentation/0.7.0/introduction/concepts.md +++ /dev/null @@ -1,72 +0,0 @@ ---- -layout: page -title: Concepts ---- -<!-- - Licensed to the Apache Software Foundation (ASF) under one or more - contributor license agreements. See the NOTICE file distributed with - this work for additional information regarding copyright ownership. - The ASF licenses this file to You under the Apache License, Version 2.0 - (the "License"); you may not use this file except in compliance with - the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. ---> - -This page gives an introduction to the high-level concepts in Samza. - -### Streams - -Samza processes *streams*. A stream is composed of immutable *messages* of a similar type or category. For example, a stream could be all the clicks on a website, or all the updates to a particular database table, or all the logs produced by a service, or any other type of event data. Messages can be appended to a stream or read from a stream. A stream can have any number of *consumers*, and reading from a stream doesn't delete the message (so each message is effectively broadcast to all consumers). Messages can optionally have an associated key which is used for partitioning, which we'll talk about in a second. - -Samza supports pluggable *systems* that implement the stream abstraction: in [Kafka](https://kafka.apache.org/) a stream is a topic, in a database we might read a stream by consuming updates from a table, in Hadoop we might tail a directory of files in HDFS. - - - -### Jobs - -A Samza *job* is code that performs a logical transformation on a set of input streams to append output messages to set of output streams. - -If scalability were not a concern, streams and jobs would be all we need. However, in order to scale the throughput of the stream processor, we chop streams and jobs up into smaller units of parallelism: *partitions* and *tasks*. - -### Partitions - -Each stream is broken into one or more partitions. Each partition in the stream is a totally ordered sequence of messages. - -Each message in this sequence has an identifier called the *offset*, which is unique per partition. The offset can be a sequential integer, byte offset, or string depending on the underlying system implementation. - -When a message is appended to a stream, it is appended to only one of the stream's partitions. The assignment of the message to its partition is done with a key chosen by the writer. For example, if the user ID is used as the key, that ensures that all messages related to a particular user end up in the same partition. - - - -### Tasks - -A job is scaled by breaking it into multiple *tasks*. The *task* is the unit of parallelism of the job, just as the partition is to the stream. Each task consumes data from one partition for each of the job's input streams. - -A task processes messages from each of its input partitions sequentially, in the order of message offset. There is no defined ordering across partitions. This allows each task to operate independently. The YARN scheduler assigns each task to a machine, so the job as a whole can be distributed across many machines. - -The number of tasks in a job is determined by the number of input partitions (there cannot be more tasks than input partitions, or there would be some tasks with no input). However, you can change the computational resources assigned to the job (the amount of memory, number of CPU cores, etc.) to satisfy the job's needs. See notes on *containers* below. - -The assignment of partitions to tasks never changes: if a task is on a machine that fails, the task is restarted elsewhere, still consuming the same stream partitions. - - - -### Dataflow Graphs - -We can compose multiple jobs to create a dataflow graph, where the nodes are streams containing data, and the edges are jobs performing transformations. This composition is done purely through the streams the jobs take as input and output. The jobs are otherwise totally decoupled: they need not be implemented in the same code base, and adding, removing, or restarting a downstream job will not impact an upstream job. - -These graphs are often acyclic—that is, data usually doesn't flow from a job, through other jobs, back to itself. However, it is possible to create cyclic graphs if you need to. - -<img src="/img/0.7.0/learn/documentation/introduction/dag.png" width="430" alt="Directed acyclic job graph"> - -### Containers - -Partitions and tasks are both *logical* units of parallelism—they don't correspond to any particular assignment of computational resources (CPU, memory, disk space, etc). Containers are the unit of physical parallelism, and a container is essentially a Unix process (or Linux [cgroup](http://en.wikipedia.org/wiki/Cgroups)). Each container runs one or more tasks. The number of tasks is determined automatically from the number of partitions in the input and is fixed, but the number of containers (and the CPU and memory resources associated with them) is specified by the user at run time and can be changed at any time. - -## [Architecture »](architecture.html)
