samza git commit: Add top-level page for Samza's architecture

jagadish Thu, 11 Oct 2018 16:21:02 -0700

Repository: samza
Updated Branches:
  refs/heads/master 5c5afb82a -> 992b217d2



Add top-level page for Samza's architecture

Author: Jagadish <jvenkatra...@linkedin.com>

Reviewers: Jagadish <jagad...@apache.org>

Closes #710 from vjagadish1989/website-reorg7


Project: http://git-wip-us.apache.org/repos/asf/samza/repo
Commit: http://git-wip-us.apache.org/repos/asf/samza/commit/992b217d
Tree: http://git-wip-us.apache.org/repos/asf/samza/tree/992b217d
Diff: http://git-wip-us.apache.org/repos/asf/samza/diff/992b217d

Branch: refs/heads/master
Commit: 992b217d26278731ae386dca24d19a56c27a7cbe
Parents: 5c5afb8
Author: Jagadish <jvenkatra...@linkedin.com>
Authored: Thu Oct 11 16:19:57 2018 -0700
Committer: Jagadish <jvenkatra...@linkedin.com>
Committed: Thu Oct 11 16:19:57 2018 -0700

----------------------------------------------------------------------
 .../architecture/distributed-execution.png      | Bin 0 -> 16510 bytes
 .../architecture/fault-tolerance.png            | Bin 0 -> 20244 bytes
 .../architecture/incremental-checkpointing.png  | Bin 0 -> 15352 bytes
 .../documentation/architecture/state-store.png  | Bin 0 -> 16436 bytes
 .../architecture/task-assignment.png            | Bin 0 -> 11384 bytes
 .../architecture/architecture-overview.md       |  65 ++++++++++++++++++-
 .../versioned/core-concepts/core-concepts.md    |  65 ++++++++++++++++++-
 7 files changed, 128 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/samza/blob/992b217d/docs/img/versioned/learn/documentation/architecture/distributed-execution.png
----------------------------------------------------------------------
diff --git 
a/docs/img/versioned/learn/documentation/architecture/distributed-execution.png 
b/docs/img/versioned/learn/documentation/architecture/distributed-execution.png
new file mode 100644
index 0000000..b4f7714
Binary files /dev/null and 
b/docs/img/versioned/learn/documentation/architecture/distributed-execution.png 
differ

http://git-wip-us.apache.org/repos/asf/samza/blob/992b217d/docs/img/versioned/learn/documentation/architecture/fault-tolerance.png
----------------------------------------------------------------------
diff --git 
a/docs/img/versioned/learn/documentation/architecture/fault-tolerance.png 
b/docs/img/versioned/learn/documentation/architecture/fault-tolerance.png
new file mode 100644
index 0000000..5146f03
Binary files /dev/null and 
b/docs/img/versioned/learn/documentation/architecture/fault-tolerance.png differ

http://git-wip-us.apache.org/repos/asf/samza/blob/992b217d/docs/img/versioned/learn/documentation/architecture/incremental-checkpointing.png
----------------------------------------------------------------------
diff --git 
a/docs/img/versioned/learn/documentation/architecture/incremental-checkpointing.png
 
b/docs/img/versioned/learn/documentation/architecture/incremental-checkpointing.png
new file mode 100644
index 0000000..4465ed5
Binary files /dev/null and 
b/docs/img/versioned/learn/documentation/architecture/incremental-checkpointing.png
 differ

http://git-wip-us.apache.org/repos/asf/samza/blob/992b217d/docs/img/versioned/learn/documentation/architecture/state-store.png
----------------------------------------------------------------------
diff --git 
a/docs/img/versioned/learn/documentation/architecture/state-store.png 
b/docs/img/versioned/learn/documentation/architecture/state-store.png
new file mode 100644
index 0000000..6cf8c24
Binary files /dev/null and 
b/docs/img/versioned/learn/documentation/architecture/state-store.png differ

http://git-wip-us.apache.org/repos/asf/samza/blob/992b217d/docs/img/versioned/learn/documentation/architecture/task-assignment.png
----------------------------------------------------------------------
diff --git 
a/docs/img/versioned/learn/documentation/architecture/task-assignment.png 
b/docs/img/versioned/learn/documentation/architecture/task-assignment.png
new file mode 100644
index 0000000..9cd4ada
Binary files /dev/null and 
b/docs/img/versioned/learn/documentation/architecture/task-assignment.png differ

http://git-wip-us.apache.org/repos/asf/samza/blob/992b217d/docs/learn/documentation/versioned/architecture/architecture-overview.md
----------------------------------------------------------------------
diff --git 
a/docs/learn/documentation/versioned/architecture/architecture-overview.md 
b/docs/learn/documentation/versioned/architecture/architecture-overview.md
index 6c1fbb1..282352c 100644
--- a/docs/learn/documentation/versioned/architecture/architecture-overview.md
+++ b/docs/learn/documentation/versioned/architecture/architecture-overview.md
@@ -19,5 +19,68 @@ title: Architecture page
    limitations under the License.
 -->
 
-## Samza architecture page
+- [Distributed execution](#distributed-execution)
+     - [Task](#task)
+     - [Container](#container)
+     - [Coordinator](#coordinator)
+- [Threading model and ordering](#threading-model)
+- [Incremental checkpointing](#incremental-checkpoints)
+- [State management](#state-management)
+- [Fault tolerance of state](#fault-tolerance-of-state)
+- [Host affinity](#host-affinity)
+
+
+
+## Distributed execution
+
+### Task 
+
+![diagram-large](/img/{{site.version}}/learn/documentation/architecture/task-assignment.png)
+
+Samza scales your application by logically breaking it down into multiple 
tasks. A task is the unit of parallelism for your application. Each task 
consumes data from one partition of your input streams. The assignment of 
partitions to tasks never changes: if a task is on a machine that fails, the 
task is restarted elsewhere, still consuming the same stream partitions. Since 
there is no ordering of messages across partitions, it allows tasks to execute 
entirely independent of each other without sharing any state. 
+
+
+### Container 
+![diagram-large](/img/{{site.version}}/learn/documentation/architecture/distributed-execution.png)
+
+Just like a task is the logical unit of parallelism for your application, a 
container is the physical unit. You can think of each worker as a JVM process, 
which runs one or more tasks. An application typically has multiple containers 
distributed across hosts. 
+
+### Coordinator
+Each application also has a coordinator which manages the assignment of tasks 
across the individual containers. The coordinator monitors the liveness of 
individual containers and redistributes the tasks among the remaining ones 
during a failure. <br/><br/>
+The coordinator itself is pluggable, enabling Samza to support multiple 
deployment options. You can use Samza as a light-weight embedded library that 
easily integrates with a larger application. Alternately, you can deploy and 
run it as a managed framework using a cluster-manager like YARN. It is worth 
noting that Samza is the only system that offers first-class support for both 
these deployment options. Some systems like Kafka-streams only support the 
embedded library model while others like Flink, Spark streaming etc., only 
offer the framework model for stream-processing.
+
+## Threading model and ordering
+
+Samza offers a flexible threading model to run each task. When running your 
applications, you can control the number of workers needed to process your 
data. You can also configure the number of threads each worker uses to run its 
assigned tasks. Each thread can run one or more tasks. Tasks donât share any 
state - hence, you donât have to worry about coordination across these 
threads. 
+
+Another common scenario in stream processing is to interact with remote 
services or databases. For example, a notifications system which processes each 
incoming message, generates an email and invokes a REST api to deliver it. 
Samza offers a fully asynchronous API for use-cases like this which require 
high-throughput remote I/O. 
+s
+By default, all messages delivered to a task are processed by the same thread. 
This guarantees in-order processing of messages within a partition. However, 
some applications donât care about in-order processing of messages. For such 
use-cases, Samza also supports processing messages out-of-order within a single 
partition. This typically offers higher throughput by allowing for multiple 
concurrent messages in each partition.
+
+## Incremental checkpointing 
+![diagram-large](/img/{{site.version}}/learn/documentation/architecture/incremental-checkpointing.png)
+
+Samza guarantees that messages wonât be lost, even if your job crashes, if a 
machine dies, if there is a network fault, or something else goes wrong. To 
achieve this property, each task periodically persists the last processed 
offsets for its input stream partitions. If a task needs to be restarted on a 
different worker due to a failure, it resumes processing from its latest 
checkpoint. 
+
+Samzaâs checkpointing mechanism ensures each task also stores the contents 
of its state-store consistently with its last processed offsets. Checkpoints 
are flushed incrementally ie., the state-store only flushes the delta since the 
previous checkpoint instead of flushing its entire state.
+
+## State management
+Samza offers scalable, high-performance storage to enable you to build 
stateful stream-processing applications. This is implemented by associating 
each Samza task with its own instance of a local database (aka. a state-store). 
The state-store associated with a particular task only stores data 
corresponding to the partitions processed by that task. This is important: when 
you scale out your job by giving it more computing resources, Samza 
transparently migrates the tasks from one machine to another. By giving each 
task its own state, tasks can be relocated without affecting your overall 
application. 
+![diagram-large](/img/{{site.version}}/learn/documentation/architecture/state-store.png)
+
+Here are some key advantages of this architecture. <br/>
+- The state is stored on disk, so the job can maintain more state than would 
fit in memory. <br/>
+- It is stored on the same machine as the task, to avoid the performance 
problems of making database queries over the network. <br/>
+- Each job has its own store, to avoid the isolation issues in a shared remote 
database (if you make an expensive query, it affects only the current task, 
nobody else). <br/>
+- Different storage engines can be plugged in - for example, a remote 
data-store that enables richer query capabilities <br/>
+
+## Fault tolerance of state
+Distributed stream processing systems need recover quickly from failures to 
resume their processing. While having a durable local store offers great 
performance, we should still guarantee fault-tolerance. For this purpose, Samza 
replicates every change to the local store into a separate stream (aka. called 
a changelog for the store). This allows you to later recover the data in the 
store by reading the contents of the changelog from the beginning. A 
log-compacted Kafka topic is typically used as a changelog since Kafka 
automatically retains the most recent value for each key.
+![diagram-large](/img/{{site.version}}/learn/documentation/architecture/fault-tolerance.png)
+
+## Host affinity
+If your application has several terabytes of state, then bootstrapping it 
every time by reading the changelog will stall progress. So, itâs critical to 
be able to recover state swiftly during failures. For this purpose, Samza takes 
data-locality into account when scheduling tasks on hosts. This is implemented 
by persisting metadata about the host each task is currently running on. 
+
+During a new deployment of the application, Samza tries to re-schedule the 
tasks on the same hosts they were previously on. This enables the task to 
re-use the snapshot of its local-state from its previous run on that host. We 
call this feature _host-affinity_ since it tries to preserve the assignment of 
tasks to hosts. This is a key differentiator that enables Samza applications to 
scale to several terabytes of local-state with effectively zero downtime.
+
 

http://git-wip-us.apache.org/repos/asf/samza/blob/992b217d/docs/learn/documentation/versioned/core-concepts/core-concepts.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/core-concepts/core-concepts.md 
b/docs/learn/documentation/versioned/core-concepts/core-concepts.md
index 449b338..52965a2 100644
--- a/docs/learn/documentation/versioned/core-concepts/core-concepts.md
+++ b/docs/learn/documentation/versioned/core-concepts/core-concepts.md
@@ -18,6 +18,69 @@ title: Core concepts
    See the License for the specific language governing permissions and
    limitations under the License.
 -->
+- [Introduction](#introduction)
+- [Streams, Partitions](#streams-partitions)
+- [Stream Application](#stream-application)
+- [State](#state)
+- [Time](#time)
+- [Processing guarantee](#processing-guarantee)
 
-## Core concepts page
+## Introduction
+
+Apache Samza is a scalable data processing engine that allows you to process 
and analyze your data in real-time. Here is a summary of Samzaâs features 
that simplify building your applications:
+
+_**Unified API:**_ Use a simple API to describe your application-logic in a 
manner independent of your data-source. The same API can process both batch and 
streaming data.
+
+*Pluggability at every level:* Process and transform data from any source. 
Samza offers built-in integrations with [Apache 
Kafka](/learn/documentation/{{site.version}}/connectors/kafka.html), [AWS 
Kinesis](/learn/documentation/{{site.version}}/connectors/kinesis.html), [Azure 
EventHubs](/learn/documentation/{{site.version}}/connectors/kinesis.html), 
ElasticSearch and [Apache 
Hadoop](/learn/documentation/{{site.version}}/connectors/hdfs.html). Also, 
itâs quite easy to integrate with your own sources.
+
+*Samza as an embedded library:* Integrate effortlessly with your existing 
applications eliminating the need to spin up and operate a separate cluster for 
stream processing. Samza can be used as a light-weight client-library 
[embedded](/learn/documentation/{{site.version}}/deployment/standalone.html) in 
your Java/Scala applications. 
+
+*Write once, Run anywhere:* [Flexible deployment 
options](/learn/documentation/{{site.version}}/deployment/deployment-model.html)
  to run applications anywhere - from public clouds to containerized 
environments to bare-metal hardware.
+
+*Samza as a managed service:* Run stream-processing as a managed service by 
integrating with popular cluster-managers including [Apache 
YARN](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html).
 
+
+*Fault-tolerance:*  Transparently migrate tasks along with their associated 
state in the event of failures. Samza supports 
[host-affinity](/learn/documentation/{{site.version}}/architecture/architecture-overview.html#host-affinity)
 and [incremental 
checkpointing](/learn/documentation/{{site.version}}/architecture/architecture-overview.html#incremental-checkpoints)
 to enable fast recovery from failures.
+
+*Massive scale:* Battle-tested on applications that use several terabytes of 
state and run on thousands of cores. It [powers](/powered-by/) multiple large 
companies including LinkedIn, Uber, TripAdvisor, Slack etc. 
+
+Next, we will introduce Samzaâs terminology. You will realize that it is 
extremely easy to get started with [building](/quickstart/{{site.version}}) 
your first stream-processing application. 
+
+
+## Streams, Partitions
+Samza processes your data in the form of streams. A _stream_ is a collection 
of immutable messages, usually of the same type or category. Each message in a 
stream is modelled as a key-value pair. 
+
+![diagram-medium](/img/{{site.version}}/learn/documentation/core-concepts/streams-partitions.png)
+<br/>
+A stream can have multiple producers that write data to it and multiple 
consumers that read data from it. Data in a stream can be unbounded (eg: a 
Kafka topic) or bounded (eg: a set of files on HDFS). 
+
+A stream is sharded into multiple partitions for scaling how its data is 
processed. Each _partition_ is an ordered, replayable sequence of records. When 
a message is written to a stream, it ends up in one its partitions. Each 
message in a partition is uniquely identified by an _offset_. 
+
+Samza supports for pluggable systems that can implement the stream 
abstraction. As an example, Kafka implements a stream as a topic while another 
database might implement a stream as a sequence of updates to its tables.
+
+## Stream Application
+A _stream application_ processes messages from input streams, transforms them 
and emits results to an output stream or a database. It is built by chaining 
multiple operators, each of which take in one or more streams and transform 
them.
+
+![diagram-medium](/img/{{site.version}}/learn/documentation/core-concepts/stream-application.png)
+
+Samza offers three top-level APIs to help you build your stream applications: 
<br/>
+1. The [Samza Streams 
DSL](/learn/documentation/{{site.version}}/api/high-level-api.html),  which 
offers several built-in operators like map, filter etc. This is the recommended 
API for most use-cases. <br/>
+2. The [low-level 
API](/learn/documentation/{{site.version}}/api/low-level-api.html), which 
allows greater flexibility to define your processing-logic and offers for 
greater control <br/>
+3. [Samza SQL](/learn/documentation/{{site.version}}/api/samza-sql.html), 
which offers a declarative SQL interface to create your applications <br/>
+
+## State
+Samza supports for both stateless and stateful stream processing. _Stateless 
processing_, as the name implies, does not retain any state associated with the 
current message after it has been processed. A good example of this is to 
filter an incoming stream of user-records by a field (eg:userId) and write the 
filtered messages to their own stream. 
+
+In contrast, _stateful processing_ requires you to record some state about a 
message even after processing it. Consider the example of counting the number 
of unique users to a website every five minutes. This requires you to record 
state about each user seen thus far, for deduping later. Samza offers a 
fault-tolerant, scalable state-store for this purpose.
+
+## Time
+Time is a fundamental concept in stream processing, especially how it is 
modeled and interpreted by the system. Samza supports two notions of dealing 
with time. By default, all built-in Samza operators use processing time. In 
processing time, the timestamp of a message is determined by when it is 
processed by the system. For example, an event generated by a sensor could be 
processed by Samza several milliseconds later. 
+
+On the other hand, in event time, the timestamp of an event is determined by 
when it actually occurred in the source. For example, a sensor which generates 
an event could embed the time of occurrence as a part of the event itself. 
Samza provides event-time based processing by its integration with [Apache 
BEAM](https://beam.apache.org/documentation/runners/samza/).
+
+## Processing guarantee
+Samza supports at-least once processing. As the name implies, this ensures 
that each message in the input stream is processed by the system at-least once. 
This guarantees no data-loss even when there are failures making Samza a 
practical choice for building fault-tolerant applications.
+
+
+Next Steps: We are now ready to have a closer look at Samzaâs architecture.
+## [Architecture 
&raquo;](/learn/documentation/{{site.version}}/architecture/architecture-overview.html)

samza git commit: Add top-level page for Samza's architecture

Reply via email to