Repository: samza
Updated Branches:
  refs/heads/master f529722f5 -> 6f40414d7


Clean up docs for standalone

Author: Jagadish <jvenkatra...@linkedin.com>

Reviewers: Jagadish<jagad...@apache.org>

Closes #817 from vjagadish1989/website-reorg32


Project: http://git-wip-us.apache.org/repos/asf/samza/repo
Commit: http://git-wip-us.apache.org/repos/asf/samza/commit/6f40414d
Tree: http://git-wip-us.apache.org/repos/asf/samza/tree/6f40414d
Diff: http://git-wip-us.apache.org/repos/asf/samza/diff/6f40414d

Branch: refs/heads/master
Commit: 6f40414d7d6c04db20dac444211b6bfec6cf034b
Parents: f529722
Author: Jagadish <jvenkatra...@linkedin.com>
Authored: Tue Nov 27 00:28:39 2018 -0800
Committer: Jagadish <jvenkatra...@linkedin.com>
Committed: Tue Nov 27 00:28:39 2018 -0800

----------------------------------------------------------------------
 .../versioned/deployment/standalone.md          | 177 ++++---------------
 1 file changed, 35 insertions(+), 142 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/samza/blob/6f40414d/docs/learn/documentation/versioned/deployment/standalone.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/deployment/standalone.md 
b/docs/learn/documentation/versioned/deployment/standalone.md
index 146f9cb..d8bd541 100644
--- a/docs/learn/documentation/versioned/deployment/standalone.md
+++ b/docs/learn/documentation/versioned/deployment/standalone.md
@@ -1,6 +1,6 @@
 ---
 layout: page
-title: Run as embedded library.
+title: Run as embedded library
 ---
 <!--
    Licensed to the Apache Software Foundation (ASF) under one or more
@@ -20,145 +20,46 @@ title: Run as embedded library.
 -->
 
 - [Introduction](#introduction)
-- [User guide](#user-guide)
-     - [Setup dependencies](#setup-dependencies)
-     - [Configuration](#configuration)
-     - [Code sample](#code-sample)
-- [Quick start guide](#quick-start-guide)
-  - [Setup zookeeper](#setup-zookeeper)
-  - [Setup kafka](#setup-kafka)
-  - [Build binaries](#build-binaries)
-  - [Deploy binaries](#deploy-binaries)
-  - [Inspect results](#inspect-results)
+- [Quick start](#quick-start)
+  - [Installing Zookeeper and Kafka](#installing-zookeeper-and-kafka)
+  - [Building the binaries](#building-the-binaries)
+  - [Running the application](#running-the-application)
+  - [Inspecting results](#inspecting-results)
 - [Coordinator internals](#coordinator-internals)
 
-#
 
-# Introduction
+### Introduction
 
-With Samza 0.13.0, the deployment model of samza jobs has been simplified and 
decoupled from YARN. _Standalone_ model provides the stream processing 
capabilities of samza packaged in the form of a library with pluggable 
coordination. This library model offers an easier integration path  and 
promotes a flexible deployment model for an application. Using the standalone 
mode, you can leverage Samza processors directly in your application and deploy 
Samza applications to self-managed clusters.
+Often you want to embed and integrate Samza as a component within a larger application. To enable this, Samza supports a standalone mode of deployment that gives you greater control over your application’s life-cycle. In this model, Samza can be used just like any other library you import within your Java application, similar to using a high-level Kafka consumer to process your streams.
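+
+As a minimal sketch, here is roughly what embedding a processor looks like, adapted from the code sample that previously appeared on this page (imports are omitted, and `PageViewEvent` and the stream names are illustrative placeholders):
+
+```java
+public class PageViewEventExample implements StreamApplication {
+
+  public static void main(String[] args) {
+    // Parse the job configuration passed on the command line.
+    CommandLine cmdLine = new CommandLine();
+    OptionSet options = cmdLine.parser().parse(args);
+    Config config = cmdLine.loadConfig(options);
+
+    // Launch a stream processor embedded in this JVM and block until it finishes.
+    ApplicationRunner runner =
+        ApplicationRunners.getApplicationRunner(ApplicationClassUtils.fromConfig(config), config);
+    runner.run();
+    runner.waitForFinish();
+  }
+
+  @Override
+  public void describe(StreamAppDescriptor appDesc) {
+    // Pipe JSON-serialized page-view events from the input stream to the output stream.
+    MessageStream<PageViewEvent> pageViewEvents =
+        appDesc.getInputStream("inputStream", new JsonSerdeV2<>(PageViewEvent.class));
+    OutputStream<PageViewEvent> outputStream =
+        appDesc.getOutputStream("outputStream", new JsonSerdeV2<>(PageViewEvent.class));
+    pageViewEvents.sendTo(outputStream);
+  }
+}
+```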
 
-A standalone application typically is comprised of multiple _stream 
processors_. A _stream processor_ encapsulates a user defined processing 
function and is responsible for processing a subset of input topic partitions. 
A stream processor of a standalone application is uniquely identified by a 
_processorId_.
+You can increase your application’s capacity by spinning up multiple instances. These instances will then dynamically coordinate with each other and distribute work among themselves. If an instance fails for some reason, the tasks running on it are re-assigned to the remaining instances. By default, Samza uses Zookeeper for coordination across individual instances. The coordination logic itself is pluggable and can hence integrate with other frameworks.
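+
+For reference, the Zookeeper-based coordination is enabled through a handful of configs, similar to the sketch below (taken from the configuration section that previously appeared on this page; the connection string assumes a local Zookeeper):
+
+```bash
+# Use the Zookeeper-based job coordinator for coordination between instances.
+job.coordinator.factory=org.apache.samza.zk.ZkJobCoordinatorFactory
+# Zookeeper connection string (localhost:2181 for a local Zookeeper).
+job.coordinator.zk.connect=localhost:2181
+task.name.grouper.factory=org.apache.samza.container.grouper.task.GroupByContainerIdsFactory
+```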
 
-Samza provides pluggable job _coordinator_ layer to perform leader election 
and assign work to the stream processors. Standalone supports Zookeeper 
coordination out of the box and uses it for distributed coordination between 
the stream processors of standalone application. A processor can become part of 
a standalone application by setting its app.name(Ex: app.name=group\_1) and 
joining the group.
+This mode allows you to bring any cluster-manager or hosting-environment of your choice (e.g. Kubernetes, Marathon) to run your application. You are also free to manage memory limits and multi-tenancy on your own, since Samza is used as a lightweight library.
 
-In samza standalone, the input topic partitions are distributed between the 
available processors dynamically at runtime. In each standalone application, 
one stream processor will be chosen as a leader initially to mediate the 
assignment of input topic partitions to the stream processors. If the number of 
available processors changes(for example, if a processors is shutdown or 
added), then the leader processor will regenerate the partition assignments and 
re-distribute it to all the processors.
 
-On processor group change, the act of re-assigning input topic partitions to 
the remaining live processors in the group is known as rebalancing the group. 
On failure of the leader processor of a standalone application, an another 
stream processor of the standalone application will be chosen as leader.
+### Quick start
 
-## User guide
-
-Samza standalone is designed to help you to have more control over the 
deployment of the application. So it is your responsibility to configure and 
deploy the processors. In case of ZooKeeper coordination, you have to configure 
the URL for an instance of ZooKeeper.
-
-A stream processor is identified by a unique processorID which is generated by 
the pluggable ProcessorIdGenerator abstraction. ProcessorId of the stream 
processor is used with the coordination service. Samza supports UUID based 
ProcessorIdGenerator out of the box.
-
-The diagram below shows a input topic with three partitions and an standalone 
application with three processors consuming messages from it.
-
-<img 
src="/img/versioned/learn/documentation/standalone/standalone-application.jpg" 
alt="Standalone application" height="550px" width="700px" align="middle">
-
-When a group is first initialized, each stream processor typically starts 
processing messages from either the earliest or latest offset of the input 
topic partition. The messages in each partition are sequentially delivered to 
the user defined processing function. As the stream processor makes progress, 
it commits the offsets of the messages it has successfully processed. For 
example, in the figure above, the stream processor position is at offset 7 and 
its last committed offset is at offset 3.
-
-When a input partition is reassigned to another processor in the group, the 
initial position is set to the last committed offset. If the processor-1 in the 
example above suddenly crashed, then the live processor taking over the 
partition would begin consumption from offset 3. In that case, it would not 
have to reprocess the messages up to the crashed processor's position of 3.
-
-### Setup dependencies
-
-Add the following samza-standalone maven dependencies to your project.
-
-```xml
-<dependency>
-    <groupId>org.apache.samza</groupId>
-    <artifactId>samza-kafka_2.11</artifactId>
-    <version>1.0</version>
-</dependency>
-<dependency>
-    <groupId>org.apache.samza</groupId>
-    <artifactId>samza-core_2.11</artifactId>
-    <version>1.0</version>
-</dependency>
-<dependency>
-    <groupId>org.apache.samza</groupId>
-    <artifactId>samza-api</artifactId>
-    <version>1.0</version>
-</dependency>
-```
-
-### Configuration
-
-A samza standalone application requires you to define the following mandatory 
configurations:
-
-```bash
-job.coordinator.factory=org.apache.samza.zk.ZkJobCoordinatorFactory
-job.coordinator.zk.connect=your_zk_connection(for local zookeeper, use 
localhost:2181)
-task.name.grouper.factory=org.apache.samza.container.grouper.task.GroupByContainerIdsFactory
 
-```
-
-You have to configure the stream processor with the kafka brokers as defined 
in the following sample(we have assumed that the broker is running on 
localhost):
-
-```bash 
-systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
-systems.kafka.samza.msg.serde=json
-systems.kafka.consumer.zookeeper.connect=localhost:2181
-systems.kafka.producer.bootstrap.servers=localhost:9092 
-```
-
-### Code sample
-
-Here&#39;s a sample standalone application with app.name set to sample-test. 
Running this class would launch a stream processor.
-
-```java
-public class PageViewEventExample implements StreamApplication {
-
-  public static void main(String[] args) {
-    CommandLine cmdLine = new CommandLine();
-    OptionSet options = cmdLine.parser().parse(args);
-    Config config = cmdLine.loadConfig(options);
-
-    ApplicationRunner runner = 
ApplicationRunners.getApplicationRunner(ApplicationClassUtils.fromConfig(config),
 config);
-    runner.run();
-    runner.waitForFinish();
-  }
-
-  @Override
-  public void describe(StreamAppDescriptor appDesc) {
-     MessageStream<PageViewEvent> pageViewEvents = null;
-     pageViewEvents = appDesc.getInputStream("inputStream", new 
JsonSerdeV2<>(PageViewEvent.class));
-     OutputStream<KV<String, PageViewCount>> pageViewEventPerMemberStream =
-         appDesc.getOutputStream("outputStream",  new 
JsonSerdeV2<>(PageViewEvent.class));
-     pageViewEvents.sendTo(pageViewEventPerMemberStream);
-  }
-}
-```
-
-## Quick start guide
-
-The [Hello-samza](https://github.com/apache/samza-hello-samza/) project 
contains sample Samza standalone applications. Here are step by step 
instruction guide to install, build and run a standalone application binaries 
using the local zookeeper cluster for coordination. Check out the hello-samza 
project by running the following commands:
+The [Hello-samza](https://github.com/apache/samza-hello-samza/) project 
includes multiple examples of Samza standalone applications. Let us first check 
out the repository.
 
 ```bash
 git clone https://git.apache.org/samza-hello-samza.git hello-samza
 cd hello-samza 
 ```
 
-### Setup Zookeeper
-
-Run the following command to install and start a local zookeeper cluster.
-
-```bash
-./bin/grid install zookeeper
-./bin/grid start zookeeper
-```
 
-### Setup Kafka
+#### Installing Zookeeper and Kafka
 
-Run the following command to install and start a local kafka cluster.
+We will use the `./bin/grid` script from the `hello-samza` project to set up Zookeeper and Kafka locally.
 
 ```bash
-./bin/grid install kafka
+./bin/grid start zookeeper
 ./bin/grid start kafka
 ```
 
-### Build binaries
 
-Before you can run the standalone job, you need to build a package for it 
using the following command.
+#### Building the binaries
+
+Let us now build the `hello-samza` project from its sources.
 
 ```bash
 mvn clean package
@@ -166,52 +67,44 @@ mkdir -p deploy/samza
 tar -xvf ./target/hello-samza-1.0.0-dist.tar.gz -C deploy/samza
 ```
 
-### Deploy binaries
+#### Running the application
 
-To run the sample standalone application 
[WikipediaZkLocalApplication](https://github.com/apache/samza-hello-samza/blob/master/src/main/java/samza/examples/wikipedia/application/WikipediaZkLocalApplication.java)
+We are ready to run the example application 
[WikipediaZkLocalApplication](https://github.com/apache/samza-hello-samza/blob/master/src/main/java/samza/examples/wikipedia/application/WikipediaZkLocalApplication.java).
 This application reads messages from the wikipedia-edits topic, and calculates 
counts, every ten seconds, for all edits that were made during that window. It 
emits these results to another topic named `wikipedia-stats`.
 
 ```bash
-./bin/deploy.sh
 ./deploy/samza/bin/run-class.sh 
samza.examples.wikipedia.application.WikipediaZkLocalApplication  
--config-factory=org.apache.samza.config.factories.PropertiesConfigFactory 
--config-path=file://$PWD/deploy/samza/config/wikipedia-application-local-runner.properties
 ```
 
-### Inspect results
+You can run the above command again to spin up a new instance of your 
application.
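+
+For example, launching a second processor from another terminal uses the same command; the new instance joins the group, and the work is re-distributed between the two instances:
+
+```bash
+./deploy/samza/bin/run-class.sh samza.examples.wikipedia.application.WikipediaZkLocalApplication \
+  --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory \
+  --config-path=file://$PWD/deploy/samza/config/wikipedia-application-local-runner.properties
+```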
 
-The standalone application reads messages from the wikipedia-edits topic, and 
calculates counts, every ten seconds, for all edits that were made during that 
window. It outputs these counts to the local wikipedia-stats kafka topic. To 
inspect events in output topic, run the following command.
+#### Inspecting results
+
+To inspect events in the output topic, run the following command.
 
 ```bash
 ./deploy/kafka/bin/kafka-console-consumer.sh  --zookeeper localhost:2181 
--topic wikipedia-stats
 ```
 
-Events produced to the output topic from the standalone application launched 
above will be of the following form:
+You should see the output messages emitted to the Kafka topic.
 
 ```
 {"is-talk":2,"bytes-added":5276,"edits":13,"unique-titles":13}
 
{"is-bot-edit":1,"is-talk":3,"bytes-added":4211,"edits":30,"unique-titles":30,"is-unpatrolled":1,"is-new":2,"is-minor":7}
 ```
 
-# Coordinator internals
-
-A samza application is comprised of multiple stream processors. A processor 
can become part of a standalone application by setting its app.name(Ex: 
app.name=group\_1) and joining the group. In samza standalone, the input topic 
partitions are distributed between the available processors dynamically at 
runtime. If the number of available processors changes(for example, if some 
processors are shutdown or added), then the partition assignments will be 
regenerated and re-distributed to all the processors. One processor will be 
elected as leader and it will generate the partition assignments and distribute 
it to the other processors in the group.
-
-To mediate the partition assignments between processors, samza standalone 
relies upon a coordination service. The main responsibilities of coordination 
service are the following:
-
-**Leader Election** - Elects a single processor to generate the partition 
assignments and distribute it to other processors in the group.
-
-**Distributed barrier** - Coordination primitive used by the processors to 
reach consensus(agree) on an partition assignment.
-
-By default, embedded samza uses Zookeeper for coordinating between processors 
of an application and store the partition assignment state. Coordination 
sequence for a standalone application is listed below:
-
-1. Each processor(participant) will register with the coordination 
service(e.g: Zookeeper) with its participant ID.
+### Coordinator internals
 
-2. One of the participants will be elected as the leader.
+Samza runs your application by logically breaking its execution down into multiple tasks. A task is the unit of parallelism for your application, with each task consuming data from one or more partitions of your input streams.
+Obviously, there is a limit to the throughput you can achieve with a single instance. You can increase parallelism by simply running multiple instances of your application. The individual instances can be executed on the same machine or distributed across machines. Likewise, to scale down your parallelism, you can shut down a few instances of your application. Samza will coordinate among the available instances and dynamically assign the tasks to them. The coordination logic itself is pluggable, with a Zookeeper-based implementation provided out of the box.
 
-3. The leader will monitor the list of all the active participants.
+Here is a typical sequence of actions when using Zookeeper for coordination.
 
-4. Whenever the list of the participants changes in a group, the leader will 
generate a new partition assignments for the current participants and persist 
it to a common storage.
+1. Every time you spawn a new instance of your application, it registers itself with Zookeeper as a participant.
 
-5. Participants are notified that the new partition assignment is available. 
Notification is done through the coordination service(e.g. ZooKeeper).
+2. There is always a single leader, which acts as the coordinator. The coordinator manages the assignment of tasks across the individual instances. It also monitors the liveness of the instances and re-distributes their tasks among the remaining ones during a failure.
 
-6. The participants will stop processing, pick up the new partition 
assignment, and then resume processing.
+3. Whenever a new instance joins or leaves the group, it triggers a 
notification to the leader. The leader can then recompute assignments of tasks 
to the live instances.
 
-In order to ensure that no two partitions are processed by different 
processors, processing is paused and all the processors will synchronize on a 
distributed barrier. Once all the processors are paused, the new partition 
assignments are applied, after which processing resumes.
\ No newline at end of file
+4. Once the leader publishes the new task assignments, all running instances pick them up and resume processing.
