Repository: samza
Updated Branches:
  refs/heads/master 2e6a79ad1 -> e25e0dab9


Use consistent font/heading sizes for all pages


Project: http://git-wip-us.apache.org/repos/asf/samza/repo
Commit: http://git-wip-us.apache.org/repos/asf/samza/commit/428861be
Tree: http://git-wip-us.apache.org/repos/asf/samza/tree/428861be
Diff: http://git-wip-us.apache.org/repos/asf/samza/diff/428861be

Branch: refs/heads/master
Commit: 428861be157b55dca6c5901a1c5e13b25bd7b1e6
Parents: 6cdcdef
Author: Jagadish <jvenkatra...@linkedin.com>
Authored: Tue Nov 27 03:18:22 2018 -0800
Committer: Jagadish <jvenkatra...@linkedin.com>
Committed: Tue Nov 27 03:20:28 2018 -0800

----------------------------------------------------------------------
 .../versioned/architecture/architecture-overview.md | 10 +++++-----
 .../documentation/versioned/connectors/eventhubs.md | 10 +++++-----
 .../documentation/versioned/connectors/hdfs.md      | 16 ++++++++--------
 .../documentation/versioned/connectors/kinesis.md   |  6 +++---
 .../versioned/core-concepts/core-concepts.md        | 14 +++++++-------
 .../documentation/versioned/deployment/yarn.md      | 13 ++++++-------
 6 files changed, 34 insertions(+), 35 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/samza/blob/428861be/docs/learn/documentation/versioned/architecture/architecture-overview.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/architecture/architecture-overview.md b/docs/learn/documentation/versioned/architecture/architecture-overview.md
index 282352c..8bfe574 100644
--- a/docs/learn/documentation/versioned/architecture/architecture-overview.md
+++ b/docs/learn/documentation/versioned/architecture/architecture-overview.md
@@ -49,7 +49,7 @@ Just like a task is the logical unit of parallelism for your application, a cont
 Each application also has a coordinator which manages the assignment of tasks 
across the individual containers. The coordinator monitors the liveness of 
individual containers and redistributes the tasks among the remaining ones 
during a failure. <br/><br/>
 The coordinator itself is pluggable, enabling Samza to support multiple deployment options. You can use Samza as a light-weight embedded library that easily integrates with a larger application. Alternatively, you can deploy and run it as a managed framework using a cluster-manager like YARN. It is worth noting that Samza is the only system that offers first-class support for both these deployment options. Some systems like Kafka Streams only support the embedded-library model, while others like Flink and Spark Streaming only offer the framework model for stream-processing.
 
-## Threading model and ordering
+### Threading model and ordering
 
 Samza offers a flexible threading model to run each task. When running your 
applications, you can control the number of workers needed to process your 
data. You can also configure the number of threads each worker uses to run its 
assigned tasks. Each thread can run one or more tasks. Tasks don’t share any 
state - hence, you don’t have to worry about coordination across these 
threads. 
 
@@ -57,14 +57,14 @@ Another common scenario in stream processing is to interact with remote services
 By default, all messages delivered to a task are processed by the same thread. 
This guarantees in-order processing of messages within a partition. However, 
some applications don’t care about in-order processing of messages. For such 
use-cases, Samza also supports processing messages out-of-order within a single 
partition. This typically offers higher throughput by allowing for multiple 
concurrent messages in each partition.
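
For readers who want to see how this is wired up, here is a minimal sketch using Samza's standard job properties. The config names below come from the general configuration table, not from this commit, so treat them as assumptions:

{% highlight jproperties %}
# run this container's tasks on a shared pool of 16 threads (assumed config name)
job.container.thread.pool.size=16
# allow up to 4 in-flight messages per task, enabling out-of-order processing
# within a partition at the cost of ordering guarantees (assumed config name)
task.max.concurrency=4
{% endhighlight %}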
 
-## Incremental checkpointing 
+### Incremental checkpointing 
 
![diagram-large](/img/{{site.version}}/learn/documentation/architecture/incremental-checkpointing.png)
 
 Samza guarantees that messages won’t be lost, even if your job crashes, if a 
machine dies, if there is a network fault, or something else goes wrong. To 
achieve this property, each task periodically persists the last processed 
offsets for its input stream partitions. If a task needs to be restarted on a 
different worker due to a failure, it resumes processing from its latest 
checkpoint. 
 
 Samza’s checkpointing mechanism ensures that each task also stores the contents of its state-store consistently with its last processed offsets. Checkpoints are flushed incrementally, i.e., the state-store only flushes the delta since the previous checkpoint instead of flushing its entire state.
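
As an illustration, the checkpoint backend and commit interval are configurable. A minimal sketch, assuming the standard Kafka checkpoint-manager configs (not part of this diff):

{% highlight jproperties %}
# persist checkpoints to a Kafka-backed checkpoint topic
task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
task.checkpoint.system=kafka
# commit processed offsets (and flush state deltas) every 10 seconds
task.commit.ms=10000
{% endhighlight %}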
 
-## State management
+### State management
 Samza offers scalable, high-performance storage to enable you to build 
stateful stream-processing applications. This is implemented by associating 
each Samza task with its own instance of a local database (aka. a state-store). 
The state-store associated with a particular task only stores data 
corresponding to the partitions processed by that task. This is important: when 
you scale out your job by giving it more computing resources, Samza 
transparently migrates the tasks from one machine to another. By giving each 
task its own state, tasks can be relocated without affecting your overall 
application. 
 
![diagram-large](/img/{{site.version}}/learn/documentation/architecture/state-store.png)
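
A minimal sketch of how a task reads and writes its local store, assuming the classic low-level API signatures (the class, store name "user-counts", and types are hypothetical):

{% highlight java %}
import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class UserCountTask implements StreamTask, InitableTask {
  private KeyValueStore<String, Integer> store;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    // each task gets its own instance of the store configured as "user-counts"
    store = (KeyValueStore<String, Integer>) context.getStore("user-counts");
  }

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
    // count messages per user; state stays local to this task's partitions
    String userId = (String) envelope.getKey();
    Integer count = store.get(userId);
    store.put(userId, count == null ? 1 : count + 1);
  }
}
{% endhighlight %}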
 
@@ -74,11 +74,11 @@ Here are some key advantages of this architecture. <br/>
 - Each job has its own store, to avoid the isolation issues of a shared remote database (if you make an expensive query, it affects only the current task and nobody else). <br/>
 - Different storage engines can be plugged in - for example, a remote data-store that enables richer query capabilities. <br/>
 
-## Fault tolerance of state
+### Fault tolerance of state
 Distributed stream processing systems need to recover quickly from failures to resume their processing. While having a durable local store offers great performance, we should still guarantee fault-tolerance. For this purpose, Samza replicates every change to the local store into a separate stream (aka. a changelog for the store). This allows you to later recover the data in the store by reading the contents of the changelog from the beginning. A log-compacted Kafka topic is typically used as a changelog, since Kafka automatically retains the most recent value for each key.
 
![diagram-large](/img/{{site.version}}/learn/documentation/architecture/fault-tolerance.png)
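
A typical store-plus-changelog configuration looks roughly like this (store and topic names are placeholders; the factory and serde class names follow Samza's standard configs):

{% highlight jproperties %}
serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory
serializers.registry.integer.class=org.apache.samza.serializers.IntegerSerdeFactory

stores.user-counts.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
stores.user-counts.key.serde=string
stores.user-counts.msg.serde=integer
# replicate every write into a log-compacted Kafka topic for recovery
stores.user-counts.changelog=kafka.user-counts-changelog
{% endhighlight %}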
 
-## Host affinity
+### Host affinity
 If your application has several terabytes of state, then bootstrapping it 
every time by reading the changelog will stall progress. So, it’s critical to 
be able to recover state swiftly during failures. For this purpose, Samza takes 
data-locality into account when scheduling tasks on hosts. This is implemented 
by persisting metadata about the host each task is currently running on. 
 
 During a new deployment of the application, Samza tries to re-schedule the 
tasks on the same hosts they were previously on. This enables the task to 
re-use the snapshot of its local-state from its previous run on that host. We 
call this feature _host-affinity_ since it tries to preserve the assignment of 
tasks to hosts. This is a key differentiator that enables Samza applications to 
scale to several terabytes of local-state with effectively zero downtime.
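
Host affinity itself is a single switch in job configuration, assuming the standard config name:

{% highlight jproperties %}
# ask the cluster-manager to place each task back on its previous host,
# so it can re-use the local state snapshot already on disk
job.host-affinity.enabled=true
{% endhighlight %}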

http://git-wip-us.apache.org/repos/asf/samza/blob/428861be/docs/learn/documentation/versioned/connectors/eventhubs.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/connectors/eventhubs.md b/docs/learn/documentation/versioned/connectors/eventhubs.md
index 9fdc861..11de3ff 100644
--- a/docs/learn/documentation/versioned/connectors/eventhubs.md
+++ b/docs/learn/documentation/versioned/connectors/eventhubs.md
@@ -19,7 +19,7 @@ title: Event Hubs Connector
    limitations under the License.
 -->
 
-## EventHubs I/O: QuickStart
+### EventHubs I/O: QuickStart
 
 The Samza EventHubs connector provides access to [Azure EventHubs](https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-features), Microsoft’s data streaming service on Azure. An Event Hub is similar to a Kafka topic and can have multiple partitions with producers and consumers. Each message produced or consumed from an event hub is an instance of [EventData](https://docs.microsoft.com/en-us/java/api/com.microsoft.azure.eventhubs._event_data).
 
 
@@ -67,9 +67,9 @@ Hence, you should also provide your SAS keys and tokens to access the stream. Yo
 ####Data Model
 Each event produced and consumed from an EventHubs stream is an instance of 
[EventData](https://docs.microsoft.com/en-us/java/api/com.microsoft.azure.eventhubs._event_data),
 which wraps a byte-array payload. When producing to EventHubs, Samza 
serializes your object into an `EventData` payload before sending it over the 
wire. Likewise, when consuming messages from EventHubs, messages are 
de-serialized into typed objects using the provided Serde. 
 
-## Configuration
+### Configuration
 
-###Producer partitioning
+####Producer partitioning
 
 You can use `#withPartitioningMethod` to control how outgoing messages are 
partitioned. The following partitioning schemes are supported:
 
@@ -85,7 +85,7 @@ EventHubsSystemDescriptor systemDescriptor = new EventHubsSystemDescriptor("even
 {% endhighlight %}
 
 
-### Consumer groups
+#### Consumer groups
 
 Event Hubs supports the notion of [consumer 
groups](https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-features#consumer-groups)
 which enable multiple applications to have their own view of the event stream. 
Each partition is exclusively consumed by one consumer in the group. Each event 
hub stream has a pre-defined consumer group named $Default. You can define your 
own consumer group for your job using `withConsumerGroup`.
 
@@ -97,7 +97,7 @@ EventHubsInputDescriptor<KV<String, String>> inputDescriptor =
 {% endhighlight %}
 
 
-### Consumer buffer size
+#### Consumer buffer size
 
 When the consumer reads a message from EventHubs, it appends it to a shared producer-consumer queue corresponding to its partition. This config determines the per-partition queue size. Setting a higher value for this config typically achieves higher throughput at the expense of increased on-heap memory.
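
A sketch of tuning this buffer; the property name below (`systems.<system-name>.eventhubs.receive.queue.size`) is assumed from the EventHubs system configuration and is not part of this diff:

{% highlight jproperties %}
# hold up to 500 prefetched events per partition (assumed property name)
systems.eventhubs.eventhubs.receive.queue.size=500
{% endhighlight %}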
 

http://git-wip-us.apache.org/repos/asf/samza/blob/428861be/docs/learn/documentation/versioned/connectors/hdfs.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/connectors/hdfs.md b/docs/learn/documentation/versioned/connectors/hdfs.md
index 9b79f24..ece7bbf 100644
--- a/docs/learn/documentation/versioned/connectors/hdfs.md
+++ b/docs/learn/documentation/versioned/connectors/hdfs.md
@@ -19,17 +19,17 @@ title: HDFS Connector
    limitations under the License.
 -->
 
-## Overview
+### Overview
 
 The HDFS connector allows your Samza jobs to read data stored in HDFS files. 
Likewise, you can write processed results to HDFS. 
 To interact with HDFS, Samza requires your job to run on the same YARN cluster.
 
-## Consuming from HDFS
-### Input Partitioning
+### Consuming from HDFS
+#### Input Partitioning
 
 Partitioning works at the level of individual directories and files. Each 
directory is treated as its own stream and each of its files is treated as a 
_partition_. For example, Samza creates 5 partitions when it's reading from a 
directory containing 5 files. There is no way to parallelize the consumption 
when reading from a single file - you can only have one container to process 
the file.
 
-### Input Event format
+#### Input Event format
 Samza supports Avro natively, and it's easy to extend to other serialization formats. Each Avro record read from HDFS is wrapped into a message-envelope. The [envelope](../api/javadocs/org/apache/samza/system/IncomingMessageEnvelope.html) contains these 3 fields:
 
 - The key, which is empty
@@ -40,12 +40,12 @@ Samza supports avro natively, and it's easy to extend to other serialization for
 
 To support non-Avro input formats, you can implement the [SingleFileHdfsReader](https://github.com/apache/samza/blob/master/samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/SingleFileHdfsReader.java) interface.
 
-### EndOfStream
+#### EndOfStream
 
 While streaming sources like Kafka are unbounded, files on HDFS have finite 
data and have a notion of EOF. When reading from HDFS, your Samza job 
automatically exits after consuming all the data. You can implement 
[EndOfStreamListenerTask](../api/javadocs/org/apache/samza/task/EndOfStreamListenerTask.html)
 to get a callback once EOF has been reached. 
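
A minimal sketch of such a listener (the class name and method bodies are illustrative; the interface and its callback signature are from the Samza task API):

{% highlight java %}
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.EndOfStreamListenerTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class HdfsBatchTask implements StreamTask, EndOfStreamListenerTask {
  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
    // process each record read from this task's HDFS partitions
  }

  @Override
  public void onEndOfStream(MessageCollector collector, TaskCoordinator coordinator) {
    // invoked once all files in this task's partitions are fully consumed,
    // e.g. flush any buffered output before the job exits
  }
}
{% endhighlight %}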
 
 
-### Defining streams
+#### Defining streams
 
 Samza uses the notion of a _system_ to describe any I/O source it interacts with. To consume from HDFS, you should create a new system that points to `HdfsSystemFactory`. You can then associate multiple streams with this _system_. Each stream should have a _physical name_, which should be set to the name of the directory on HDFS.
 
@@ -68,7 +68,7 @@ systems.hdfs.partitioner.defaultPartitioner.blacklist=somefile.avro
 {% endhighlight %}
 
 
-## Producing to HDFS
+### Producing to HDFS
 
 #### Output format
 
@@ -104,7 +104,7 @@ systems.hdfs.producer.hdfs.write.batch.size.bytes=134217728
 systems.hdfs.producer.hdfs.write.batch.size.records=10000
 {% endhighlight %}
 
-## Security 
+### Security 
 
 You can access Kerberos-enabled HDFS clusters by providing your principal and 
the path to your key-tab file. Samza takes care of automatically creating and 
renewing your Kerberos tokens periodically. 
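
The corresponding configuration is short; a sketch, with the principal and keytab path as placeholders (the config names follow Samza's YARN security settings):

{% highlight jproperties %}
job.security.manager.factory=org.apache.samza.job.yarn.SamzaYarnSecurityManagerFactory
sensitive.yarn.kerberos.principal=your-principal@EXAMPLE.COM
sensitive.yarn.kerberos.keytab=/etc/security/keytabs/your.keytab
{% endhighlight %}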
 

http://git-wip-us.apache.org/repos/asf/samza/blob/428861be/docs/learn/documentation/versioned/connectors/kinesis.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/connectors/kinesis.md b/docs/learn/documentation/versioned/connectors/kinesis.md
index 85149f6..e319e92 100644
--- a/docs/learn/documentation/versioned/connectors/kinesis.md
+++ b/docs/learn/documentation/versioned/connectors/kinesis.md
@@ -19,7 +19,7 @@ title: Kinesis Connector
    limitations under the License.
 -->
 
-## Kinesis I/O: Quickstart
+### Kinesis I/O: Quickstart
 
 The Samza Kinesis connector allows you to interact with [Amazon Kinesis Data 
Streams](https://aws.amazon.com/kinesis/data-streams),
 Amazon’s data streaming service. The `hello-samza` project includes an 
example of processing Kinesis streams using Samza. Here is the complete [source 
code](https://github.com/apache/samza-hello-samza/blob/master/src/main/java/samza/examples/kinesis/KinesisHelloSamza.java)
 and 
[configs](https://github.com/apache/samza-hello-samza/blob/master/src/main/config/kinesis-hello-samza.properties).
@@ -32,9 +32,9 @@ Each message consumed from the stream is an instance of a Kinesis [Record](http:
 Samza’s 
[KinesisSystemConsumer](https://github.com/apache/samza/blob/master/samza-aws/src/main/java/org/apache/samza/system/kinesis/consumer/KinesisSystemConsumer.java)
 wraps the Record into a 
[KinesisIncomingMessageEnvelope](https://github.com/apache/samza/blob/master/samza-aws/src/main/java/org/apache/samza/system/kinesis/consumer/KinesisIncomingMessageEnvelope.java).
 
-## Consuming from Kinesis
+### Consuming from Kinesis
 
-### Basic Configuration
+#### Basic Configuration
 
 Here is the required configuration for consuming messages from Kinesis. 
 

http://git-wip-us.apache.org/repos/asf/samza/blob/428861be/docs/learn/documentation/versioned/core-concepts/core-concepts.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/core-concepts/core-concepts.md b/docs/learn/documentation/versioned/core-concepts/core-concepts.md
index b69de3d..8e2ce93 100644
--- a/docs/learn/documentation/versioned/core-concepts/core-concepts.md
+++ b/docs/learn/documentation/versioned/core-concepts/core-concepts.md
@@ -25,7 +25,7 @@ title: Core concepts
 - [Time](#time)
 - [Processing guarantee](#processing-guarantee)
 
-## Introduction
+### Introduction
 
 Apache Samza is a scalable data processing engine that allows you to process 
and analyze your data in real-time. Here is a summary of Samza’s features 
that simplify building your applications:
 
@@ -46,7 +46,7 @@ _**Unified API:**_ Use a simple API to describe your application-logic in a mann
 Next, we will introduce Samza’s terminology. You will realize that it is 
extremely easy to [get started](/quickstart/{{site.version}}) with building 
your first application. 
 
 
-## Streams, Partitions
+### Streams, Partitions
 Samza processes your data in the form of streams. A _stream_ is a collection 
of immutable messages, usually of the same type or category. Each message in a 
stream is modelled as a key-value pair. 
 
 
![diagram-medium](/img/{{site.version}}/learn/documentation/core-concepts/streams-partitions.png)
@@ -57,7 +57,7 @@ A stream is sharded into multiple partitions for scaling how its data is process
 
 Samza supports pluggable systems that can implement the stream abstraction. As 
an example, Kafka implements a stream as a topic while a database might 
implement a stream as a sequence of updates to its tables.
 
-## Stream Application
+### Stream Application
 A _stream application_ processes messages from input streams, transforms them and emits results to an output stream or a database. It is built by chaining multiple operators, each of which takes in one or more streams and transforms them.
 
 
![diagram-medium](/img/{{site.version}}/learn/documentation/core-concepts/stream-application.png)
@@ -67,20 +67,20 @@ Samza offers three top-level APIs to help you build your stream applications: <b
 2. The [Low Level Task 
API](/learn/documentation/{{site.version}}/api/low-level-api.html), which 
allows greater flexibility to define your processing-logic and offers greater 
control <br/>
 3. [Samza SQL](/learn/documentation/{{site.version}}/api/samza-sql.html), 
which offers a declarative SQL interface to create your applications <br/>
 
-## State
+### State
 Samza supports both stateless and stateful stream processing. _Stateless processing_, as the name implies, does not retain any state associated with the current message after it has been processed. A good example of this is filtering an incoming stream of user-records by a field (e.g., userId) and writing the filtered messages to their own stream.
 
 In contrast, _stateful processing_ requires you to record some state about a message even after processing it. Consider the example of counting the number of unique users visiting a website every five minutes. This requires you to store information about each user seen thus far for de-duplication. Samza offers a fault-tolerant, scalable state-store for this purpose.
 
-## Time
+### Time
 Time is a fundamental concept in stream processing, especially in how it is 
modeled and interpreted by the system. Samza supports two notions of time. By 
default, all built-in Samza operators use processing time. In processing time, 
the timestamp of a message is determined by when it is processed by the system. 
For example, an event generated by a sensor could be processed by Samza several 
milliseconds later. 
 
 On the other hand, in event time, the timestamp of an event is determined by when it actually occurred at the source. For example, a sensor which generates an event could embed the time of occurrence as a part of the event itself. Samza provides event-time based processing through its integration with [Apache Beam](https://beam.apache.org/documentation/runners/samza/).
 
-## Processing guarantee
+### Processing guarantee
 Samza supports at-least once processing. As the name implies, this ensures 
that each message in the input stream is processed by the system at-least once. 
This guarantees no data-loss even when there are failures, thereby making Samza 
a practical choice for building fault-tolerant applications.
 
 
 Next Steps: We are now ready to have a closer look at Samza’s architecture.
-## [Architecture &raquo;](/learn/documentation/{{site.version}}/architecture/architecture-overview.html)
+### [Architecture &raquo;](/learn/documentation/{{site.version}}/architecture/architecture-overview.html)
 

http://git-wip-us.apache.org/repos/asf/samza/blob/428861be/docs/learn/documentation/versioned/deployment/yarn.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/deployment/yarn.md b/docs/learn/documentation/versioned/deployment/yarn.md
index b32ba68..3a46cea 100644
--- a/docs/learn/documentation/versioned/deployment/yarn.md
+++ b/docs/learn/documentation/versioned/deployment/yarn.md
@@ -20,11 +20,10 @@ title: Run on YARN
 -->
 
 - [Introduction](#introduction)
-- [Starting your application on YARN](#starting-your-application-on-yarn)
+- [Running on YARN: Quickstart](#running-on-yarn-quickstart)
     - [Setting up a single node YARN cluster](#setting-up-a-single-node-yarn-cluster-optional)
     - [Submitting the application to YARN](#submitting-the-application-to-yarn)
 - [Application Master UI](#application-master-ui)
-- [Viewing logs](#viewing-logs)
 - [Configuration](#configuration)
     - [Configuring parallelism](#configuring-parallelism)
     - [Configuring resources](#configuring-resources)
@@ -45,13 +44,13 @@ title: Run on YARN
 - [Coordinator Internals](#coordinator-internals)
 
 
-## Introduction
+### Introduction
 
 Apache YARN is part of the Hadoop project and provides the ability to run distributed applications on a cluster. A YARN cluster minimally consists of a Resource Manager (RM) and multiple Node Managers (NM). The RM is responsible for managing the resources in the cluster and allocating them to applications. Every node in the cluster runs an NM, which is responsible for managing containers on that node - starting them, monitoring their resource usage, and reporting the same to the RM.
 
 Applications are run on the cluster by implementing a coordinator called an ApplicationMaster (AM). The AM is responsible for requesting resources, including CPU and memory, from the RM on behalf of the application. Samza provides its own implementation of the AM for each job.
 
-## Running on YARN: Quickstart
+### Running on YARN: Quickstart
 
 We will demonstrate running a Samza application on YARN by using the `hello-samza` example. Let's first check out the repository.
 
@@ -61,7 +60,7 @@ cd samza-hello-samza
 git checkout latest
 ```
 
-### Set up a single node YARN cluster
+#### Set up a single node YARN cluster
 
 You can use the `grid` script included as part of the [hello-samza](https://github.com/apache/samza-hello-samza/) repository to set up a single-node cluster. The script also starts Zookeeper and Kafka locally.
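
For reference, the script is typically invoked as follows (command taken from the hello-samza README):

```bash
# download, install and start YARN, Kafka and Zookeeper locally
./bin/grid bootstrap
```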
 
@@ -104,7 +103,7 @@ $ ./deploy/samza/bin/run-app.sh --config-factory=org.apache.samza.config.factori
 Congratulations, you've successfully submitted your first job to YARN! You can use the YARN Web UI to check its status.
 
 
-## Application Master UI
+### Application Master UI
 
 The YARN RM provides a Web UI to view the status of applications in the 
cluster, their containers and logs. By default, it can be accessed from 
`localhost:8088` on the RM host. 
 
![diagram-medium](/img/{{site.version}}/learn/documentation/yarn/yarn-am-ui.png)
@@ -127,7 +126,7 @@ Samza's Application Master UI provides you the ability to view:
 
![diagram-small](/img/{{site.version}}/learn/documentation/yarn/am-runtime-configs.png)
 
 
-### Configurations
+### Configuration
 
 In this section, we'll look at configuring your jobs when running on YARN.
 
