Repository: kafka
Updated Branches:
  refs/heads/trunk 281fac9ed -> 16f85e84c


MINOR: Improve introduction section in docs to better cover connect and 
streams. Make uses and ecosystem pages stand alone.


Project: http://git-wip-us.apache.org/repos/asf/kafka/repo
Commit: http://git-wip-us.apache.org/repos/asf/kafka/commit/16f85e84
Tree: http://git-wip-us.apache.org/repos/asf/kafka/tree/16f85e84
Diff: http://git-wip-us.apache.org/repos/asf/kafka/diff/16f85e84

Branch: refs/heads/trunk
Commit: 16f85e84ce8c01329b3298d5fca8b27cfcacddb6
Parents: 281fac9
Author: Jay Kreps <jay.kr...@gmail.com>
Authored: Wed Sep 28 16:30:21 2016 -0700
Committer: Jay Kreps <jay.kr...@gmail.com>
Committed: Wed Sep 28 16:30:21 2016 -0700

----------------------------------------------------------------------
 docs/api.html           | 187 +++++++++++--------------------------------
 docs/documentation.html |  33 ++++----
 docs/ecosystem.html     |   2 -
 docs/introduction.html  | 148 +++++++++++++++++++++++++---------
 docs/uses.html          |   2 -
 5 files changed, 170 insertions(+), 202 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kafka/blob/16f85e84/docs/api.html
----------------------------------------------------------------------
diff --git a/docs/api.html b/docs/api.html
index 8e4409d..dc65ac8 100644
--- a/docs/api.html
+++ b/docs/api.html
@@ -15,13 +15,25 @@
  limitations under the License.
 -->
 
-Apache Kafka includes new java clients (in the org.apache.kafka.clients 
package). These are meant to supplant the older Scala clients, but for 
compatibility they will co-exist for some time. These clients are available in 
a separate jar with minimal dependencies, while the old Scala clients remain 
packaged with the server.
+Kafka includes four core APIs:
+<ol>
+  <li>The <a href="#producerapi">Producer</a> API allows applications to send streams of data to topics in the Kafka cluster.
+  <li>The <a href="#consumerapi">Consumer</a> API allows applications to read streams of data from topics in the Kafka cluster.
+  <li>The <a href="#streamsapi">Streams</a> API allows transforming streams of data from input topics to output topics.
+  <li>The <a href="#connectapi">Connect</a> API allows implementing connectors that continually pull from some source system or application into Kafka or push from Kafka into some sink system or application.
+</ol>
+
+Kafka exposes all its functionality over a language-independent protocol which has clients available in many programming languages. However, only the Java clients are maintained as part of the main Kafka project; the others are available as independent open source projects. A list of non-Java clients is available <a href="https://cwiki.apache.org/confluence/display/KAFKA/Clients">here</a>.
 
 <h3><a id="producerapi" href="#producerapi">2.1 Producer API</a></h3>
 
-We recommend the new Java producer for all new development. The old Scala 
producers have been deprecated and will be removed in a future major release.
-The new Java producer is production tested and generally faster and more fully 
featured than the previous Scala clients. You can use this client by adding a 
dependency on
-the client jar using the following example maven co-ordinates (you can change 
the version numbers with new releases):
+The Producer API allows applications to send streams of data to topics in the Kafka cluster.
+<p>
+Examples showing how to use the producer are given in the
+<a href="http://kafka.apache.org/0100/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html" title="Kafka 0.10.0 Javadoc">javadocs</a>.
+<p>
+To use the producer, add the following maven dependency:
+
 <pre>
        &lt;dependency&gt;
            &lt;groupId&gt;org.apache.kafka&lt;/groupId&gt;
@@ -30,26 +42,14 @@ the client jar using the following example maven 
co-ordinates (you can change th
        &lt;/dependency&gt;
 </pre>
 
-Examples showing how to use the producer are given in the
-<a 
href="http://kafka.apache.org/0100/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html";
 title="Kafka 0.10.0 Javadoc">javadocs</a>.
-
-<p>
-For those interested in the legacy Scala producer api, information can be 
found <a href="http://kafka.apache.org/081/documentation.html#producerapi";>
-here</a>.
-</p>
-
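A minimal producer built on this client might look like the following sketch (the broker address and the topic name "example-topic" are illustrative):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // illustrative broker address
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

            Producer<String, String> producer = new KafkaProducer<>(props);
            for (int i = 0; i < 10; i++) {
                // each record carries an optional key and a value
                producer.send(new ProducerRecord<>("example-topic",
                    Integer.toString(i), "value-" + i));
            }
            producer.close();
        }
    }

Records sent by the same producer to the same partition are appended in the order they are sent.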
 <h3><a id="consumerapi" href="#consumerapi">2.2 Consumer API</a></h3>
 
-We recommend the new Java consumer for all new development. The new Java 
consumer replaces the high-level ZooKeeper-based consumer and
-low-level consumer APIs (also known as old Scala consumers).
-
-To ensure a smooth upgrade path, the old Scala consumers are still maintained 
(although lacking features like security) and continue to work
-with the current Kafka clusters. The current plan is to deprecate them in the 
release after 0.10.1.0 and to remove them in a future major release.
-
-In the following sections we introduce new Java consumer API and the old Scala 
consumer APIs (both high-level ConsumerConnector and low-level SimpleConsumer).
-
-<h4><a id="newconsumerapi" href="#newconsumerapi">2.2.1 New Consumer 
API</a></h4>
-This new unified consumer API removes the distinction between the 0.8 
high-level and low-level consumer APIs. You can use this client by adding a 
dependency on the client jar using the following example maven co-ordinates 
(you can change the version numbers with new releases):
+The Consumer API allows applications to read streams of data from topics in the Kafka cluster.
+<p>
+Examples showing how to use the consumer are given in the
+<a href="http://kafka.apache.org/0100/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html" title="Kafka 0.10.0 Javadoc">javadocs</a>.
+<p>
+To use the consumer, add the following maven dependency:
 <pre>
        &lt;dependency&gt;
            &lt;groupId&gt;org.apache.kafka&lt;/groupId&gt;
@@ -58,132 +58,37 @@ This new unified consumer API removes the distinction 
between the 0.8 high-level
        &lt;/dependency&gt;
 </pre>
 
-Examples showing how to use the consumer are given in the
-<a 
href="http://kafka.apache.org/0100/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html";
 title="Kafka 0.10.0 Javadoc">javadocs</a>.
+<h3><a id="streamsapi" href="#streamsapi">Streams API</a></h3>
 
-<h4><a id="highlevelconsumerapi" href="#highlevelconsumerapi">2.2.2 Old High 
Level Consumer API</a></h4>
-<pre>
-class Consumer {
-  /**
-   *  Create a ConsumerConnector
-   *
-   *  @param config  at the minimum, need to specify the groupid of the 
consumer and the zookeeper
-   *                 connection string zookeeper.connect.
-   */
-  public static kafka.javaapi.consumer.ConsumerConnector 
createJavaConsumerConnector(ConsumerConfig config);
-}
-
-/**
- *  V: type of the message
- *  K: type of the optional key associated with the message
- */
-public interface kafka.javaapi.consumer.ConsumerConnector {
-  /**
-   *  Create a list of message streams of type T for each topic.
-   *
-   *  @param topicCountMap  a map of (topic, #streams) pair
-   *  @param decoder a decoder that converts from Message to T
-   *  @return a map of (topic, list of  KafkaStream) pairs.
-   *          The number of items in the list is #streams. Each stream supports
-   *          an iterator over message/metadata pairs.
-   */
-  public &lt;K,V&gt; Map&lt;String, List&lt;KafkaStream&lt;K,V&gt;&gt;&gt;
-    createMessageStreams(Map&lt;String, Integer&gt; topicCountMap, 
Decoder&lt;K&gt; keyDecoder, Decoder&lt;V&gt; valueDecoder);
-
-  /**
-   *  Create a list of message streams of type T for each topic, using the 
default decoder.
-   */
-  public Map&lt;String, List&lt;KafkaStream&lt;byte[], byte[]&gt;&gt;&gt; 
createMessageStreams(Map&lt;String, Integer&gt; topicCountMap);
-
-  /**
-   *  Create a list of message streams for topics matching a wildcard.
-   *
-   *  @param topicFilter a TopicFilter that specifies which topics to
-   *                    subscribe to (encapsulates a whitelist or a blacklist).
-   *  @param numStreams the number of message streams to return.
-   *  @param keyDecoder a decoder that decodes the message key
-   *  @param valueDecoder a decoder that decodes the message itself
-   *  @return a list of KafkaStream. Each stream supports an
-   *          iterator over its MessageAndMetadata elements.
-   */
-  public &lt;K,V&gt; List&lt;KafkaStream&lt;K,V&gt;&gt;
-    createMessageStreamsByFilter(TopicFilter topicFilter, int numStreams, 
Decoder&lt;K&gt; keyDecoder, Decoder&lt;V&gt; valueDecoder);
-
-  /**
-   *  Create a list of message streams for topics matching a wildcard, using 
the default decoder.
-   */
-  public List&lt;KafkaStream&lt;byte[], byte[]&gt;&gt; 
createMessageStreamsByFilter(TopicFilter topicFilter, int numStreams);
-
-  /**
-   *  Create a list of message streams for topics matching a wildcard, using 
the default decoder, with one stream.
-   */
-  public List&lt;KafkaStream&lt;byte[], byte[]&gt;&gt; 
createMessageStreamsByFilter(TopicFilter topicFilter);
-
-  /**
-   *  Commit the offsets of all topic/partitions connected by this connector.
-   */
-  public void commitOffsets();
-
-  /**
-   *  Shut down the connector
-   */
-  public void shutdown();
-}
-
-</pre>
-You can follow
-<a 
href="https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example"; 
title="Kafka 0.8 consumer example">this example</a> to learn how to use the 
high level consumer api.
-<h4><a id="simpleconsumerapi" href="#simpleconsumerapi">2.2.3 Old Simple 
Consumer API</a></h4>
-<pre>
-class kafka.javaapi.consumer.SimpleConsumer {
-  /**
-   *  Fetch a set of messages from a topic.
-   *
-   *  @param request specifies the topic name, topic partition, starting byte 
offset, maximum bytes to be fetched.
-   *  @return a set of fetched messages
-   */
-  public FetchResponse fetch(kafka.javaapi.FetchRequest request);
-
-  /**
-   *  Fetch metadata for a sequence of topics.
-   *
-   *  @param request specifies the versionId, clientId, sequence of topics.
-   *  @return metadata for each topic in the request.
-   */
-  public kafka.javaapi.TopicMetadataResponse 
send(kafka.javaapi.TopicMetadataRequest request);
-
-  /**
-   *  Get a list of valid offsets (up to maxSize) before the given time.
-   *
-   *  @param request a [[kafka.javaapi.OffsetRequest]] object.
-   *  @return a [[kafka.javaapi.OffsetResponse]] object.
-   */
-  public kafka.javaapi.OffsetResponse getOffsetsBefore(OffsetRequest request);
-
-  /**
-   * Close the SimpleConsumer.
-   */
-  public void close();
-}
-</pre>
-For most applications, the new Java Consumer API is the best option and it's 
the API we intend to support going forward. However, if you need to use the 
SimpleConsumer API, the logic will be a bit more complicated and you can follow 
the example <a 
href="https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example";
 title="Kafka 0.8 SimpleConsumer example">here</a>.
-
-<h3><a id="streamsapi" href="#streamsapi">2.3 Streams API</a></h3>
-
-As of the 0.10.0 release we have added a stream processing engine to Apache 
Kafka called Kafka Streams, which is a client library that lets users implement 
their own stream processing applications for data stored in Kafka topics.
-You can use Kafka Streams from within your Java applications by adding a 
dependency on the kafka-streams jar using the following maven co-ordinates:
+The <a href="#streamsapi">Streams</a> API allows transforming streams of data 
from input topics to output topics.
+<p>
+Examples showing how to use this library are given in the
+<a 
href="http://kafka.apache.org/0100/javadoc/index.html?org/apache/kafka/streams/KafkaStreams.html";
 title="Kafka 0.10.0 Javadoc">javadocs</a>
+<p>
+Additional documentation on using the Streams API is available <a 
href="/documentation.html#streams">here</a>.
+<p>
+To use Kafka Streams you can use the following maven dependency:
 
 <pre>
        &lt;dependency&gt;
            &lt;groupId&gt;org.apache.kafka&lt;/groupId&gt;
            &lt;artifactId&gt;kafka-streams&lt;/artifactId&gt;
-           &lt;version&gt;0.10.0.1&lt;/version&gt;
+           &lt;version&gt;0.10.0.0&lt;/version&gt;
        &lt;/dependency&gt;
 </pre>
 
-Examples showing how to use this library are given in the
-<a 
href="http://kafka.apache.org/0100/javadoc/index.html?org/apache/kafka/streams/KafkaStreams.html";
 title="Kafka 0.10.0 Javadoc">javadocs</a> and <a href="streams.html" 
title="Kafka Streams Overview">kafka streams overview</a>.
+<h3><a id="connectapi" href="#connectapi">Connect API</a></h3>
+
+The Connect API allows implementing connectors that continually pull from some source data system into Kafka or push from Kafka into some sink data system.
+<p>
+Many users of Connect won't need to use this API directly, though; they can use pre-built connectors without needing to write any code. Additional information on using Connect is available <a href="/documentation.html#connect">here</a>.
+<p>
+Those who want to implement custom connectors can see the <a href="http://kafka.apache.org/0100/javadoc/index.html?org/apache/kafka/connect">javadoc</a>.
 <p>
-    Please note that Kafka Streams is a new component of Kafka, and its public 
APIs may change in future releases.
-    We use the <b>@InterfaceStability.Unstable</b> annotation to denote 
classes whose APIs may change without backward-compatibility in future releases.
-</p>
+
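For those implementing custom connectors, a rough, hypothetical skeleton of a source connector and its task might look like the following (class names, the destination topic, and the fixed source offset are illustrative; a real connector splits work across tasks in taskConfigs() and tracks real source offsets):

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import org.apache.kafka.common.config.ConfigDef;
    import org.apache.kafka.connect.connector.Task;
    import org.apache.kafka.connect.data.Schema;
    import org.apache.kafka.connect.source.SourceConnector;
    import org.apache.kafka.connect.source.SourceRecord;
    import org.apache.kafka.connect.source.SourceTask;

    public class ExampleSourceConnector extends SourceConnector {
        private Map<String, String> props;

        @Override public void start(Map<String, String> props) { this.props = props; }
        @Override public Class<? extends Task> taskClass() { return ExampleSourceTask.class; }
        @Override public List<Map<String, String>> taskConfigs(int maxTasks) {
            // one task configured like the connector; real connectors partition the work
            return Collections.singletonList(props);
        }
        @Override public void stop() { }
        @Override public ConfigDef config() { return new ConfigDef(); } // declare config keys here
        @Override public String version() { return "0.0.1"; }

        public static class ExampleSourceTask extends SourceTask {
            @Override public void start(Map<String, String> props) { }
            @Override public List<SourceRecord> poll() throws InterruptedException {
                Thread.sleep(1000); // stand-in for waiting on the source system
                SourceRecord record = new SourceRecord(
                    Collections.singletonMap("source", "example"), // source partition
                    Collections.singletonMap("position", 0L),      // source offset
                    "example-topic", Schema.STRING_SCHEMA, "hello");
                return Collections.singletonList(record);
            }
            @Override public void stop() { }
            @Override public String version() { return "0.0.1"; }
        }
    }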
+<h3><a id="legacyapis" href="#streamsapi">Legacy APIs</a></h3>
+
+<p>
+A more limited legacy producer and consumer api is also included in Kafka. 
These old Scala APIs are deprecated and only still available for compatability 
purposes. Information on them can be found here <a 
href="http://kafka.apache.org/081/documentation.html#producerapi";>
+here</a>.
+</p>
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/kafka/blob/16f85e84/docs/documentation.html
----------------------------------------------------------------------
diff --git a/docs/documentation.html b/docs/documentation.html
index 95d1251..f4f1ddc 100644
--- a/docs/documentation.html
+++ b/docs/documentation.html
@@ -31,16 +31,13 @@ Prior releases: <a href="/07/documentation.html">0.7.x</a>, 
<a href="/08/documen
              <li><a href="#upgrade">1.5 Upgrading</a>
          </ul>
     </li>
-    <li><a href="#api">2. API</a>
+    <li><a href="#api">2. APIs</a>
           <ul>
               <li><a href="#producerapi">2.1 Producer API</a>
               <li><a href="#consumerapi">2.2 Consumer API</a>
-                  <ul>
-                      <li><a href="#newconsumerapi">2.2.1 New Consumer API</a>
-                      <li><a href="#highlevelconsumerapi">2.2.2 Old High Level 
Consumer API</a>
-                      <li><a href="#simpleconsumerapi">2.2.3 Old Simple 
Consumer API</a>
-                  </ul>
               <li><a href="#streamsapi">2.3 Streams API</a>
+                         <li><a href="#connectapi">2.4 Connect API</a>
+                         <li><a href="#legacyapis">2.5 Legacy APIs</a>
           </ul>
     </li>
     <li><a href="#configuration">3. Configuration</a>
@@ -110,11 +107,6 @@ Prior releases: <a 
href="/07/documentation.html">0.7.x</a>, <a href="/08/documen
                     <li><a href="#ext4">Ext4 Notes</a>
                 </ul>
               <li><a href="#monitoring">6.6 Monitoring</a>
-                <ul>
-                    <li><a href="#selector_monitoring">Common monitoring 
metrics for producer/consumer/connect</a></li>
-                    <li><a href="#new_producer_monitoring">New producer 
monitoring</a></li>
-                    <li><a href="#new_consumer_monitoring">New consumer 
monitoring</a></li>
-                </ul>
               <li><a href="#zk">6.7 ZooKeeper</a>
                 <ul>
                     <li><a href="#zkversion">Stable Version</a>
@@ -158,13 +150,18 @@ Prior releases: <a 
href="/07/documentation.html">0.7.x</a>, <a href="/08/documen
 </ul>
 
 <h2><a id="gettingStarted" href="#gettingStarted">1. Getting Started</a></h2>
-<!--#include virtual="introduction.html" -->
-<!--#include virtual="uses.html" -->
-<!--#include virtual="quickstart.html" -->
-<!--#include virtual="ecosystem.html" -->
-<!--#include virtual="upgrade.html" -->
-
-<h2><a id="api" href="#api">2. API</a></h2>
+  <h3><a id="introduction" href="#introduction">1.1 Introduction</a></h3>
+  <!--#include virtual="introduction.html" -->
+  <h3><a id="uses" href="#uses">1.2 Use Cases</a></h3>
+  <!--#include virtual="uses.html" -->
+  <h3><a id="quickstart" href="#quickstart">1.3 Quick Start</a></h3>
+  <!--#include virtual="quickstart.html" -->
+  <h3><a id="ecosystem" href="#ecosystem">1.4 Ecosystem</a></h3>
+  <!--#include virtual="ecosystem.html" -->
+  <h3><a id="upgrade" href="#upgrade">1.5 Upgrading From Previous 
Versions</a></h3>
+  <!--#include virtual="upgrade.html" -->
+
+<h2><a id="api" href="#api">2. APIs</a></h2>
 
 <!--#include virtual="api.html" -->
 

http://git-wip-us.apache.org/repos/asf/kafka/blob/16f85e84/docs/ecosystem.html
----------------------------------------------------------------------
diff --git a/docs/ecosystem.html b/docs/ecosystem.html
index 73d5706..5fbcec5 100644
--- a/docs/ecosystem.html
+++ b/docs/ecosystem.html
@@ -15,6 +15,4 @@
  limitations under the License.
 -->
 
-<h3><a id="ecosystem" href="#ecosystem">1.4 Ecosystem</a></h3>
-
 There are a plethora of tools that integrate with Kafka outside the main 
distribution. The <a 
href="https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem";> ecosystem 
page</a> lists many of these, including stream processing systems, Hadoop 
integration, monitoring, and deployment tools.

http://git-wip-us.apache.org/repos/asf/kafka/blob/16f85e84/docs/introduction.html
----------------------------------------------------------------------
diff --git a/docs/introduction.html b/docs/introduction.html
index 3fbf7a3..147210f 100644
--- a/docs/introduction.html
+++ b/docs/introduction.html
@@ -14,43 +14,65 @@
  See the License for the specific language governing permissions and
  limitations under the License.
 -->
-
-<h3><a id="introduction" href="#introduction">1.1 Introduction</a></h3>
-Kafka is a distributed, partitioned, replicated commit log service. It 
provides the functionality of a messaging system, but with a unique design.
+Kafka is <i>a distributed streaming platform</i>. What exactly does that mean?
+<p>
+We think of a streaming platform as having three key capabilities:
+<ol>
+  <li>It lets you publish and subscribe to streams of records. In this respect it is similar to a message queue or enterprise messaging system.
+  <li>It lets you store streams of records in a fault-tolerant way.
+  <li>It lets you process streams of records as they occur.
+</ol>
+<p>
+What is Kafka good for?
+<p>
+It gets used for two broad classes of application:
+<ol>
+  <li>Building real-time streaming data pipelines that reliably get data 
between systems or applications
+  <li>Building real-time streaming applications that transform or react to the 
streams of data
+</ol>
 <p>
-What does all that mean?
+To understand how Kafka does these things, let's dive in and explore Kafka's 
capabilities from the bottom up.
 <p>
-First let's review some basic messaging terminology:
+First a few concepts:
 <ul>
-    <li>Kafka maintains feeds of messages in categories called <i>topics</i>.
-    <li>We'll call processes that publish messages to a Kafka topic 
<i>producers</i>.
-    <li>We'll call processes that subscribe to topics and process the feed of 
published messages <i>consumers</i>.
-    <li>Kafka is run as a cluster comprised of one or more servers each of 
which is called a <i>broker</i>.
+       <li>Kafka is run as a cluster on one or more servers.
+    <li>The Kafka cluster stores streams of <i>records</i> in categories 
called <i>topics</i>.
+       <li>Each record consists of a key, a value, and a timestamp.
 </ul>
-
-So, at a high level, producers send messages over the network to the Kafka 
cluster which in turn serves them up to consumers like this:
-<div style="text-align: center; width: 100%">
-  <img src="images/producer_consumer.png">
+Kafka has four core APIs:
+<div style="float: right">
+  <img src="images/kafka-apis.png" style="width:400px">
 </div>
-
-Communication between the clients and the servers is done with a simple, 
high-performance, language agnostic <a 
href="https://kafka.apache.org/protocol.html";>TCP protocol</a>. We provide a 
Java client for Kafka, but clients are available in <a 
href="https://cwiki.apache.org/confluence/display/KAFKA/Clients";>many 
languages</a>.
+<ul>
+    <li>The <a href="/documentation.html#producerapi">Producer API</a> allows an application to publish a stream of records to one or more Kafka topics.
+    <li>The <a href="/documentation.html#consumerapi">Consumer API</a> allows an application to subscribe to one or more topics and process the stream of records produced to them.
+    <li>The <a href="/documentation.html#streams">Streams API</a> allows an application to act as a <i>stream processor</i>, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
+    <li>The <a href="/documentation.html#connect">Connector API</a> allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
+</ul>
+<p>
+In Kafka the communication between the clients and the servers is done with a simple, high-performance, language-agnostic <a href="https://kafka.apache.org/protocol.html">TCP protocol</a>. This protocol is versioned and maintains backwards compatibility with older versions. We provide a Java client for Kafka, but clients are available in <a href="https://cwiki.apache.org/confluence/display/KAFKA/Clients">many languages</a>.
 
 <h4><a id="intro_topics" href="#intro_topics">Topics and Logs</a></h4>
-Let's first dive into the high-level abstraction Kafka provides&mdash;the 
topic.
+Let's first dive into the core abstraction Kafka provides for a stream of 
records&mdash;the topic.
 <p>
-A topic is a category or feed name to which messages are published. For each 
topic, the Kafka cluster maintains a partitioned log that looks like this:
+A topic is a category or feed name to which records are published. Topics in 
Kafka are always multi-subscriber; that is, a topic can have zero, one, or many 
consumers that subscribe to the data written to it.
+<p>
+For each topic, the Kafka cluster maintains a partitioned log that looks like 
this:
 <div style="text-align: center; width: 100%">
   <img src="images/log_anatomy.png">
 </div>
-Each partition is an ordered, immutable sequence of messages that is 
continually appended to&mdash;a commit log. The messages in the partitions are 
each assigned a sequential id number called the <i>offset</i> that uniquely 
identifies each message within the partition.
+Each partition is an ordered, immutable sequence of records that is 
continually appended to&mdash;a structured commit log. The records in the 
partitions are each assigned a sequential id number called the <i>offset</i> 
that uniquely identifies each record within the partition.
 <p>
-The Kafka cluster retains all published messages&mdash;whether or not they 
have been consumed&mdash;for a configurable period of time. For example if the 
log retention is set to two days, then for the two days after a message is 
published it is available for consumption, after which it will be discarded to 
free up space. Kafka's performance is effectively constant with respect to data 
size so retaining lots of data is not a problem.
+The Kafka cluster retains all published records&mdash;whether or not they have 
been consumed&mdash;using a configurable retention period. For example if the 
retention policy is set to two days, then for the two days after a record is 
published, it is available for consumption, after which it will be discarded to 
free up space. Kafka's performance is effectively constant with respect to data 
size so storing data for a long time is not a problem.
 <p>
-In fact the only metadata retained on a per-consumer basis is the position of 
the consumer in the log, called the "offset". This offset is controlled by the 
consumer: normally a consumer will advance its offset linearly as it reads 
messages, but in fact the position is controlled by the consumer and it can 
consume messages in any order it likes. For example a consumer can reset to an 
older offset to reprocess.
+<div style="float:right">
+  <img src="images/log_consumer.png" style="width:400px">
+</div>
+In fact, the only metadata retained on a per-consumer basis is the offset or 
position of that consumer in the log. This offset is controlled by the 
consumer: normally a consumer will advance its offset linearly as it reads 
records, but, in fact, since the position is controlled by the consumer it can 
consume records in any order it likes. For example a consumer can reset to an 
older offset to reprocess data from the past or skip ahead to the most recent 
record and start consuming from "now".
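As an illustrative sketch (topic, partition, and broker address are assumptions), a consumer that rewinds to the oldest retained offset to reprocess a partition might look like:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class ReplaySketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // illustrative broker address
            props.put("group.id", "replay-example");
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            TopicPartition partition = new TopicPartition("example-topic", 0);
            consumer.assign(Collections.singletonList(partition));          // manual assignment
            consumer.seekToBeginning(Collections.singletonList(partition)); // rewind the offset
            ConsumerRecords<String, String> records = consumer.poll(1000);
            System.out.println("replayed " + records.count() + " records");
            consumer.close();
        }
    }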
 <p>
 This combination of features means that Kafka consumers are very 
cheap&mdash;they can come and go without much impact on the cluster or on other 
consumers. For example, you can use our command line tools to "tail" the 
contents of any topic without changing what is consumed by any existing 
consumers.
 <p>
-The partitions in the log serve several purposes. First, they allow the log to 
scale beyond a size that will fit on a single server. Each individual partition 
must fit on the server that hosts it, but a topic may have many partitions so 
it can handle an arbitrary amount of data. Second they act as the unit of 
parallelism&mdash;more on that in a bit.
+The partitions in the log serve several purposes. First, they allow the log to 
scale beyond a size that will fit on a single server. Each individual partition 
must fit on the servers that host it, but a topic may have many partitions so 
it can handle an arbitrary amount of data. Second they act as the unit of 
parallelism&mdash;more on that in a bit.
 
 <h4><a id="intro_distribution" href="#intro_distribution">Distribution</a></h4>
 
@@ -60,40 +82,88 @@ Each partition has one server which acts as the "leader" 
and zero or more server
 
 <h4><a id="intro_producers" href="#intro_producers">Producers</a></h4>
 
-Producers publish data to the topics of their choice. The producer is 
responsible for choosing which message to assign to which partition within the 
topic. This can be done in a round-robin fashion simply to balance load or it 
can be done according to some semantic partition function (say based on some 
key in the message). More on the use of partitioning in a second.
+Producers publish data to the topics of their choice. The producer is 
responsible for choosing which record to assign to which partition within the 
topic. This can be done in a round-robin fashion simply to balance load or it 
can be done according to some semantic partition function (say based on some 
key in the record). More on the use of partitioning in a second!
 
 <h4><a id="intro_consumers" href="#intro_consumers">Consumers</a></h4>
 
-Messaging traditionally has two models: <a 
href="http://en.wikipedia.org/wiki/Message_queue";>queuing</a> and <a 
href="http://en.wikipedia.org/wiki/Publish%E2%80%93subscribe_pattern";>publish-subscribe</a>.
 In a queue, a pool of consumers may read from a server and each message goes 
to one of them; in publish-subscribe the message is broadcast to all consumers. 
Kafka offers a single consumer abstraction that generalizes both of 
these&mdash;the <i>consumer group</i>.
-<p>
-Consumers label themselves with a consumer group name, and each message 
published to a topic is delivered to one consumer instance within each 
subscribing consumer group. Consumer instances can be in separate processes or 
on separate machines.
+Consumers label themselves with a <i>consumer group</i> name, and each record 
published to a topic is delivered to one consumer instance within each 
subscribing consumer group. Consumer instances can be in separate processes or 
on separate machines.
 <p>
-If all the consumer instances have the same consumer group, then this works 
just like a traditional queue balancing load over the consumers.
+If all the consumer instances have the same consumer group, then the records 
will effectively be load balanced over the consumer instances.
 <p>
-If all the consumer instances have different consumer groups, then this works 
like publish-subscribe and all messages are broadcast to all consumers.
-<p>
-More commonly, however, we have found that topics have a small number of 
consumer groups, one for each "logical subscriber". Each group is composed of 
many consumer instances for scalability and fault tolerance. This is nothing 
more than publish-subscribe semantics where the subscriber is a cluster of 
consumers instead of a single process.
-<p>
-
+If all the consumer instances have different consumer groups, then each record 
will be broadcast to all the consumer processes.
 <div style="float: right; margin: 20px; width: 500px" class="caption">
   <img src="images/consumer-groups.png"><br>
   A two server Kafka cluster hosting four partitions (P0-P3) with two consumer 
groups. Consumer group A has two consumer instances and group B has four.
 </div>
 <p>
-Kafka has stronger ordering guarantees than a traditional messaging system, 
too.
+More commonly, however, we have found that topics have a small number of 
consumer groups, one for each "logical subscriber". Each group is composed of 
many consumer instances for scalability and fault tolerance. This is nothing 
more than publish-subscribe semantics where the subscriber is a cluster of 
consumers instead of a single process.
 <p>
-A traditional queue retains messages in-order on the server, and if multiple 
consumers consume from the queue then the server hands out messages in the 
order they are stored. However, although the server hands out messages in 
order, the messages are delivered asynchronously to consumers, so they may 
arrive out of order on different consumers. This effectively means the ordering 
of the messages is lost in the presence of parallel consumption. Messaging 
systems often work around this by having a notion of "exclusive consumer" that 
allows only one process to consume from a queue, but of course this means that 
there is no parallelism in processing.
+The way consumption is implemented in Kafka is by dividing up the partitions 
in the log over the consumer instances so that each instance is the exclusive 
consumer of a "fair share" of partitions at any point in time. This process of 
maintaining membership in the group is handled by the Kafka protocol 
dynamically. If new instances join the group they will take over some 
partitions from other members of the group; if an instance dies, its partitions 
will be distributed to the remaining instances.
 <p>
-Kafka does it better. By having a notion of parallelism&mdash;the 
partition&mdash;within the topics, Kafka is able to provide both ordering 
guarantees and load balancing over a pool of consumer processes. This is 
achieved by assigning the partitions in the topic to the consumers in the 
consumer group so that each partition is consumed by exactly one consumer in 
the group. By doing this we ensure that the consumer is the only reader of that 
partition and consumes the data in order. Since there are many partitions this 
still balances the load over many consumer instances. Note however that there 
cannot be more consumer instances in a consumer group than partitions.
-<p>
-Kafka only provides a total order over messages <i>within</i> a partition, not 
between different partitions in a topic. Per-partition ordering combined with 
the ability to partition data by key is sufficient for most applications. 
However, if you require a total order over messages this can be achieved with a 
topic that has only one partition, though this will mean only one consumer 
process per consumer group.
+Kafka only provides a total order over records <i>within</i> a partition, not 
between different partitions in a topic. Per-partition ordering combined with 
the ability to partition data by key is sufficient for most applications. 
However, if you require a total order over records this can be achieved with a 
topic that has only one partition, though this will mean only one consumer 
process per consumer group.
 
 <h4><a id="intro_guarantees" href="#intro_guarantees">Guarantees</a></h4>
 
 At a high-level Kafka gives the following guarantees:
 <ul>
-  <li>Messages sent by a producer to a particular topic partition will be 
appended in the order they are sent. That is, if a message M1 is sent by the 
same producer as a message M2, and M1 is sent first, then M1 will have a lower 
offset than M2 and appear earlier in the log.
-  <li>A consumer instance sees messages in the order they are stored in the 
log.
-  <li>For a topic with replication factor N, we will tolerate up to N-1 server 
failures without losing any messages committed to the log.
+  <li>Records sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
+  <li>A consumer instance sees records in the order they are stored in the log.
+  <li>For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.
 </ul>
 More details on these guarantees are given in the design section of the 
documentation.
+
+<h4><a id="kafka_mq" href="#kafka_mq">Kafka as a Messaging System</a></h4>
+
+How does Kafka's notion of streams compare to a traditional enterprise 
messaging system?
+<p>
+Messaging traditionally has two models: <a href="http://en.wikipedia.org/wiki/Message_queue">queuing</a> and <a href="http://en.wikipedia.org/wiki/Publish%E2%80%93subscribe_pattern">publish-subscribe</a>. In a queue, a pool of consumers may read from a server and each record goes to one of them; in publish-subscribe the record is broadcast to all consumers. Each of these two models has a strength and a weakness. The strength of queuing is that it allows you to divide up the processing of data over multiple consumer instances, which lets you scale your processing. Unfortunately, queues aren't multi-subscriber&mdash;once one process reads the data it's gone. Publish-subscribe allows you to broadcast data to multiple processes, but has no way of scaling processing since every message goes to every subscriber.
+<p>
+The consumer group concept in Kafka generalizes these two concepts. As with a 
queue the consumer group allows you to divide up processing over a collection 
of processes (the members of the consumer group). As with publish-subscribe, 
Kafka allows you to broadcast messages to multiple consumer groups.
+<p>
+The advantage of Kafka's model is that every topic has both these 
properties&mdash;it can scale processing and is also 
multi-subscriber&mdash;there is no need to choose one or the other.
+<p>
+Kafka has stronger ordering guarantees than a traditional messaging system, 
too.
+<p>
+A traditional queue retains records in-order on the server, and if multiple 
consumers consume from the queue then the server hands out records in the order 
they are stored. However, although the server hands out records in order, the 
records are delivered asynchronously to consumers, so they may arrive out of 
order on different consumers. This effectively means the ordering of the 
records is lost in the presence of parallel consumption. Messaging systems 
often work around this by having a notion of "exclusive consumer" that allows 
only one process to consume from a queue, but of course this means that there 
is no parallelism in processing.
+<p>
+Kafka does it better. By having a notion of parallelism&mdash;the 
partition&mdash;within the topics, Kafka is able to provide both ordering 
guarantees and load balancing over a pool of consumer processes. This is 
achieved by assigning the partitions in the topic to the consumers in the 
consumer group so that each partition is consumed by exactly one consumer in 
the group. By doing this we ensure that the consumer is the only reader of that 
partition and consumes the data in order. Since there are many partitions this 
still balances the load over many consumer instances. Note however that there 
cannot be more consumer instances in a consumer group than partitions.
+
+<h4>Kafka as a Storage System</h4>
+
+Any message queue that allows publishing messages decoupled from consuming 
them is effectively acting as a storage system for the in-flight messages. What 
is different about Kafka is that it is a very good storage system.
+<p>
+Data written to Kafka is written to disk and replicated for fault-tolerance. 
Kafka allows producers to wait on acknowledgement so that a write isn't 
considered complete until it is fully replicated and guaranteed to persist even 
if the server written to fails.
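As a small sketch of waiting on acknowledgement (broker address and topic are assumptions), a producer can request acks=all and block on the returned future until the write is fully acknowledged:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class AckSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // illustrative broker address
            props.put("acks", "all"); // leader waits for the in-sync replicas before acknowledging
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

            KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            // send() returns a Future; get() blocks until the broker acknowledges the write
            RecordMetadata metadata = producer.send(
                new ProducerRecord<>("example-topic", "key", "value")).get();
            System.out.println("stored at offset " + metadata.offset());
            producer.close();
        }
    }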
+<p>
+The disk structures Kafka uses scale well&mdash;Kafka will perform the same 
whether you have 50 KB or 50 TB of persistent data on the server.
+<p>
+As a result of taking storage seriously and allowing the clients to control 
their read position, you can think of Kafka as a kind of special purpose 
distributed filesystem dedicated to high-performance, low-latency commit log 
storage, replication, and propagation.
+
+<h4>Kafka for Stream Processing</h4>
+<p>
+It isn't enough to just read, write, and store streams of data; the purpose is to enable real-time processing of streams.
+<p>
+In Kafka a stream processor is anything that takes continual streams of data from input topics, performs some processing on this input, and produces continual streams of data to output topics.
+<p>
+For example a retail application might take in input streams of sales and 
shipments, and output a stream of reorders and price adjustments computed off 
this data.
+<p>
+It is possible to do simple processing directly using the producer and consumer APIs. However, for more complex transformations Kafka provides a fully integrated <a href="/streams.html">Streams API</a>. This allows building applications that do non-trivial processing, such as computing aggregations off of streams or joining streams together.
+<p>
+This facility helps solve the hard problems this type of application faces: 
handling out-of-order data, reprocessing input as code changes, performing 
stateful computations, etc.
+<p>
+The streams API builds on the core primitives Kafka provides: it uses the 
producer and consumer APIs for input, uses Kafka for stateful storage, and uses 
the same group mechanism for fault tolerance among the stream processor 
instances.
+
+<h4>Putting the Pieces Together</h4>
+
+This combination of messaging, storage, and stream processing may seem unusual 
but it is essential to Kafka's role as a streaming platform.
+<p>
+A distributed file system like HDFS allows storing static files for batch 
processing. Effectively a system like this allows storing and processing 
<i>historical</i> data from the past.
+<p>
+A traditional enterprise messaging system allows processing future messages 
that will arrive after you subscribe. Applications built in this way process 
future data as it arrives.
+<p>
+Kafka combines both of these capabilities, and the combination is critical 
both for Kafka usage as a platform for streaming applications as well as for 
streaming data pipelines.
+<p>
+By combining storage and low-latency subscriptions, streaming applications can treat both past and future data the same way. That is, a single application can process historical, stored data, but rather than ending when it reaches the last record, it can keep processing as future data arrives. This is a generalized notion of stream processing that subsumes batch processing as well as message-driven applications.
+<p>
+Likewise for streaming data pipelines the combination of subscription to real-time events makes it possible to use Kafka for very low-latency pipelines; but the ability to store data reliably makes it possible to use it for critical data where the delivery of data must be guaranteed or for integration with offline systems that load data only periodically or may go down for extended periods of time for maintenance. The stream processing facilities make it possible to transform data as it arrives.
+<p>
+For more information on the guarantees, APIs, and capabilities Kafka provides, see the rest of the <a href="/documentation.html">documentation</a>.

http://git-wip-us.apache.org/repos/asf/kafka/blob/16f85e84/docs/uses.html
----------------------------------------------------------------------
diff --git a/docs/uses.html b/docs/uses.html
index 5b97272..6214ee6 100644
--- a/docs/uses.html
+++ b/docs/uses.html
@@ -15,8 +15,6 @@
  limitations under the License.
 -->
 
-<h3><a id="uses" href="#uses">1.2 Use Cases</a></h3>
-
 Here is a description of a few of the popular use cases for Apache Kafka. For 
an overview of a number of these areas in action, see <a 
href="http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying";>this
 blog post</a>.
 
 <h4><a id="uses_messaging" href="#uses_messaging">Messaging</a></h4>
