Author: junrao
Date: Mon Feb 24 18:10:16 2014
New Revision: 1571376
URL: http://svn.apache.org/r1571376
Log:
KAFKA-1269; Minor typos in documentation; patched by Evan Zacks; reviewed by
Jun Rao
Modified:
kafka/site/08/configuration.html
kafka/site/08/design.html
kafka/site/08/introduction.html
kafka/site/08/ops.html
Modified: kafka/site/08/configuration.html
URL:
http://svn.apache.org/viewvc/kafka/site/08/configuration.html?rev=1571376&r1=1571375&r2=1571376&view=diff
==============================================================================
--- kafka/site/08/configuration.html (original)
+++ kafka/site/08/configuration.html Mon Feb 24 18:10:16 2014
@@ -19,7 +19,7 @@ The essential configurations are the fol
<tr>
<td>broker.id</td>
<td></td>
- <td>Each broker is uniquely identified by a non-negative integer id.
This id serves as the brokers "name", and allows the broker to be moved to a
different host/port without confusing consumers. You can choose any number you
like so long as it is unique.
+ <td>Each broker is uniquely identified by a non-negative integer id.
This id serves as the brokers "name" and allows the broker to be moved to a
different host/port without confusing consumers. You can choose any number you
like so long as it is unique.
</td>
</tr>
<tr>
@@ -42,7 +42,7 @@ Zookeeper also allows you to add a "chro
<tr>
<td>message.max.bytes</td>
<td>1000000</td>
- <td>The maximum size of a message that the server can receive. It is
important that this property be in sync with the maximum fetch size your
consumers use or else an unruly consumer will be able to publish messages too
large for consumers to consume.</td>
+ <td>The maximum size of a message that the server can receive. It is
important that this property be in sync with the maximum fetch size your
consumers use or else an unruly producer will be able to publish messages too
large for consumers to consume.</td>
</tr>
<tr>
<td>num.network.threads</td>
@@ -94,7 +94,7 @@ Zookeeper also allows you to add a "chro
<tr>
<td>log.segment.bytes.per.topic</td>
<td>""</td>
- <td>This setting allows overriding log.segment.bytes on a per-topic
basis</td>
+ <td>This setting allows overriding log.segment.bytes on a per-topic
basis.</td>
</tr>
<tr>
<td>log.roll.hours</td>
@@ -119,7 +119,7 @@ Zookeeper also allows you to add a "chro
<tr>
<td>log.retention.bytes</td>
<td>-1</td>
- <td>The amount of data to retain in the log for each topic-partitions.
Note that this is the limit per-partition so multiple by the number of
partitions to get the total data retained for the topic. Also note that if both
log.retention.hours and log.retention.bytes are both set we delete a segment
when either limit is exceeded.</td>
+ <td>The amount of data to retain in the log for each topic-partitions.
Note that this is the limit per-partition so multiply by the number of
partitions to get the total data retained for the topic. Also note that if both
log.retention.hours and log.retention.bytes are both set we delete a segment
when either limit is exceeded.</td>
</tr>
<tr>
<td>log.retention.bytes.per.topic</td>
@@ -185,7 +185,7 @@ Zookeeper also allows you to add a "chro
<tr>
<td>replica.lag.time.max.ms</td>
<td>10000</td>
- <td>If a follower hasn't sent any fetch requests for this window of
time, the leader will remove the follower from ISR and treat it as dead.</td>
+ <td>If a follower hasn't sent any fetch requests for this window of
time, the leader will remove the follower from ISR (in-sync replicas) and treat
it as dead.</td>
</tr>
<tr>
<td>replica.lag.max.messages</td>
@@ -247,12 +247,12 @@ Zookeeper also allows you to add a "chro
<tr>
<td>zookeeper.connection.timeout.ms</td>
<td>6000</td>
- <td>The max time that the client waits to establish a connection to
zookeeper.</td>
+ <td>The maximum amount of time that the client waits to establish a
connection to zookeeper.</td>
</tr>
<tr>
<td>zookeeper.sync.time.ms</td>
<td>2000</td>
- <td>How far a ZK follower can be behind a ZK leader</td>
+ <td>How far a ZK follower can be behind a ZK leader.</td>
</tr>
<tr>
<td>controlled.shutdown.enable</td>
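The properties discussed in the hunks above are set in the broker's server.properties file. A minimal sketch of how a few of them might look (the values are the defaults quoted on this page, shown only for illustration, not as recommendations):

    # Unique non-negative id for this broker; any value works as long as no other broker shares it.
    broker.id=0
    # Largest message the broker will accept; keep consumer fetch sizes at least this large.
    message.max.bytes=1000000
    # Per-partition retention limit; multiply by the partition count for the per-topic total (-1 = no size limit).
    log.retention.bytes=-1
    # Maximum time to wait when establishing the ZooKeeper connection.
    zookeeper.connection.timeout.ms=6000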
Modified: kafka/site/08/design.html
URL:
http://svn.apache.org/viewvc/kafka/site/08/design.html?rev=1571376&r1=1571375&r2=1571376&view=diff
==============================================================================
--- kafka/site/08/design.html (original)
+++ kafka/site/08/design.html Mon Feb 24 18:10:16 2014
@@ -153,14 +153,14 @@ These are not the strongest possible sem
<p>
Not all use cases require such strong guarantees. For uses which are latency
sensitive we allow the producer to specify the durability level it desires. If
the producer specifies that it wants to wait on the message being committed
this can take on the order of 10 ms. However the producer can also specify that
it wants to perform the send completely asynchronously or that it wants to wait
only until the leader (but not necessarily the followers) have the message.
<p>
-Now let's describe the semantics from the point-of-view of the consumer. All
replicas have the exact same log with the same offsets. The consumer controls
it's position in this log. If the consumer never crashed it could just store
this position in memory, but if the producer fails and we want this topic
partition to be taken over by another process the new process will need to
choose an appropriate position from which to start processing. Let's say the
consumer reads some messages it has several options for processing the messages
and updating its position.
+Now let's describe the semantics from the point-of-view of the consumer. All
replicas have the exact same log with the same offsets. The consumer controls
its position in this log. If the consumer never crashed it could just store
this position in memory, but if the producer fails and we want this topic
partition to be taken over by another process the new process will need to
choose an appropriate position from which to start processing. Let's say the
consumer reads some messages -- it has several options for processing the
messages and updating its position.
<ol>
<li>It can read the messages, then save its position in the log, and finally
process the messages. In this case there is a possibility that the consumer
process crashes after saving its position but before saving the output of its
message processing. In this case the process that took over processing would
start at the saved position even though a few messages prior to that position
had not been processed. This corresponds to "at-most-once" semantics as in the
case of a consumer failure messages may not be processed.
<li>It can read the messages, process the messages, and finally save its
position. In this case there is a possibility that the consumer process crashes
after processing messages but before saving its position. In this case when the
new process takes over the first few messages it receives will already have
been processed. This corresponds to the "at-least-once" semantics in the case
of consumer failure. In many cases messages have a primary key and so the
updates are idempotent (receiving the same message twice just overwrites a
record with another copy of itself).
- <li>So what about exactly once semantics (i.e. the thing you actually want)?
The limitation here is not actually a feature of the messaging system but
rather the need to co-ordinate the consumers position with what is actually
stored as output. The classic way of achieving this would be to introduce a
two-phase commit between the storage for the consumer position and the storage
of the consumers output. But this can be handled more simply and generally by
simply letting the consumer store its offset in the same place as its output.
This is better because many of the output systems a consumer might want to
write to will not support a two-phase commit. As example of this our Hadoop ETL
that populates data in HDFS stores its offsets in HDFS with the data it reads
so that it is guaranteed that either data and offsets are both updated or
neither is. We follow similar patterns for many other data systems which
require these stronger semantics and for which the messages do not have a
primary key to allow for deduplication.
+ <li>So what about exactly once semantics (i.e. the thing you actually want)?
The limitation here is not actually a feature of the messaging system but
rather the need to co-ordinate the consumer's position with what is actually
stored as output. The classic way of achieving this would be to introduce a
two-phase commit between the storage for the consumer position and the storage
of the consumers output. But this can be handled more simply and generally by
simply letting the consumer store its offset in the same place as its output.
This is better because many of the output systems a consumer might want to
write to will not support a two-phase commit. As an example of this, our Hadoop
ETL that populates data in HDFS stores its offsets in HDFS with the data it
reads so that it is guaranteed that either data and offsets are both updated or
neither is. We follow similar patterns for many other data systems which
require these stronger semantics and for which the messages do not have
a primary key to allow for deduplication.
</ol>
<p>
-So effectively Kafka guarantees at-least-once delivery by default and allows
the user to implement at most once delivery by disabling retries on the
producer and committing its offset prior to processing a batch of messages.
Exactly-once delivery requires co-operation with the destination storage system
but Kafka gives the offset which makes implementing this straight-forward.
+So effectively Kafka guarantees at-least-once delivery by default and allows
the user to implement at most once delivery by disabling retries on the
producer and committing its offset prior to processing a batch of messages.
Exactly-once delivery requires co-operation with the destination storage system
but Kafka provides the offset which makes implementing this straight-forward.
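The difference between options 1 and 2 in the list above is only the order of "save position" and "process". A minimal sketch of the two orderings, using hypothetical readBatch/process/saveOffset helpers (placeholders, not Kafka client APIs):

    import java.util.Collections;
    import java.util.List;

    public class DeliverySemanticsSketch {
        // At-most-once: commit the new position first, then process.
        // A crash between the two steps skips the unprocessed messages.
        static void atMostOnce(long offset) {
            List<String> batch = readBatch(offset);
            saveOffset(offset + batch.size());
            for (String m : batch) process(m);
        }

        // At-least-once: process first, then commit the new position.
        // A crash between the two steps causes the batch to be reprocessed.
        static void atLeastOnce(long offset) {
            List<String> batch = readBatch(offset);
            for (String m : batch) process(m);
            saveOffset(offset + batch.size());
        }

        // Placeholders only: a real consumer would read from Kafka and, for
        // exactly-once, write its offset to the same store as its output.
        static List<String> readBatch(long offset) { return Collections.emptyList(); }
        static void process(String m) {}
        static void saveOffset(long offset) {}
    }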
<h3><a id="replication">4.7 Replication</a></h3>
<p>
@@ -189,7 +189,7 @@ Kafka will remain available in the prese
<h4>Replicated Logs: Quorums, ISRs, and State Machines (Oh my!)</h4>
-At it's heart a Kafka partition is a replicated log. The replicated log is one
of the most basic primitives in distributed data systems, and there are many
approaches for implementing one. A replicated log can be used by other systems
as a primitive for implementing other distributed systems in the <a
href="http://en.wikipedia.org/wiki/State_machine_replication">state-machine
style</a>.
+At its heart a Kafka partition is a replicated log. The replicated log is one
of the most basic primitives in distributed data systems, and there are many
approaches for implementing one. A replicated log can be used by other systems
as a primitive for implementing other distributed systems in the <a
href="http://en.wikipedia.org/wiki/State_machine_replication">state-machine
style</a>.
<p>
A replicated log models the process of coming into consensus on the order of a
series of values (generally numbering the log entries 0, 1, 2, ...). There are
many ways to implement this, but the simplest and fastest is with a leader who
chooses the ordering of values provided to it. As long as the leader remains
alive, all followers need to only copy the values and ordering, the leader
chooses.
<p>
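As a toy illustration of the leader-based approach described in this hunk (not Kafka's implementation): the leader alone picks the order by assigning the next offset to each value, and followers only copy that order.

    import java.util.ArrayList;
    import java.util.List;

    class ReplicatedLogToy {
        static class Leader {
            final List<String> log = new ArrayList<String>();
            // The leader chooses the ordering: each append gets the next offset.
            long append(String value) {
                log.add(value);
                return log.size() - 1;
            }
        }

        static class Follower {
            final List<String> log = new ArrayList<String>();
            // Followers never reorder; they copy the leader's log from where they left off.
            void fetchFrom(Leader leader) {
                for (int i = log.size(); i < leader.log.size(); i++) {
                    log.add(leader.log.get(i));
                }
            }
        }
    }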
Modified: kafka/site/08/introduction.html
URL:
http://svn.apache.org/viewvc/kafka/site/08/introduction.html?rev=1571376&r1=1571375&r2=1571376&view=diff
==============================================================================
--- kafka/site/08/introduction.html (original)
+++ kafka/site/08/introduction.html Mon Feb 24 18:10:16 2014
@@ -43,7 +43,7 @@ Each partition has one server which acts
<h4>Producers</h4>
-Producers publish data to the topics of their choice. The producer is able to
chose which message to assign to which partition within the topic. This can be
done in a round-robin fashion simply to balance load or it can be done
according to some semantic partition function (say based on some key in the
message). More on the use of partitioning in a second.
+Producers publish data to the topics of their choice. The producer is able to
choose which message to assign to which partition within the topic. This can be
done in a round-robin fashion simply to balance load or it can be done
according to some semantic partition function (say based on some key in the
message). More on the use of partitioning in a second.
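The "semantic partition function" mentioned above is commonly just a hash of a message key modulo the number of partitions; the sketch below is illustrative only and is not the producer's built-in partitioner.

    public class KeyPartitionSketch {
        // Messages with the same key always map to the same partition,
        // which preserves per-key ordering within that partition.
        static int partitionFor(String key, int numPartitions) {
            return (key.hashCode() & 0x7fffffff) % numPartitions;
        }
    }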
<h4>Consumers</h4>
@@ -53,7 +53,7 @@ Consumers label themselves with a consum
<p>
If all the consumer instances have the same consumer group, then this works
just like a traditional queue balancing load over the consumers.
<p>
-If all the consumers instances have different consumer groups then this works
like publish-subscribe and all messages are broadcast to all consumers.
+If all the consumer instances have different consumer groups, then this works
like publish-subscribe and all messages are broadcast to all consumers.
<p>
More commonly, however, we have found that topics have a small number of
consumer groups, one for each "logical subscriber". Each group is composed of
many consumer instances for scalability and fault tolerance. This is nothing
more than publish-subscribe semantics where the subscriber is cluster of
consumers instead of a single process.
<p>
@@ -63,9 +63,9 @@ More commonly, however, we have found th
A two server Kafka cluster hosting four partitions (P0-P3) with two consumer
groups. Consumer group A has two consumer instances and group B has four.
</div>
<p>
-Kafka has stronger ordering guarantees than a traditional messaging system too.
+Kafka has stronger ordering guarantees than a traditional messaging system,
too.
<p>
-A traditional queue retains messages in-order on the server, and if multiple
consumers consume from the queue then the server hands out messages in the
order they are stored. However although the server hands out messages in order,
the messages are delivered asynchronously to consumers, so they may arrive out
of order on different consumers. This effectively means the ordering of the
messages is lost in the presence of parallel consumption. Messaging systems
often work around this by having a notion of "exclusive consumer" that allows
only on process to consume from a queue, but of course this means that there is
no parallelism in processing.
+A traditional queue retains messages in-order on the server, and if multiple
consumers consume from the queue then the server hands out messages in the
order they are stored. However, although the server hands out messages in
order, the messages are delivered asynchronously to consumers, so they may
arrive out of order on different consumers. This effectively means the ordering
of the messages is lost in the presence of parallel consumption. Messaging
systems often work around this by having a notion of "exclusive consumer" that
allows only one process to consume from a queue, but of course this means that
there is no parallelism in processing.
<p>
Kafka does it better. By having a notion of parallelism—the
partition—within the topics, Kafka is able to provide both ordering
guarantees and load balancing over a pool of consumer processes. This is
achieved by assigning the partitions in the topic to the consumers in the
consumer group so that each partition is consumed by exactly one consumer in
the group. By doing this we ensure that the consumer is the only reader of that
partition and consumes the data in order. Since there are many partitions this
still balances the load over many consumer instances. Note however that there
cannot be more consumer instances than partitions.
<p>
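The ordering-plus-load-balancing property described above follows from giving each partition to exactly one consumer in its group. A minimal assignment sketch (illustrative only, not the rebalancing algorithm Kafka actually uses):

    import java.util.ArrayList;
    import java.util.List;

    public class GroupAssignmentSketch {
        // Returns, for each consumer in the group, the partitions it alone will read.
        // Consumers beyond the partition count receive nothing, which is why adding
        // more consumer instances than partitions adds no parallelism.
        static List<List<Integer>> assign(int numPartitions, int numConsumers) {
            List<List<Integer>> assignment = new ArrayList<List<Integer>>();
            for (int c = 0; c < numConsumers; c++) assignment.add(new ArrayList<Integer>());
            for (int p = 0; p < numPartitions; p++) {
                assignment.get(p % numConsumers).add(p);
            }
            return assignment;
        }
    }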
@@ -73,10 +73,10 @@ Kafka only provides a total order over m
<h4>Guarantees</h4>
-At a high-level Kafka gives the following guarantees
+At a high-level Kafka gives the following guarantees:
<ul>
- <li>Messages sent by a producer to a particular topic partition will be
appended in the order they are sent. That is if a message M1 is sent by the
same producer as a message M2, and M1 is sent first, then M1 will have a lower
offset then M2 and appear earlier in the log.
+ <li>Messages sent by a producer to a particular topic partition will be
appended in the order they are sent. That is, if a message M1 is sent by the
same producer as a message M2, and M1 is sent first, then M1 will have a lower
offset than M2 and appear earlier in the log.
<li>A consumer instance sees messages in the order they are stored in the
log.
<li>For a topic with replication factor N, we will tolerate up to N-1 server
failures without losing any messages committed to the log.
</ul>
-More details on these guarantees are given in the design section of the
documentation.
\ No newline at end of file
+More details on these guarantees are given in the design section of the
documentation.
Modified: kafka/site/08/ops.html
URL:
http://svn.apache.org/viewvc/kafka/site/08/ops.html?rev=1571376&r1=1571375&r2=1571376&view=diff
==============================================================================
--- kafka/site/08/ops.html (original)
+++ kafka/site/08/ops.html Mon Feb 24 18:10:16 2014
@@ -31,7 +31,7 @@ The most important producer configuratio
</ul>
The most important consumer configuration is the fetch size.
<p>
-All configurations are documented in the <a
href="configuration.html">configuration</a> page.
+All configurations are documented in the <a
href="#configuration">configuration</a> section.
<p>
<h4><a id="prodconfig">A Production Server Config</a></h4>
Here is our server production server configuration: