Author: junrao
Date: Mon Feb 24 18:10:16 2014
New Revision: 1571376
URL: http://svn.apache.org/r1571376
Log:
KAFKA-1269; Minor typos in documentation; patched by Evan Zacks; reviewed by
Jun Rao
Modified:
kafka/site/08/configuration.html
kafka/site/08/design.html
kafka/site/08/introduction.html
kafka/site/08/ops.html
Modified: kafka/site/08/configuration.html
URL:
http://svn.apache.org/viewvc/kafka/site/08/configuration.html?rev=1571376&r1=1571375&r2=1571376&view=diff
==============================================================================
--- kafka/site/08/configuration.html (original)
+++ kafka/site/08/configuration.html Mon Feb 24 18:10:16 2014
@@ -19,7 +19,7 @@ The essential configurations are the fol
<tr>
<td>broker.id</td>
<td></td>
- <td>Each broker is uniquely identified by a non-negative integer id.
This id serves as the brokers "name", and allows the broker to be moved to a
different host/port without confusing consumers. You can choose any number you
like so long as it is unique.
+ <td>Each broker is uniquely identified by a non-negative integer id.
This id serves as the brokers "name" and allows the broker to be moved to a
different host/port without confusing consumers. You can choose any number you
like so long as it is unique.
</td>
</tr>
<tr>
@@ -42,7 +42,7 @@ Zookeeper also allows you to add a "chro
<tr>
<td>message.max.bytes</td>
<td>1000000</td>
- <td>The maximum size of a message that the server can receive. It is
important that this property be in sync with the maximum fetch size your
consumers use or else an unruly consumer will be able to publish messages too
large for consumers to consume.</td>
+ <td>The maximum size of a message that the server can receive. It is
important that this property be in sync with the maximum fetch size your
consumers use or else an unruly producer will be able to publish messages too
large for consumers to consume.</td>
</tr>
<tr>
<td>num.network.threads</td>
@@ -94,7 +94,7 @@ Zookeeper also allows you to add a "chro
<tr>
<td>log.segment.bytes.per.topic</td>
<td>""</td>
- <td>This setting allows overriding log.segment.bytes on a per-topic
basis</td>
+ <td>This setting allows overriding log.segment.bytes on a per-topic
basis.</td>
</tr>
<tr>
<td>log.roll.hours</td>
@@ -119,7 +119,7 @@ Zookeeper also allows you to add a "chro
<tr>
<td>log.retention.bytes</td>
<td>-1</td>
- <td>The amount of data to retain in the log for each topic-partitions.
Note that this is the limit per-partition so multiple by the number of
partitions to get the total data retained for the topic. Also note that if both
log.retention.hours and log.retention.bytes are both set we delete a segment
when either limit is exceeded.</td>
+ <td>The amount of data to retain in the log for each topic-partitions.
Note that this is the limit per-partition so multiply by the number of
partitions to get the total data retained for the topic. Also note that if both
log.retention.hours and log.retention.bytes are both set we delete a segment
when either limit is exceeded.</td>
</tr>
<tr>
<td>log.retention.bytes.per.topic</td>
@@ -185,7 +185,7 @@ Zookeeper also allows you to add a "chro
<tr>
<td>replica.lag.time.max.ms</td>
<td>10000</td>
- <td>If a follower hasn't sent any fetch requests for this window of
time, the leader will remove the follower from ISR and treat it as dead.</td>
+ <td>If a follower hasn't sent any fetch requests for this window of
time, the leader will remove the follower from ISR (in-sync replicas) and treat
it as dead.</td>
</tr>
<tr>
<td>replica.lag.max.messages</td>
@@ -247,12 +247,12 @@ Zookeeper also allows you to add a "chro
<tr>
<td>zookeeper.connection.timeout.ms</td>
<td>6000</td>
- <td>The max time that the client waits to establish a connection to
zookeeper.</td>
+ <td>The maximum amount of time that the client waits to establish a
connection to zookeeper.</td>
</tr>
<tr>
<td>zookeeper.sync.time.ms</td>
<td>2000</td>
- <td>How far a ZK follower can be behind a ZK leader</td>
+ <td>How far a ZK follower can be behind a ZK leader.</td>
</tr>
<tr>
<td>controlled.shutdown.enable</td>
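The properties discussed in the hunks above are set in the broker's server.properties file. A minimal sketch of how a few of them might look (the values are the defaults quoted on this page, shown only for illustration, not as recommendations):

    # Unique non-negative id for this broker; any value works as long as no other broker shares it.
    broker.id=0
    # Largest message the broker will accept; keep consumer fetch sizes at least this large.
    message.max.bytes=1000000
    # Per-partition retention limit; multiply by the partition count for the per-topic total (-1 = no size limit).
    log.retention.bytes=-1
    # Maximum time to wait when establishing the ZooKeeper connection.
    zookeeper.connection.timeout.ms=6000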
Modified: kafka/site/08/design.html
URL:
http://svn.apache.org/viewvc/kafka/site/08/design.html?rev=1571376&r1=1571375&r2=1571376&view=diff
==============================================================================
--- kafka/site/08/design.html (original)
+++ kafka/site/08/design.html Mon Feb 24 18:10:16 2014
@@ -153,14 +153,14 @@ These are not the strongest possible sem
<p>
Not all use cases require such strong guarantees. For uses which are latency
sensitive we allow the producer to specify the durability level it desires. If
the producer specifies that it wants to wait on the message being committed
this can take on the order of 10 ms. However the producer can also specify that
it wants to perform the send completely asynchronously or that it wants to wait
only until the leader (but not necessarily the followers) have the message.
<p>
-Now let's describe the semantics from the point-of-view of the consumer. All
replicas have the exact same log with the same offsets. The consumer controls
it's position in this log. If the consumer never crashed it could just store
this position in memory, but if the producer fails and we want this topic
partition to be taken over by another process the new process will need to
choose an appropriate position from which to start processing. Let's say the
consumer reads some messages it has several options for processing the messages
and updating its position.
+Now let's describe the semantics from the point-of-view of the consumer. All
replicas have the exact same log with the same offsets. The consumer controls
its position in this log. If the consumer never crashed it could just store
this position in memory, but if the producer fails and we want this topic
partition to be taken over by another process the new process will need to
choose an appropriate position from which to start processing. Let's say the
consumer reads some messages -- it has several options for processing the
messages and updating its position.
<ol>
<li>It can read the messages, then save its position in the log, and finally
process the messages. In this case there is a possibility that the consumer
process crashes after saving its position but before saving the output of its
message processing. In this case the process that took over processing would
start at the saved position even though a few messages prior to that position
had not been processed. This corresponds to "at-most-once" semantics as in the
case of a consumer failure messages may not be processed.
<li>It can read the messages, process the messages, and finally save its
position. In this case there is a possibility that the consumer process crashes
after processing messages but before saving its position. In this case when the
new process takes over the first few messages it receives will already have
been processed. This corresponds to the "at-least-once" semantics in the case
of consumer failure. In many cases messages have a primary key and so the
updates are idempotent (receiving the same message twice just overwrites a
record with another copy of itself).
- <li>So what about exactly once semantics (i.e. the thing you actually want)?
The limitation here is not actually a feature of the messaging system but
rather the need to co-ordinate the consumers position with what is actually
stored as output. The classic way of achieving this would be to introduce a
two-phase commit between the storage for the consumer position and the storage
of the consumers output. But this can be handled more simply and generally by
simply letting the consumer store its offset in the same place as its output.
This is better because many of the output systems a consumer might want to
write to will not support a two-phase commit. As example of this our Hadoop ETL
that populates data in HDFS stores its offsets in HDFS with the data it reads
so that it is guaranteed that either data and offsets are both updated or
neither is. We follow similar patterns for many other data systems which
require these stronger semantics and for which the messages do not have a
primary key to allow for deduplication.
+ <li>So what about exactly once semantics (i.e. the thing you actually want)?
The limitation here is not actually a feature of the messaging system but
rather the need to co-ordinate the consumer's position with what is actually
stored as output. The classic way of achieving this would be to introduce a
two-phase commit between the storage for the consumer position and the storage
of the consumers output. But this can be handled more simply and generally by
simply letting the consumer store its offset in the same place as its output.
This is better because many of the output systems a consumer might want to
write to will not support a two-phase commit. As an example of this, our Hadoop
ETL that populates data in HDFS stores its offsets in HDFS with the data it
reads so that it is guaranteed that either data and offsets are both updated or
neither is. We follow similar patterns for many other data systems which
require these stronger semantics and for which the messages do not have
a primary key to allow for deduplication.
</ol>
<p>
-So effectively Kafka guarantees at-least-once delivery by default and allows
the user to implement at most once delivery by disabling retries on the
producer and committing its offset prior to processing a batch of messages.
Exactly-once delivery requires co-operation with the destination storage system
but Kafka gives the offset which makes implementing this straight-forward.
+So effectively Kafka guarantees at-least-once delivery by default and allows
the user to implement at most once delivery by disabling retries on the
producer and committing its offset prior to processing a batch of messages.
Exactly-once delivery requires co-operation with the destination storage system
but Kafka provides the offset which makes implementing this straight-forward.
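The difference between options 1 and 2 in the list above is only the order of "save position" and "process". A minimal sketch of the two orderings, using hypothetical readBatch/process/saveOffset helpers (placeholders, not Kafka client APIs):

    import java.util.Collections;
    import java.util.List;

    public class DeliverySemanticsSketch {
        // At-most-once: commit the new position first, then process.
        // A crash between the two steps skips the unprocessed messages.
        static void atMostOnce(long offset) {
            List<String> batch = readBatch(offset);
            saveOffset(offset + batch.size());
            for (String m : batch) process(m);
        }

        // At-least-once: process first, then commit the new position.
        // A crash between the two steps causes the batch to be reprocessed.
        static void atLeastOnce(long offset) {
            List<String> batch = readBatch(offset);
            for (String m : batch) process(m);
            saveOffset(offset + batch.size());
        }

        // Placeholders only: a real consumer would read from Kafka and, for
        // exactly-once, write its offset to the same store as its output.
        static List<String> readBatch(long offset) { return Collections.emptyList(); }
        static void process(String m) {}
        static void saveOffset(long offset) {}
    }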
<h3><a id="replication">4.7 Replication</a></h3>
<p>
@@ -189,7 +189,7 @@ Kafka will remain available in the prese
<h4>Replicated Logs: Quorums, ISRs, and State Machines (Oh my!)</h4>
-At it's heart a Kafka partition is a replicated log. The replicated log is one
of the most basic primitives in distributed data systems, and there are many
approaches for implementing one. A replicated log can be used by other systems
as a primitive for implementing other distributed systems in the <a
href="http://en.wikipedia.org/wiki/State_machine_replication">state-machine
style</a>.
+At its heart a Kafka partition is a replicated log. The replicated log is one
of the most basic primitives in distributed data systems, and there are many
approaches for implementing one. A replicated log can be used by other systems
as a primitive for implementing other distributed systems in the <a
href="http://en.wikipedia.org/wiki/State_machine_replication">state-machine
style</a>.
<p>
A replicated log models the process of coming into consensus on the order of a
series of values (generally numbering the log entries 0, 1, 2, ...). There are
many ways to implement this, but the simplest and fastest is with a leader who
chooses the ordering of values provided to it. As long as the leader remains
alive, all followers need to only copy the values and ordering, the leader
chooses.
<p>
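As a toy illustration of the leader-based approach described in this hunk (not Kafka's implementation): the leader alone picks the order by assigning the next offset to each value, and followers only copy that order.

    import java.util.ArrayList;
    import java.util.List;

    class ReplicatedLogToy {
        static class Leader {
            final List<String> log = new ArrayList<String>();
            // The leader chooses the ordering: each append gets the next offset.
            long append(String value) {
                log.add(value);
                return log.size() - 1;
            }
        }

        static class Follower {
            final List<String> log = new ArrayList<String>();
            // Followers never reorder; they copy the leader's log from where they left off.
            void fetchFrom(Leader leader) {
                for (int i = log.size(); i < leader.log.size(); i++) {
                    log.add(leader.log.get(i));
                }
            }
        }
    }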
Modified: kafka/site/08/introduction.html
URL:
http://svn.apache.org/viewvc/kafka/site/08/introduction.html?rev=1571376&r1=1571375&r2=1571376&view=diff
==============================================================================
--- kafka/site/08/introduction.html (original)
+++ kafka/site/08/introduction.html Mon Feb 24 18:10:16 2014
@@ -43,7 +43,7 @@ Each partition has one server which acts
<h4>Producers</h4>
-Producers publish data to the topics of their choice. The producer is able to
chose which message to assign to which partition within the topic. This can be
done in a round-robin fashion simply to balance load or it can be done
according to some semantic partition function (say based on some key in the
message). More on the use of partitioning in a second.
+Producers publish data to the topics of their choice. The producer is able to
choose which message to assign to which partition within the topic. This can be
done in a round-robin fashion simply to balance load or it can be done
according to some semantic partition function (say based on some key in the
message). More on the use of partitioning in a second.
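The "semantic partition function" mentioned above is commonly just a hash of a message key modulo the number of partitions; the sketch below is illustrative only and is not the producer's built-in partitioner.

    public class KeyPartitionSketch {
        // Messages with the same key always map to the same partition,
        // which preserves per-key ordering within that partition.
        static int partitionFor(String key, int numPartitions) {
            return (key.hashCode() & 0x7fffffff) % numPartitions;
        }
    }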
<h4>Consumers</h4>
@@ -53,7 +53,7 @@ Consumers label themselves with a consum
<p>
If all the consumer instances have the same consumer group, then this works
just like a traditional queue balancing load over the consumers.
<p>
-If all the consumers instances have different consumer groups then this works
like publish-subscribe and all messages are broadcast to all consumers.
+If all the consumer instances have different consumer groups, then this works
like publish-subscribe and all messages are broadcast to all consumers.
<p>
More commonly, however, we have found that topics have a small number of
consumer groups, one for each "logical subscriber". Each group is composed of
many consumer instances for scalability and fault tolerance. This is nothing
more than publish-subscribe semantics where the subscriber is cluster of
consumers instead of a single process.
<p>
@@ -63,9 +63,9 @@ More commonly, however, we have found th
A two server Kafka cluster hosting four partitions (P0-P3) with two consumer
groups. Consumer group A has two consumer instances and group B has four.
</div>
<p>
-Kafka has stronger ordering guarantees than a traditional messaging system too.
+Kafka has stronger ordering guarantees than a traditional messaging system,
too.
<p>
-A traditional queue retains messages in-order on the server, and if multiple
consumers consume from the queue then the server hands out messages in the
order they are stored. However although the server hands out messages in order,
the messages are delivered asynchronously to consumers, so they may arrive out
of order on different consumers. This effectively means the ordering of the
messages is lost in the presence of parallel consumption. Messaging systems
often work around this by having a notion of "exclusive consumer" that allows
only on process to consume from a queue, but of course this means that there is
no parallelism in processing.
+A traditional queue retains messages in-order on the server, and if multiple
consumers consume from the queue then the server hands out messages in the
order they are stored. However, although the server hands out messages in
order, the messages are delivered asynchronously to consumers, so they may
arrive out of order on different consumers. This effectively means the ordering
of the messages is lost in the presence of parallel consumption. Messaging
systems often work around this by having a notion of "exclusive consumer" that
allows only one process to consume from a queue, but of course this means that
there is no parallelism in processing.
<p>
Kafka does it better. By having a notion of parallelism—the
partition—within the topics, Kafka is able to provide both ordering
guarantees and load balancing over a pool of consumer processes. This is
achieved by assigning the partitions in the topic to the consumers in the
consumer group so that each partition is consumed by exactly one consumer in
the group. By doing this we ensure that the consumer is the only reader of that
partition and consumes the data in order. Since there are many partitions this
still balances the load over many consumer instances. Note however that there
cannot be more consumer instances than partitions.
<p>
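The ordering-plus-load-balancing property described above follows from giving each partition to exactly one consumer in its group. A minimal assignment sketch (illustrative only, not the rebalancing algorithm Kafka actually uses):

    import java.util.ArrayList;
    import java.util.List;

    public class GroupAssignmentSketch {
        // Returns, for each consumer in the group, the partitions it alone will read.
        // Consumers beyond the partition count receive nothing, which is why adding
        // more consumer instances than partitions adds no parallelism.
        static List<List<Integer>> assign(int numPartitions, int numConsumers) {
            List<List<Integer>> assignment = new ArrayList<List<Integer>>();
            for (int c = 0; c < numConsumers; c++) assignment.add(new ArrayList<Integer>());
            for (int p = 0; p < numPartitions; p++) {
                assignment.get(p % numConsumers).add(p);
            }
            return assignment;
        }
    }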
@@ -73,10 +73,10 @@ Kafka only provides a total order over m
<h4>Guarantees</h4>
-At a high-level Kafka gives the following guarantees
+At a high-level Kafka gives the following guarantees:
<ul>
- <li>Messages sent by a producer to a particular topic partition will be
appended in the order they are sent. That is if a message M1 is sent by the
same producer as a message M2, and M1 is sent first, then M1 will have a lower
offset then M2 and appear earlier in the log.
+ <li>Messages sent by a producer to a particular topic partition will be
appended in the order they are sent. That is, if a message M1 is sent by the
same producer as a message M2, and M1 is sent first, then M1 will have a lower
offset than M2 and appear earlier in the log.
<li>A consumer instance sees messages in the order they are stored in the
log.
<li>For a topic with replication factor N, we will tolerate up to N-1 server
failures without losing any messages committed to the log.
</ul>
-More details on these guarantees are given in the design section of the
documentation.
\ No newline at end of file
+More details on these guarantees are given in the design section of the
documentation.
Modified: kafka/site/08/ops.html
URL:
http://svn.apache.org/viewvc/kafka/site/08/ops.html?rev=1571376&r1=1571375&r2=1571376&view=diff
==============================================================================
--- kafka/site/08/ops.html (original)
+++ kafka/site/08/ops.html Mon Feb 24 18:10:16 2014
@@ -31,7 +31,7 @@ The most important producer configuratio
</ul>
The most important consumer configuration is the fetch size.
<p>
-All configurations are documented in the <a
href="configuration.html">configuration</a> page.
+All configurations are documented in the <a
href="#configuration">configuration</a> section.
<p>
<h4><a id="prodconfig">A Production Server Config</a></h4>
Here is our server production server configuration: