Repository: kafka Updated Branches: refs/heads/trunk 06d6d98be -> e972d2afd
http://git-wip-us.apache.org/repos/asf/kafka/blob/e972d2af/docs/implementation.html ---------------------------------------------------------------------- diff --git a/docs/implementation.html b/docs/implementation.html index 12846fb..c22f4cf 100644 --- a/docs/implementation.html +++ b/docs/implementation.html @@ -199,7 +199,7 @@ value length : 4 bytes value : V bytes </pre> <p> -The use of the message offset as the message id is unusual. Our original idea was to use a GUID generated by the producer, and maintain a mapping from GUID to offset on each broker. But since a consumer must maintain an ID for each server, the global uniqueness of the GUID provides no value. Furthermore the complexity of maintaining the mapping from a random id to an offset requires a heavy weight index structure which must be synchronized with disk, essentially requiring a full persistent random-access data structure. Thus to simplify the lookup structure we decided to use a simple per-partition atomic counter which could be coupled with the partition id and node id to uniquely identify a message; this makes the lookup structure simpler, though multiple seeks per consumer request are still likely. However once we settled on a counter, the jump to directly using the offset seemed natural—both after all are monotonically increasing integers unique to a partition. Since the offset is hidden from the consumer API this decision is ultimately an implementation detail and we went with the more efficient approach. +The use of the message offset as the message id is unusual. Our original idea was to use a GUID generated by the producer, and maintain a mapping from GUID to offset on each broker. But since a consumer must maintain an ID for each server, the global uniqueness of the GUID provides no value.
Furthermore, the complexity of maintaining the mapping from a random id to an offset requires a heavyweight index structure which must be synchronized with disk, essentially requiring a full persistent random-access data structure. Thus, to simplify the lookup structure we decided to use a simple per-partition atomic counter which could be coupled with the partition id and node id to uniquely identify a message; this makes the lookup structure simpler, though multiple seeks per consumer request are still likely. However, once we settled on a counter, the jump to directly using the offset seemed natural—both after all are monotonically increasing integers unique to a partition. Since the offset is hidden from the consumer API this decision is ultimately an implementation detail and we went with the more efficient approach. </p> <img src="images/kafka_log.png"> <h4><a id="impl_writes" href="#impl_writes">Writes</a></h4> http://git-wip-us.apache.org/repos/asf/kafka/blob/e972d2af/docs/introduction.html ---------------------------------------------------------------------- diff --git a/docs/introduction.html b/docs/introduction.html index 484c0e7..e32ae7b 100644 --- a/docs/introduction.html +++ b/docs/introduction.html @@ -17,9 +17,9 @@ <h3> Kafka is <i>a distributed streaming platform</i>. What exactly does that mean?</h3> <p>We think of a streaming platform as having three key capabilities:</p> <ol> - <li>It let's you publish and subscribe to streams of records. In this respect it is similar to a message queue or enterprise messaging system. - <li>It let's you store streams of records in a fault-tolerant way. - <li>It let's you process streams of records as they occur. + <li>It lets you publish and subscribe to streams of records. In this respect it is similar to a message queue or enterprise messaging system. + <li>It lets you store streams of records in a fault-tolerant way. + <li>It lets you process streams of records as they occur.
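The per-partition atomic counter described in the implementation notes above can be sketched in a few lines (a simplified illustration, not Kafka's actual log code; the `Partition` class and method names are invented):

```python
import threading

class Partition:
    """Toy partition: an append-only list of records addressed by a
    monotonically increasing per-partition offset (the counter described
    above). The offset is simply the record's index in the log, so no
    GUID-to-offset index structure is needed."""

    def __init__(self):
        self._records = []
        self._lock = threading.Lock()

    def append(self, record):
        # The counter and the offset are the same thing: the next index.
        with self._lock:
            offset = len(self._records)
            self._records.append(record)
            return offset

    def read(self, offset):
        # Lookup is a direct index, not a random-access index lookup.
        return self._records[offset]

p = Partition()
assert p.append("a") == 0
assert p.append("b") == 1
assert p.read(1) == "b"
```

A message is then globally identified by the triple (node id, partition id, offset), which is why global uniqueness of a GUID buys nothing extra.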
</ol> <p>What is Kafka good for?</p> <p>It gets used for two broad classes of application:</p> @@ -56,7 +56,7 @@ In Kafka the communication between the clients and the servers is done with a si <p> Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log. The records in the partitions are each assigned a sequential id number called the <i>offset</i> that uniquely identifies each record within the partition. </p> <p> -The Kafka cluster retains all published records—whether or not they have been consumed—using a configurable retention period. For example if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size so storing data for a long time is not a problem. +The Kafka cluster retains all published records—whether or not they have been consumed—using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size so storing data for a long time is not a problem. </p> <img class="centered" src="images/log_consumer.png" style="width:400px"> <p> @@ -124,7 +124,7 @@ More details on these guarantees are given in the design section of the document How does Kafka's notion of streams compare to a traditional enterprise messaging system? </p> <p> -Messaging traditionally has two models: <a href="http://en.wikipedia.org/wiki/Message_queue">queuing</a> and <a href="http://en.wikipedia.org/wiki/Publish%E2%80%93subscribe_pattern">publish-subscribe</a>. In a queue, a pool of consumers may read from a server and each record goes to one of them; in publish-subscribe the record is broadcast to all consumers. 
Each of these two models has a strength and a weakness. The strength of queuing is that it allows you to divide up the processing of data over multiple consumer instances, which lets you scale your processing. Unfortunately queues aren't multi-subscriber—once one process reads the data it's gone. Publish-subscribe allows you broadcast data to multiple processes, but has no way of scaling processing since every message goes to every subscriber. +Messaging traditionally has two models: <a href="http://en.wikipedia.org/wiki/Message_queue">queuing</a> and <a href="http://en.wikipedia.org/wiki/Publish%E2%80%93subscribe_pattern">publish-subscribe</a>. In a queue, a pool of consumers may read from a server and each record goes to one of them; in publish-subscribe the record is broadcast to all consumers. Each of these two models has a strength and a weakness. The strength of queuing is that it allows you to divide up the processing of data over multiple consumer instances, which lets you scale your processing. Unfortunately, queues aren't multi-subscriber—once one process reads the data it's gone. Publish-subscribe allows you to broadcast data to multiple processes, but has no way of scaling processing since every message goes to every subscriber. </p> <p> The consumer group concept in Kafka generalizes these two concepts. As with a queue the consumer group allows you to divide up processing over a collection of processes (the members of the consumer group). As with publish-subscribe, Kafka allows you to broadcast messages to multiple consumer groups. @@ -164,7 +164,7 @@ It isn't enough to just read, write, and store streams of data, the purpose is t In Kafka a stream processor is anything that takes continual streams of data from input topics, performs some processing on this input, and produces continual streams of data to output topics.
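How consumer groups generalize both models can be sketched as a toy delivery loop (an illustrative model only, not Kafka's group coordination protocol; `Topic`, group names, and member names are invented):

```python
from itertools import cycle

class Topic:
    """Toy model: every subscribed group sees every record (pub-sub),
    but within a group each record goes to exactly one member (queue).
    Membership is modeled as simple round-robin."""

    def __init__(self):
        self._groups = {}  # group name -> round-robin iterator of members

    def subscribe(self, group, members):
        self._groups[group] = cycle(members)

    def publish(self, record):
        # One delivery per group; the chosen member rotates within a group.
        return {group: (next(members), record)
                for group, members in self._groups.items()}

t = Topic()
t.subscribe("billing", ["b1", "b2"])  # a two-member group: load is divided
t.subscribe("audit", ["a1"])          # a one-member group: sees everything
first = t.publish("r0")
second = t.publish("r1")
assert first["billing"][0] == "b1" and second["billing"][0] == "b2"
assert first["audit"][0] == "a1" and second["audit"][0] == "a1"
```

With one group you get a queue; with one member per group you get pure publish-subscribe; anything in between scales processing while still broadcasting across groups.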
</p> <p> -For example a retail application might take in input streams of sales and shipments, and output a stream of reorders and price adjustments computed off this data. +For example, a retail application might take in input streams of sales and shipments, and output a stream of reorders and price adjustments computed off this data. </p> <p> It is possible to do simple processing directly using the producer and consumer APIs. However for more complex transformations Kafka provides a fully integrated <a href="/documentation.html#streams">Streams API</a>. This allows building applications that do non-trivial processing that compute aggregations off of streams or join streams together. http://git-wip-us.apache.org/repos/asf/kafka/blob/e972d2af/docs/ops.html ---------------------------------------------------------------------- diff --git a/docs/ops.html b/docs/ops.html index a65269a..b1f1d0c 100644 --- a/docs/ops.html +++ b/docs/ops.html @@ -129,7 +129,10 @@ Here is an example showing how to mirror a single topic (named <i>my-topic</i>) </pre> Note that we specify the list of topics with the <code>--whitelist</code> option. This option allows any regular expression using <a href="http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html">Java-style regular expressions</a>. So you could mirror two topics named <i>A</i> and <i>B</i> using <code>--whitelist 'A|B'</code>. Or you could mirror <i>all</i> topics using <code>--whitelist '*'</code>. Make sure to quote any regular expression to ensure the shell doesn't try to expand it as a file path. For convenience we allow the use of ',' instead of '|' to specify a list of topics. <p> -Sometimes it is easier to say what it is that you <i>don't</i> want. Instead of using <code>--whitelist</code> to say what you want to mirror you can use <code>--blacklist</code> to say what to exclude. This also takes a regular expression argument. 
However, <code>--blacklist</code> is not supported when using <code>--new.consumer</code>. +Sometimes it is easier to say what it is that you <i>don't</i> want. Instead of using <code>--whitelist</code> to say what you want +to mirror you can use <code>--blacklist</code> to say what to exclude. This also takes a regular expression argument. +However, <code>--blacklist</code> is not supported when the new consumer has been enabled (i.e. when <code>bootstrap.servers</code> +has been defined in the consumer configuration). <p> Combining mirroring with the configuration <code>auto.create.topics.enable=true</code> makes it possible to have a replica cluster that will automatically create and replicate all data in a source cluster even as new topics are added. @@ -555,7 +558,7 @@ Note that durability in Kafka does not require syncing data to disk, as a failed <p> We recommend using the default flush settings which disable application fsync entirely. This means relying on the background flush done by the OS and Kafka's own background flush. This provides the best of all worlds for most uses: no knobs to tune, great throughput and latency, and full recovery guarantees. We generally feel that the guarantees provided by replication are stronger than sync to local disk, however the paranoid still may prefer having both and application level fsync policies are still supported. <p> -The drawback of using application level flush settings is that it is less efficient in it's disk usage pattern (it gives the OS less leeway to re-order writes) and it can introduce latency as fsync in most Linux filesystems blocks writes to the file whereas the background flushing does much more granular page-level locking. 
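The whitelist convenience noted above (',' accepted in place of '|') amounts to a one-line rewrite before compiling the Java-style regular expression. A minimal sketch in Python (the helper name is invented; this is not MirrorMaker's actual parsing code):

```python
import re

def whitelist_to_pattern(whitelist):
    """Treat commas in a topic whitelist as regex alternation,
    mirroring the convenience described above."""
    return re.compile(whitelist.replace(",", "|"))

# --whitelist 'A,B' behaves like --whitelist 'A|B'
pattern = whitelist_to_pattern("A,B")
assert pattern.fullmatch("A")
assert pattern.fullmatch("B")
assert not pattern.fullmatch("C")
```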
+The drawback of using application level flush settings is that it is less efficient in its disk usage pattern (it gives the OS less leeway to re-order writes) and it can introduce latency as fsync in most Linux filesystems blocks writes to the file whereas the background flushing does much more granular page-level locking. <p> In general you don't need to do any low-level tuning of the filesystem, but in the next few sections we will go over some of this in case it is useful. http://git-wip-us.apache.org/repos/asf/kafka/blob/e972d2af/docs/quickstart.html ---------------------------------------------------------------------- diff --git a/docs/quickstart.html b/docs/quickstart.html index 1038bfc..972c35d 100644 --- a/docs/quickstart.html +++ b/docs/quickstart.html @@ -67,7 +67,7 @@ test <h4><a id="quickstart_send" href="#quickstart_send">Step 4: Send some messages</a></h4> -<p>Kafka comes with a command line client that will take input from a file or from standard input and send it out as messages to the Kafka cluster. By default each line will be sent as a separate message.</p> +<p>Kafka comes with a command line client that will take input from a file or from standard input and send it out as messages to the Kafka cluster. By default, each line will be sent as a separate message.</p> <p> Run the producer and then type a few messages into the console to send to the server.</p> @@ -119,7 +119,7 @@ config/server-2.properties: listeners=PLAINTEXT://:9094 log.dir=/tmp/kafka-logs-2 </pre> -<p>The <code>broker.id</code> property is the unique and permanent name of each node in the cluster. We have to override the port and log directory only because we are running these all on the same machine and we want to keep the brokers from all trying to register on the same port or overwrite each others data.</p> +<p>The <code>broker.id</code> property is the unique and permanent name of each node in the cluster. 
We have to override the port and log directory only because we are running these all on the same machine and we want to keep the brokers from all trying to register on the same port or overwrite each other's data.</p> <p> We already have Zookeeper and our single node started, so we just need to start the two new nodes: </p> @@ -197,7 +197,7 @@ java.exe java -Xmx1G -Xms1G -server -XX:+UseG1GC ... build\libs\kafka_2.10-0 Topic:my-replicated-topic PartitionCount:1 ReplicationFactor:3 Configs: Topic: my-replicated-topic Partition: 0 Leader: 2 Replicas: 1,2,0 Isr: 2,0 </pre> -<p>But the messages are still be available for consumption even though the leader that took the writes originally is down:</p> +<p>But the messages are still available for consumption even though the leader that took the writes originally is down:</p> <pre> > <b>bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --from-beginning --topic my-replicated-topic</b> ... @@ -305,7 +305,7 @@ unbounded input data, it will periodically output its current state and results because it cannot know when it has processed "all" the input data. </p> <p> -We will now prepare input data to a Kafka topic, which will subsequently processed by a Kafka Streams application. +We will now prepare input data to a Kafka topic, which will subsequently be processed by a Kafka Streams application. </p> <!-- http://git-wip-us.apache.org/repos/asf/kafka/blob/e972d2af/docs/security.html ---------------------------------------------------------------------- diff --git a/docs/security.html b/docs/security.html index 2e77c93..24cd771 100644 --- a/docs/security.html +++ b/docs/security.html @@ -31,7 +31,7 @@ It's worth noting that security is optional - non-secured clusters are supported The guides below explain how to configure and use the security features in both clients and brokers. 
<h3><a id="security_ssl" href="#security_ssl">7.2 Encryption and Authentication using SSL</a></h3> -Apache Kafka allows clients to connect over SSL. By default SSL is disabled but can be turned on as needed. +Apache Kafka allows clients to connect over SSL. By default, SSL is disabled but can be turned on as needed. <ol> <li><h4><a id="security_ssl_key" href="#security_ssl_key">Generate SSL key and certificate for each Kafka broker</a></h4> @@ -425,7 +425,7 @@ Apache Kafka allows clients to connect over SSL. By default SSL is disabled but <ul> <li>SASL/PLAIN should be used only with SSL as transport layer to ensure that clear passwords are not transmitted on the wire without encryption.</li> <li>The default implementation of SASL/PLAIN in Kafka specifies usernames and passwords in the JAAS configuration file as shown - <a href="#security_sasl_plain_brokerconfig">here</a>. To avoid storing passwords on disk, you can plugin your own implementation of + <a href="#security_sasl_plain_brokerconfig">here</a>. To avoid storing passwords on disk, you can plug in your own implementation of <code>javax.security.auth.spi.LoginModule</code> that provides usernames and passwords from an external source. The login module implementation should provide username as the public credential and password as the private credential of the <code>Subject</code>. The default implementation <code>org.apache.kafka.common.security.plain.PlainLoginModule</code> can be used as an example.</li> @@ -616,7 +616,7 @@ Kafka Authorization management CLI can be found under bin directory with all the <li><b>Adding Acls</b><br> Suppose you want to add an acl "Principals User:Bob and User:Alice are allowed to perform Operation Read and Write on Topic Test-Topic from IP 198.51.100.0 and IP 198.51.100.1". 
You can do that by executing the CLI with following options: <pre>bin/kafka-acls.sh --authorizer-properties zookeeper.connect=localhost:2181 --add --allow-principal User:Bob --allow-principal User:Alice --allow-host 198.51.100.0 --allow-host 198.51.100.1 --operation Read --operation Write --topic Test-topic</pre> - By default all principals that don't have an explicit acl that allows access for an operation to a resource are denied. In rare cases where an allow acl is defined that allows access to all but some principal we will have to use the --deny-principal and --deny-host option. For example, if we want to allow all users to Read from Test-topic but only deny User:BadBob from IP 198.51.100.3 we can do so using following commands: + By default, all principals that don't have an explicit acl that allows access for an operation to a resource are denied. In rare cases where an allow acl is defined that allows access to all but some principal we will have to use the --deny-principal and --deny-host option. For example, if we want to allow all users to Read from Test-topic but only deny User:BadBob from IP 198.51.100.3 we can do so using the following commands: <pre>bin/kafka-acls.sh --authorizer-properties zookeeper.connect=localhost:2181 --add --allow-principal User:* --allow-host * --deny-principal User:BadBob --deny-host 198.51.100.3 --operation Read --topic Test-topic</pre> Note that ``--allow-host`` and ``--deny-host`` only support IP addresses (hostnames are not supported). Above examples add acls to a topic by specifying --topic [topic-name] as the resource option.
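The evaluation order described above (deny acls checked first, then allow acls, default deny) can be sketched as a few lines of logic (an illustrative model only, not Kafka's authorizer implementation; acls are simplified to (principal, host) pairs and '*' is the wildcard):

```python
def _matches(acl_value, actual):
    # '*' in an acl matches anything; otherwise require an exact match.
    return acl_value == "*" or acl_value == actual

def is_allowed(principal, host, allow_acls, deny_acls):
    """Deny wins over allow; anything not explicitly allowed is denied."""
    def hit(acls):
        return any(_matches(p, principal) and _matches(h, host)
                   for p, h in acls)
    if hit(deny_acls):
        return False
    return hit(allow_acls)

allow = [("*", "*")]                      # allow everyone...
deny = [("User:BadBob", "198.51.100.3")]  # ...except BadBob from that IP
assert is_allowed("User:Alice", "198.51.100.1", allow, deny)
assert not is_allowed("User:BadBob", "198.51.100.3", allow, deny)
assert is_allowed("User:BadBob", "198.51.100.9", allow, deny)  # other host
```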
Similarly user can add acls to cluster by specifying --cluster and to a consumer group by specifying --group [group-name].</li> http://git-wip-us.apache.org/repos/asf/kafka/blob/e972d2af/docs/upgrade.html ---------------------------------------------------------------------- diff --git a/docs/upgrade.html b/docs/upgrade.html index d140ec2..05b55e0 100644 --- a/docs/upgrade.html +++ b/docs/upgrade.html @@ -139,7 +139,7 @@ work with 0.10.0.x brokers. Therefore, 0.9.0.0 clients should be upgraded to 0.9 To avoid such message conversion before consumers are upgraded to 0.10.0.0, one can set log.message.format.version to 0.8.2 or 0.9.0 when upgrading the broker to 0.10.0.0. This way, the broker can still use zero-copy transfer to send the data to the old consumers. Once consumers are upgraded, one can change the message format to 0.10.0 on the broker and enjoy the new message format that includes new timestamp and improved compression. - The conversion is supported to ensure compatibility and can be useful to support a few apps that have not updated to newer clients yet, but is impractical to support all consumer traffic on even an overprovisioned cluster. Therefore it is critical to avoid the message conversion as much as possible when brokers have been upgraded but the majority of clients have not. + The conversion is supported to ensure compatibility and can be useful to support a few apps that have not updated to newer clients yet, but is impractical to support all consumer traffic on even an overprovisioned cluster. Therefore, it is critical to avoid the message conversion as much as possible when brokers have been upgraded but the majority of clients have not. </p> <p> For clients that are upgraded to 0.10.0.0, there is no performance impact. @@ -233,7 +233,7 @@ work with 0.10.0.x brokers. Therefore, 0.9.0.0 clients should be upgraded to 0.9 <li> The kafka-topics.sh script (kafka.admin.TopicCommand) now exits with non-zero exit code on failure. 
</li> <li> The kafka-topics.sh script (kafka.admin.TopicCommand) will now print a warning when topic names risk metric collisions due to the use of a '.' or '_' in the topic name, and error in the case of an actual collision. </li> <li> The kafka-console-producer.sh script (kafka.tools.ConsoleProducer) will use the Java producer instead of the old Scala producer by default, and users have to specify 'old-producer' to use the old producer. </li> - <li> By default all command line tools will print all logging messages to stderr instead of stdout. </li> + <li> By default, all command line tools will print all logging messages to stderr instead of stdout. </li> </ul> <h5><a id="upgrade_901_notable" href="#upgrade_901_notable">Notable changes in 0.9.0.1</a></h5>
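The message-format mitigation described earlier in the upgrade notes amounts to one entry in the broker configuration during the rolling upgrade (property name and versions are from the passage; pick the version your not-yet-upgraded consumers actually speak):

```properties
# server.properties on a broker upgraded to 0.10.0.0 whose consumers
# are still on 0.9.0: keep the on-disk message format old so the broker
# can use zero-copy transfer instead of converting on every fetch.
log.message.format.version=0.9.0
```

Once all consumers are upgraded, this setting can be changed to 0.10.0 so new messages get the new format with timestamps and improved compression.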