Github user guozhangwang commented on a diff in the pull request:
https://github.com/apache/kafka-site/pull/60#discussion_r121496464
--- Diff: 0110/design.html ---
@@ -261,15 +262,23 @@
<li>It can read the messages, process the messages, and finally save
its position. In this case there is a possibility that the consumer process
crashes after processing messages but before saving its position.
In this case when the new process takes over the first few messages it
receives will already have been processed. This corresponds to the
"at-least-once" semantics in the case of consumer failure. In many cases
messages have a primary key and so the updates are idempotent
(receiving the same message twice just overwrites a record with another copy of
itself).
- <li>So what about exactly once semantics (i.e. the thing you actually
want)? The limitation here is not actually a feature of the messaging system
but rather the need to co-ordinate the consumer's position with
+ </ol>
+ <p>
+ So what about exactly once semantics (i.e. the thing you actually
want)? The limitation here is not actually a feature of the messaging system
but rather the need to coordinate the consumer's position with
what is actually stored as output. The classic way of achieving this
would be to introduce a two-phase commit between the storage for the consumer
position and the storage of the consumers output. But this can be
handled more simply and generally by simply letting the consumer store
its offset in the same place as its output. This is better because many of the
output systems a consumer might want to write to will not
support a two-phase commit. As an example of this, our Hadoop ETL that
populates data in HDFS stores its offsets in HDFS with the data it reads so
that it is guaranteed that either data and offsets are both updated
or neither is. We follow similar patterns for many other data systems
which require these stronger semantics and for which the messages do not have a
primary key to allow for deduplication.
- </ol>
<p>
- So effectively Kafka guarantees at-least-once delivery by default and
allows the user to implement at most once delivery by disabling retries on the
producer and committing its offset prior to processing a batch of
- messages. Exactly-once delivery requires co-operation with the
destination storage system but Kafka provides the offset which makes
implementing this straight-forward.
+ A special case is when the output system is just another Kafka topic
(e.g. in a Kafka Streams application). Here we can leverage the new
transactional producer capabilities in 0.11.0.0 that were mentioned above.
+ Since the consumer's position is stored as a message in a topic, we
can ensure that that topic is included in the same transaction as the output
topics receiving the processed data. If the transaction is aborted,
+ the consumer's position will revert to its old value and none of the
output data will be visible to consumers. To enable this, consumers support an
"isolation level" to achieve this. In the default
+ "read_uncommitted" mode, all messages are visible to consumers even if
they were part of an aborted transaction, but in "read_committed" mode, the
consumer will only return data from transactions which were committed
+ (and any messages which were not part of any transaction).
+ <p>
+ So effectively Kafka guarantees at-least-once delivery by default, and
allows the user to implement at-most-once delivery by disabling retries on the
producer and committing its offset prior to processing a batch of
+ messages. Exactly-once delivery is supported when processing messages
between Kafka topics, such as in Kafka Streams applications. Exactly-once
delivery for other destination storage system generally requires
--- End diff --
nit: add a ref link when mentioning Kafka Streams?
https://kafka.apache.org/documentation/streams
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---