Hi Kyle, I added new documentation, which will hopefully help. Please take a look here: https://issues.apache.org/jira/browse/KAFKA-1555
I've heard rumors that you are very, very good at documenting, so I'm looking forward to your comments. Note that I'm completely ignoring the acks>1 case since we are about to remove it.

Gwen

On Wed, Oct 15, 2014 at 1:21 PM, Kyle Banker <kyleban...@gmail.com> wrote:
> Thanks very much for these clarifications, Gwen.
>
> I'd recommend modifying the following phrase describing "acks=-1":
>
> "This option provides the best durability, we guarantee that no messages
> will be lost as long as at least one in sync replica remains."
>
> The "as long as at least one in sync replica remains" is such a huge
> caveat. It should be noted that "acks=-1" provides no actual durability
> guarantees unless min.isr is also used to specify a majority of replicas.
>
> In addition, I was curious if you might comment on my other recent posting,
> "Consistency and Availability on Node Failures", and possibly add this
> scenario to the docs. With acks=-1 and min.isr=2 and a 3-replica topic in a
> 12-node Kafka cluster, there's a relatively high probability that losing 2
> nodes from this cluster will result in an inability to write to the cluster.
>
> On Tue, Oct 14, 2014 at 4:50 PM, Gwen Shapira <gshap...@cloudera.com> wrote:
>
>> acks=2 *will* throw an exception when there's only one node in ISR.
>>
>> The problem with acks=2 is that if you have 3 replicas and you got acks
>> from 2 of them, the one replica which did not get the message can
>> still be in ISR and get elected as leader, leading to the loss of the
>> message. If you specify acks=3, you can't tolerate the failure of a
>> single replica. Not amazing either.
>>
>> To make things even worse, when specifying the number of acks you
>> want, you don't always know how many replicas the topic should have,
>> so it's difficult to pick the correct number.
>>
>> acks=-1 solves that problem (since all messages need to get acked by
>> all replicas), but introduces the new problem of not getting an
>> exception if the ISR shrank to 1 replica.
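[Editor's note: the ack semantics described above can be sketched as a toy model. This is not Kafka code; `try_write` and its parameters are hypothetical names invented here purely to illustrate how the acks setting and a minimum-ISR check interact.]

```python
# Toy model (NOT Kafka code) of the ack semantics discussed in this thread.
def try_write(isr_count, acks, min_isr=1):
    """Return 'ack' if the write would be acknowledged, else raise."""
    if isr_count < min_isr:
        # With a min-ISR setting, the producer gets an error instead of
        # a silently under-replicated write.
        raise RuntimeError("NotEnoughReplicas: ISR=%d < min_isr=%d"
                           % (isr_count, min_isr))
    if acks == -1:
        return "ack"  # every *current* ISR member acked -- even if ISR is just 1
    if isr_count < acks:
        # Gwen's first point: acks=2 with a 1-node ISR does throw.
        raise RuntimeError("cannot gather %d acks from an ISR of %d"
                           % (acks, isr_count))
    return "ack"

# The failure mode above: acks=-1, ISR shrunk to 1, write still succeeds.
print(try_write(isr_count=1, acks=-1))        # -> ack
# Same situation with min_isr=2: the producer sees an exception instead.
# try_write(isr_count=1, acks=-1, min_isr=2)  # raises RuntimeError
```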
>>
>> That's why the min.isr configuration was added.
>>
>> I hope this clarifies things :)
>> I'm planning to add this to the docs in a day or two, so let me know
>> if there are any additional explanations or scenarios you think we
>> need to include.
>>
>> Gwen
>>
>> On Tue, Oct 14, 2014 at 12:27 PM, Scott Reynolds <sreyno...@twilio.com> wrote:
>> > A question about 0.8.1.1 and acks. I was under the impression that setting
>> > acks to 2 will not throw an exception when there is only one node in ISR.
>> > Am I incorrect? Thus the need for min_isr.
>> >
>> > On Tue, Oct 14, 2014 at 11:50 AM, Kyle Banker <kyleban...@gmail.com> wrote:
>> >
>> >> It's quite difficult to infer from the docs the exact techniques required
>> >> to ensure consistency and durability in Kafka. I propose that we add a doc
>> >> section detailing these techniques. I would be happy to help with this.
>> >>
>> >> The basic question is this: assuming that I can afford to temporarily halt
>> >> production to Kafka, how do I ensure that no message written to Kafka is
>> >> ever lost under typical failure scenarios (i.e., the loss of a single
>> >> broker)?
>> >>
>> >> Here's my understanding of this for Kafka v0.8.1.1:
>> >>
>> >> 1. Create a topic with a replication factor of 3.
>> >> 2. Use a sync producer and set acks to 2. (Setting acks to -1 may
>> >> successfully write even in a case where the data is written only to a
>> >> single node.)
>> >>
>> >> Even with these two precautions, there's always the possibility of an
>> >> "unclean leader election." Can data loss still occur in this scenario? Is
>> >> it possible to achieve this level of durability on v0.8.1.1?
>> >>
>> >> In Kafka v0.8.2, in addition to the above:
>> >>
>> >> 3. Ensure that the triple-replicated topic also disallows unclean leader
>> >> election (https://issues.apache.org/jira/browse/KAFKA-1028).
>> >>
>> >> 4. Set the min.isr value of the producer to 2 and acks to -1
>> >> (https://issues.apache.org/jira/browse/KAFKA-1555). The producer will then
>> >> throw an exception if data can't be written to 2 out of 3 nodes.
>> >>
>> >> In addition to producer configuration and usage, there are also monitoring
>> >> and operations considerations for achieving durability and consistency. As
>> >> those are rather nuanced, it'd probably be easiest to just start iterating
>> >> on a document to flesh those out.
>> >>
>> >> If anyone has any advice on how to better specify this, or how to get
>> >> started on improving the docs, I'm happy to help out.
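[Editor's note: Kyle's four-step recipe, and his availability caveat about losing 2 of 3 replicas, can be summarized in a small sketch. The config key names below (`min.insync.replicas`, `unclean.leader.election.enable`, `acks`) are the ones that eventually shipped in later Kafka releases; the 0.8.x-era names in the thread (`min.isr`, `request.required.acks`) differ, so treat these as assumptions, not a definitive 0.8.1.1 reference.]

```python
# Toy summary (NOT Kafka code) of the durability recipe in this thread.
topic_config = {
    "replication.factor": 3,                  # step 1: triple replication
    "unclean.leader.election.enable": False,  # step 3: KAFKA-1028
    "min.insync.replicas": 2,                 # step 4: KAFKA-1555 ("min.isr")
}
producer_config = {
    "acks": -1,  # step 4: wait for the full ISR, not a fixed replica count
}

# Durability/availability trade-off: writes keep succeeding only while no
# more than this many of the topic's replicas are down.
tolerated_failures = (topic_config["replication.factor"]
                      - topic_config["min.insync.replicas"])
print(tolerated_failures)  # -> 1: losing a 2nd replica makes the topic
                           # unwritable, which is Kyle's caveat upthread
```

The point of the arithmetic: with RF=3 and a minimum ISR of 2, the cluster trades write availability under a second replica failure for the guarantee that every acked message lives on at least 2 nodes.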