Consider a 12-node Kafka cluster with a 200-parition topic having a
replication factor of 3. Let's assume, in addition, that we're running
Kafka v0.8.2, we've disabled unclean leader election, acks is -1, and
min.isr is 2.

Now suppose we lose 2 nodes. In this case, there's a good chance that 2/3
replicas of one or more partitions will be unavailable. This means that
messages assigned to those partitions will not be writable. If we're
writing a large number of messages, I would expect that all producers would
eventually halt. It is somewhat surprising that, if we rely on a basic
durability setting, the cluster would likely be unavailable even after
losing only 2 / 12 nodes.

It might be useful in this scenario for the producer to be able to detect
which partitions are no longer available and reroute messages that would
have hashed to the unavailable partitions (as defined by our acks and
min.isr settings). This way, the cluster as a whole would remain available
for writes at the cost of a slightly higher load on the remaining machines.

Is this limitation accurately described? Is the proposed producer
functionality worth pursuing?

Reply via email to