Consider a 12-node Kafka cluster with a 200-partition topic that has a replication factor of 3. Let's also assume that we're running Kafka v0.8.2, we've disabled unclean leader election, acks is -1, and min.insync.replicas (min.isr below) is 2.
Now suppose we lose 2 nodes. With 200 partitions spread over 12 brokers, there's a good chance that 2 of the 3 replicas of one or more partitions will be unavailable. Those partitions drop below min.isr, so with acks set to -1, writes to them will be rejected. If we're writing a large number of messages, I would expect that every producer would eventually hit one of those partitions and halt. It is somewhat surprising that, with a fairly basic durability setting, the cluster would likely become unavailable for writes after losing only 2 of its 12 nodes.

It might be useful in this scenario for the producer to detect which partitions are no longer available (as defined by our acks and min.isr settings) and reroute messages that would have hashed to them to partitions that are still writable. This way, the cluster as a whole would remain available for writes at the cost of a slightly higher load on the remaining machines.

Is this limitation accurately described? Is the proposed producer functionality worth pursuing?
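
To put a rough number on the "good chance" above, here is a back-of-the-envelope estimate in Python. It assumes each partition's 3 replicas sit on a uniformly random set of distinct brokers and that partitions are independent; Kafka's real replica assignment is not random, so treat this as an approximation only. The function names are made up for this sketch.

```python
from math import comb

def p_partition_below_min_isr(brokers, rf, failed, min_isr):
    """Probability that one partition keeps fewer than min_isr live replicas
    when `failed` brokers are lost, assuming its rf replicas sit on a
    uniformly random set of distinct brokers (an assumption, not Kafka's
    actual assignment algorithm)."""
    total = comb(brokers, failed)
    p = 0.0
    # k = number of this partition's replicas that sit on failed brokers
    for k in range(0, min(rf, failed) + 1):
        if rf - k < min_isr:
            # Hypergeometric term: k failed brokers among the rf replica
            # holders, the other failed - k among the remaining brokers.
            p += comb(rf, k) * comb(brokers - rf, failed - k) / total
    return p

def p_any_partition_affected(partitions, p_single):
    # Treats partitions as independent, which is also an approximation.
    return 1.0 - (1.0 - p_single) ** partitions

p_one = p_partition_below_min_isr(brokers=12, rf=3, failed=2, min_isr=2)
p_any = p_any_partition_affected(200, p_one)
expected = 200 * p_one
print(f"per partition: {p_one:.4f}, any of 200: {p_any:.5f}, expected: {expected:.1f}")
```

Under these assumptions a single partition is affected with probability 3/66 (about 4.5%), so roughly 9 of the 200 partitions would be expected to fall below min.isr, and the chance that at least one does is very close to 1.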
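
The rerouting idea could look something like the following Python sketch. It is not based on any existing Kafka producer API: `choose_partition` and the `isr_sizes` input are hypothetical, and how a producer would learn the current ISR size of each partition is left open.

```python
import zlib

def choose_partition(key, isr_sizes, min_isr):
    """Pick a partition for `key` (bytes).

    isr_sizes[p] is the current in-sync replica count of partition p.
    This is a hypothetical input; the sketch does not say how a producer
    would obtain it.
    """
    n = len(isr_sizes)
    preferred = zlib.crc32(key) % n
    if isr_sizes[preferred] >= min_isr:
        # Normal case: the partition the key hashes to is writable.
        return preferred
    writable = [p for p in range(n) if isr_sizes[p] >= min_isr]
    if not writable:
        raise RuntimeError("no partition currently satisfies min_isr")
    # Rehash over the writable subset so rerouted keys spread out
    # instead of piling onto a single fallback partition.
    return writable[zlib.crc32(key) % len(writable)]

# Example: partition 1 has only one in-sync replica left.
isr_sizes = [3, 1, 3, 3]
print(choose_partition(b"user-42", isr_sizes, min_isr=2))
```

One consequence worth noting: a rerouted key lands in a different partition than it normally would, so anything that relies on a key always mapping to the same partition (per-key ordering, for instance) would be affected while the reroute is in place.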