[
https://issues.apache.org/jira/browse/KAFKA-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kyle Banker updated KAFKA-1736:
---
Description:
The current random strategy of partition-to-broker distribution combined with a
fairly typical use of min.isr and request.acks results in a suboptimal level of
availability.
Specifically, if all of your topics have a replication factor of 3, and you use
min.isr=2 and required.acks=all, then regardless of the number of the brokers
in the cluster, you can safely lose only 1 node. Losing more than 1 node will,
95% of the time, result in the inability to write to at least one partition,
thus rendering the cluster unavailable. As the total number of partitions
increases, so does this probability.
On the other hand, if partitions are distributed so that brokers are
effectively replicas of each other, then the probability of unavailability when
two nodes are lost is significantly decreased. This probability continues to
decrease as the size of the cluster increases and, more significantly, this
probability is constant with respect to the total number of partitions. The
only requirement for getting these numbers with this strategy is that the
number of brokers be a multiple of the replication factor.
Here are of the results of some simulations I've run:
With Random Partition Assignment
Number of Brokers / Number of Partitions / Replication Factor / Probability
that two randomly selected nodes will contain at least 1 of the same partitions
6 / 54 / 3 / .999
9 / 54 / 3 / .986
12 / 54 / 3 / .894
Broker-Replica-Style Partitioning
Number of Brokers / Number of Partitions / Replication Factor / Probability
that two randomly selected nodes will contain at least 1 of the same partitions
6 / 54 / 3 / .424
9 / 54 / 3 / .228
12 / 54 / 3 / .168
Adopting this strategy will greatly increase availability for users wanting
majority-style durability and should not change current behavior as long as
leader partitions are assigned evenly. I don't know of any negative impact for
other use cases, as in these cases, the distribution will still be effectively
random.
Let me know if you'd like to see simulation code and whether a patch would be
welcome.
EDIT: Just to clarify, here's how the current partition assigner would assign 9
partitions with 3 replicas each to a 9-node cluster (broker number -> set of
replicas).
0 = Some(List(2, 3, 4))
1 = Some(List(3, 4, 5))
2 = Some(List(4, 5, 6))
3 = Some(List(5, 6, 7))
4 = Some(List(6, 7, 8))
5 = Some(List(7, 8, 9))
6 = Some(List(8, 9, 1))
7 = Some(List(9, 1, 2))
8 = Some(List(1, 2, 3))
Here's how I'm proposing they be assigned:
0 = Some(ArrayBuffer(8, 5, 2))
1 = Some(ArrayBuffer(8, 5, 2))
2 = Some(ArrayBuffer(8, 5, 2))
3 = Some(ArrayBuffer(7, 4, 1))
4 = Some(ArrayBuffer(7, 4, 1))
5 = Some(ArrayBuffer(7, 4, 1))
6 = Some(ArrayBuffer(6, 3, 0))
7 = Some(ArrayBuffer(6, 3, 0))
8 = Some(ArrayBuffer(6, 3, 0))
was:
The current random strategy of partition-to-broker distribution combined with a
fairly typical use of min.isr and request.acks results in a suboptimal level of
availability.
Specifically, if all of your topics have a replication factor of 3, and you use
min.isr=2 and required.acks=all, then regardless of the number of the brokers
in the cluster, you can safely lose only 1 node. Losing more than 1 node will,
95% of the time, result in the inability to write to at least one partition,
thus rendering the cluster unavailable. As the total number of partitions
increases, so does this probability.
On the other hand, if partitions are distributed so that brokers are
effectively replicas of each other, then the probability of unavailability when
two nodes are lost is significantly decreased. This probability continues to
decrease as the size of the cluster increases and, more significantly, this
probability is constant with respect to the total number of partitions. The
only requirement for getting these numbers with this strategy is that the
number of brokers be a multiple of the replication factor.
Here are of the results of some simulations I've run:
With Random Partition Assignment
Number of Brokers / Number of Partitions / Replication Factor / Probability
that two randomly selected nodes will contain at least 1 of the same partitions
6 / 54 / 3 / .999
9 / 54 / 3 / .986
12 / 54 / 3 / .894
Broker-Replica-Style Partitioning
Number of Brokers / Number of Partitions / Replication Factor / Probability
that two randomly selected nodes will contain at least 1 of the same partitions
6 / 54 / 3 / .424
9 / 54 / 3 / .228
12 / 54 / 3 / .168
Adopting this strategy will greatly increase availability for users wanting
majority-style durability and should not change current behavior as long as
leader partitions are assigned evenly. I don't know of any negative impact for
other use cases, as in these cases, the distribut