[
https://issues.apache.org/jira/browse/KAFKA-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187701#comment-14187701
]
Jay Kreps commented on KAFKA-1736:
----------------------------------
Yes, this is correct. This is a common tradeoff in this type of partitioned
system. The same math applies to availability (losing enough nodes that a
partition drops below min.isr) and to data loss (losing enough nodes to cover a
partition's full replication factor). It actually also applies to multi-tenancy
problems, e.g. if you have a crazy producer overloading one topic, how many
other topics are impacted?
If you do random placement, then any time you lose 3 nodes you will likely have
data loss in at least one partition. However, identical node-to-node replication
is no panacea either. If you have identical replicas, then losing 3 nodes will
probably lose no partitions at all; but when you do lose a partition, you will
probably lose every partition on that machine. I believe, though I haven't done
the math, that the expected total data loss is the same either way: in one mode
the probability of some data loss is high and the probability of large-scale
loss is low; in the other, the probability of some data loss is low but the
probability of total loss is comparatively high.
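To make this concrete, here is a rough Monte-Carlo sketch of the tradeoff. It is
illustrative only; the parameters (9 brokers, 54 partitions, replication factor
3, 3 simultaneous failures) are just picked to match the numbers in the ticket,
and the names are made up:

import scala.util.Random

object PlacementLossSim {
  // Assumed parameters for illustration only.
  val brokers = 9
  val partitions = 54
  val rf = 3
  val failures = 3
  val trials = 100000

  // Random placement: each partition's replicas land on rf distinct random brokers.
  def randomAssignment(rnd: Random): Seq[Set[Int]] =
    Seq.fill(partitions)(rnd.shuffle((0 until brokers).toList).take(rf).toSet)

  // "Identical replica" placement: brokers are grouped into fixed clumps of size rf
  // and every partition lives entirely inside one clump (requires brokers % rf == 0).
  def clumpedAssignment(rnd: Random): Seq[Set[Int]] = {
    val clumps = (0 until brokers).grouped(rf).map(_.toSet).toVector
    Seq.tabulate(partitions)(p => clumps(p % clumps.size))
  }

  // A partition is lost when all of its replicas sit on failed brokers.
  def simulate(assign: Random => Seq[Set[Int]]): (Double, Double) = {
    val rnd = new Random(42)
    var anyLoss = 0
    var totalLost = 0L
    for (_ <- 1 to trials) {
      val placement = assign(rnd)
      val failed = rnd.shuffle((0 until brokers).toList).take(failures).toSet
      val lost = placement.count(_.subsetOf(failed))
      if (lost > 0) anyLoss += 1
      totalLost += lost
    }
    (anyLoss.toDouble / trials, totalLost.toDouble / trials)
  }

  def main(args: Array[String]): Unit = {
    val (pRandom, eRandom) = simulate(randomAssignment)
    val (pClumped, eClumped) = simulate(clumpedAssignment)
    println(f"random:  P(any loss)=$pRandom%.3f  E[partitions lost]=$eRandom%.2f")
    println(f"clumped: P(any loss)=$pClumped%.3f  E[partitions lost]=$eClumped%.2f")
  }
}

For these particular parameters the expectation actually is identical either
way: a partition is lost only if the 3 failed brokers are exactly its replica
set, which happens with probability 1/C(9,3) = 1/84 per partition regardless of
placement, so the expected loss is 54/84, roughly 0.64 partitions, in both
cases. What differs is the shape: random placement spreads that expectation over
many small-loss outcomes, while identical replicas concentrate it into rare
outcomes that wipe out a whole clump.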
Another problem with your proposal is that the replication factor is set per
topic, so the placement you describe is only possible if all topics have the
same replication factor.
However, these two extremes are not the only options. A generalization would be
to divide the N machines in the cluster arbitrarily into C clumps and attempt to
place the partitions for a given topic entirely within a single clump. If C=1
you get the current strategy; if C=N/3 you get your strategy for replication
factor 3. But any C should be possible, which balances between these extremes
and doesn't depend on a single replication factor across the cluster.
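As a rough sketch of what that could look like (not a patch; the object and
method names and the equal-sized-clump restriction are just simplifying
assumptions):

import scala.util.Random

object ClumpAssignment {
  // Divide the brokers into numClumps clumps and keep all replicas of a given
  // partition inside one clump, rotating within the clump to spread leadership.
  def assign(brokerIds: Seq[Int], numClumps: Int, numPartitions: Int,
             replicationFactor: Int, rnd: Random = new Random()): Map[Int, Seq[Int]] = {
    require(brokerIds.size % numClumps == 0, "equal-sized clumps assumed for simplicity")
    val clumpSize = brokerIds.size / numClumps
    require(clumpSize >= replicationFactor, "each clump must hold at least replicationFactor brokers")
    val clumps = rnd.shuffle(brokerIds.toList).grouped(clumpSize).toVector
    (0 until numPartitions).map { p =>
      val clump = clumps(p % numClumps)
      val rotated = clump.drop(p % clumpSize) ++ clump.take(p % clumpSize)
      p -> rotated.take(replicationFactor)
    }.toMap
  }
}

With numClumps = 1 this degenerates to spreading each topic's partitions across
all brokers, roughly the current behavior; with numClumps = N / replication
factor it reduces to the broker-replica layout proposed below, and anything in
between is a middle ground.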
> Improve partition-broker assignment strategy for better availability in
> majority durability modes
> -----------------------------------------------------------------------------------------------
>
> Key: KAFKA-1736
> URL: https://issues.apache.org/jira/browse/KAFKA-1736
> Project: Kafka
> Issue Type: Improvement
> Affects Versions: 0.8.1.1
> Reporter: Kyle Banker
> Priority: Minor
>
> The current random strategy of partition-to-broker distribution, combined with
> a fairly typical use of min.isr and required.acks, results in a suboptimal
> level of availability.
> Specifically, if all of your topics have a replication factor of 3, and you
> use min.isr=2 and required.acks=all, then regardless of the number of brokers
> in the cluster, you can safely lose only 1 node. Losing more than 1 node will,
> 95% of the time, result in the inability to write to at least one partition,
> thus rendering the cluster unavailable. As the total number of partitions
> increases, so does this probability.
> On the other hand, if partitions are distributed so that brokers are
> effectively replicas of each other, then the probability of unavailability
> when two nodes are lost is significantly decreased. This probability
> continues to decrease as the size of the cluster increases and, more
> significantly, this probability is constant with respect to the total number
> of partitions. The only requirement for getting these numbers with this
> strategy is that the number of brokers be a multiple of the replication
> factor.
> Here are the results of some simulations I've run:
> With Random Partition Assignment
> Brokers / Partitions / Replication Factor / P(two randomly selected nodes share at least 1 partition)
> 6 / 54 / 3 / .999
> 9 / 54 / 3 / .986
> 12 / 54 / 3 / .894
> Broker-Replica-Style Partitioning
> Brokers / Partitions / Replication Factor / P(two randomly selected nodes share at least 1 partition)
> 6 / 54 / 3 / .424
> 9 / 54 / 3 / .228
> 12 / 54 / 3 / .168
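> (The simulation code itself isn't attached here; the following is only an
> illustrative sketch of how numbers like these could be estimated, not the code
> that produced the figures above.)
>
> import scala.util.Random
>
> object SharedPartitionSim {
>   // Estimates the probability that two randomly selected brokers host at least
>   // one partition in common, under random or broker-replica-style placement.
>   def estimate(brokers: Int, partitions: Int, rf: Int,
>                randomPlacement: Boolean, trials: Int = 100000): Double = {
>     val rnd = new Random(1)
>     val clumps = (0 until brokers).grouped(rf).map(_.toSet).toVector
>     val hits = (1 to trials).count { _ =>
>       val assignment: Seq[Set[Int]] =
>         if (randomPlacement)
>           Seq.fill(partitions)(rnd.shuffle((0 until brokers).toList).take(rf).toSet)
>         else
>           Seq.tabulate(partitions)(p => clumps(p % clumps.size))
>       val picked = rnd.shuffle((0 until brokers).toList).take(2)
>       assignment.exists(r => r.contains(picked(0)) && r.contains(picked(1)))
>     }
>     hits.toDouble / trials
>   }
> }
>
> For example, estimate(9, 54, 3, randomPlacement = true) should land close to
> the random-placement row above, while randomPlacement = false gives a figure in
> the same neighborhood as the broker-replica-style rows (the exact value depends
> on how the clumps are formed).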
> Adopting this strategy will greatly increase availability for users wanting
> majority-style durability and should not change current behavior as long as
> leader partitions are assigned evenly. I don't know of any negative impact on
> other use cases, since in those cases the distribution will still be
> effectively random.
> Let me know if you'd like to see simulation code and whether a patch would be
> welcome.
> EDIT: Just to clarify, here's how the current partition assigner would assign
> 9 partitions with 3 replicas each to a 9-node cluster (broker number -> set
> of replicas).
> 0 = Some(List(2, 3, 4))
> 1 = Some(List(3, 4, 5))
> 2 = Some(List(4, 5, 6))
> 3 = Some(List(5, 6, 7))
> 4 = Some(List(6, 7, 8))
> 5 = Some(List(7, 8, 9))
> 6 = Some(List(8, 9, 1))
> 7 = Some(List(9, 1, 2))
> 8 = Some(List(1, 2, 3))
> Here's how I'm proposing they be assigned:
> 0 = Some(ArrayBuffer(8, 5, 2))
> 1 = Some(ArrayBuffer(8, 5, 2))
> 2 = Some(ArrayBuffer(8, 5, 2))
> 3 = Some(ArrayBuffer(7, 4, 1))
> 4 = Some(ArrayBuffer(7, 4, 1))
> 5 = Some(ArrayBuffer(7, 4, 1))
> 6 = Some(ArrayBuffer(6, 3, 0))
> 7 = Some(ArrayBuffer(6, 3, 0))
> 8 = Some(ArrayBuffer(6, 3, 0))
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)