[ https://issues.apache.org/jira/browse/KAFKA-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189451#comment-14189451 ]
Jay Kreps commented on KAFKA-1736:
----------------------------------

[~kbanker]:

1. I do think this is a real problem and it would be nice to address, though there are some complexities.

2. Another aspect I didn't mention before is load balancing in the case of failure. The leader does a bit more work than the slaves, since it serves all the consumer reads as well as the reads for slaves. Reads are cheaper than writes but not free. In the current setup, if a server fails, leadership is randomly dispersed and should spread fairly evenly across the cluster, so in a large cluster each server will add only a very small percentage of additional load due to additional leadership responsibility. However, in the C=3 case you described, when a server fails, 50% of its leadership load will fall to each of the other two machines in that clump. This will be a nasty surprise, as you may well then run out of capacity in the rest of the clump. This argues for a largish clump size.

3. I agree with Gwen that asking people to specify C will likely be a bit confusing--99% of people will just take the default, so making the default good is the important thing. I was actually thinking we would just choose C for you in some reasonable way (e.g. just set it to 5 or 10; we should be able to analyze this problem far better than most people will). I was actually just using the clumping thing to show that you don't necessarily need exact replicas to get the benefit.

4. I chatted with Neha and we could make the strategy pluggable, and this kind of reduces risk since we can then default to the current approach, but it also reduces benefit because ~0% of people will choose anything but the default. I do think this is a real problem, so if we are going to make a change here we should make it in the default case (or ideally just have one case). I actually don't think there is a real use case for choosing the current strategy if you had a well-tested version of the other approach.
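The failover math in point 2 can be sanity-checked with a quick sketch (illustrative Python, not from the ticket; function names are hypothetical): with clumps of size C, a failed broker's leadership splits among its C-1 clump peers, whereas fully dispersed leadership spreads it across all remaining brokers.

```python
# Back-of-the-envelope check of point 2, assuming leadership was spread
# evenly before the failure. Illustrative sketch, not Kafka code.

def extra_load_per_survivor(clump_size):
    """With clumps of size C, a failed broker's leadership is split
    among the C-1 other members of its clump."""
    return 1.0 / (clump_size - 1)

def extra_load_random(cluster_size):
    """With fully dispersed leadership, the failed broker's load is
    spread across all N-1 remaining brokers."""
    return 1.0 / (cluster_size - 1)

for c in (3, 5, 10):
    print(f"C={c}: each clump peer absorbs "
          f"{extra_load_per_survivor(c):.1%} of the failed leader's load")
# For comparison, random dispersal in a 30-broker cluster:
print(f"random, 30 brokers: {extra_load_random(30):.1%} per survivor")
```

At C=3 each survivor absorbs 50% extra leadership load, at C=10 only about 11%, which is the argument for a largish clump size.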
This does mean the bar for testing will be higher if you undertake this.

5. So including the rack awareness, the algorithm will be a bit complex, as we will need to first choose clumps in a way that maximally spans racks (which I think is the point of the rack-aware patch). Then you will assign a partition by first choosing a clump and then choosing a starting position within this clump. The ordering of the replica list is important, as Neha points out, since we use the first entry as the "preferred replica" and attempt to keep leadership there when that node is available. You don't want to end up in a scenario where you have a clump of nodes (1,2,3) sharing all the same partitions but all leadership on one node. One approach would be to randomize; however, what the current code does is actually a little better by explicitly going round-robin in its assignment, so it is slightly more fair than random. That would place one partition. You would then need to round-robin over clumps to place the additional partitions. Spreading a topic over multiple clumps is important because what we have observed is that in practice you get some power-law distribution in topic size, and you really don't want to place a bunch of partitions from a really high-volume topic all on the same machine (you want to spread them as widely as possible).

6. [~gwenshap] this would just be a change in the replica placement algorithm, nothing major in Kafka (unlike the rack placement patch, I don't think we need any more metadata for this), so although the code is tricky it should be very contained, I think.
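Setting rack awareness aside, the placement described in point 5 can be sketched roughly as follows (illustrative Python; the function and variable names are hypothetical and this is not Kafka's actual implementation): partitions round-robin over clumps, and the starting position inside the clump also rotates so that preferred leadership stays balanced within each clump.

```python
# A minimal sketch of clump-based replica placement as described in
# point 5, with rack awareness omitted. Names are hypothetical.

def assign_replicas(brokers, num_partitions, replication_factor):
    # The clump scheme assumes broker count is a multiple of RF.
    assert len(brokers) % replication_factor == 0
    clumps = [brokers[i:i + replication_factor]
              for i in range(0, len(brokers), replication_factor)]
    assignment = {}
    for p in range(num_partitions):
        clump = clumps[p % len(clumps)]  # round-robin over clumps
        # Rotate the starting slot so preferred leadership (the first
        # entry in the replica list) is spread across the clump.
        start = (p // len(clumps)) % replication_factor
        assignment[p] = [clump[(start + i) % replication_factor]
                         for i in range(replication_factor)]
    return assignment

# 9 brokers, 9 partitions, RF=3: every replica set is one of three
# clumps, and the leader slot rotates within each clump.
print(assign_replicas(list(range(9)), 9, 3))
```

Note how partitions 0, 3, and 6 all live on clump (0,1,2) but list a different preferred replica first, avoiding the "all leadership on one node" problem.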
> Improve partition-broker assignment strategy for better availability in
> majority durability modes
> -----------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-1736
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1736
>             Project: Kafka
>          Issue Type: Improvement
>    Affects Versions: 0.8.1.1
>            Reporter: Kyle Banker
>            Priority: Minor
>         Attachments: Partitioner.scala
>
> The current random strategy of partition-to-broker distribution combined with a fairly typical use of min.isr and request.acks results in a suboptimal level of availability.
>
> Specifically, if all of your topics have a replication factor of 3, and you use min.isr=2 and required.acks=all, then regardless of the number of brokers in the cluster, you can safely lose only 1 node. Losing more than 1 node will, 95% of the time, result in the inability to write to at least one partition, thus rendering the cluster unavailable. As the total number of partitions increases, so does this probability.
>
> On the other hand, if partitions are distributed so that brokers are effectively replicas of each other, then the probability of unavailability when two nodes are lost is significantly decreased. This probability continues to decrease as the size of the cluster increases and, more significantly, this probability is constant with respect to the total number of partitions. The only requirement for getting these numbers with this strategy is that the number of brokers be a multiple of the replication factor.
> Here are the results of some simulations I've run:
>
> With Random Partition Assignment
> Number of Brokers / Number of Partitions / Replication Factor / Probability that two randomly selected nodes will contain at least 1 of the same partitions
> 6 / 54 / 3 / .999
> 9 / 54 / 3 / .986
> 12 / 54 / 3 / .894
>
> Broker-Replica-Style Partitioning
> Number of Brokers / Number of Partitions / Replication Factor / Probability that two randomly selected nodes will contain at least 1 of the same partitions
> 6 / 54 / 3 / .424
> 9 / 54 / 3 / .228
> 12 / 54 / 3 / .168
>
> Adopting this strategy will greatly increase availability for users wanting majority-style durability and should not change current behavior as long as leader partitions are assigned evenly. I don't know of any negative impact for other use cases, as in these cases the distribution will still be effectively random.
>
> Let me know if you'd like to see simulation code and whether a patch would be welcome.
>
> EDIT: Just to clarify, here's how the current partition assigner would assign 9 partitions with 3 replicas each to a 9-node cluster (partition number -> set of replicas):
> 0 = Some(List(2, 3, 4))
> 1 = Some(List(3, 4, 5))
> 2 = Some(List(4, 5, 6))
> 3 = Some(List(5, 6, 7))
> 4 = Some(List(6, 7, 8))
> 5 = Some(List(7, 8, 9))
> 6 = Some(List(8, 9, 1))
> 7 = Some(List(9, 1, 2))
> 8 = Some(List(1, 2, 3))
>
> Here's how I'm proposing they be assigned:
> 0 = Some(ArrayBuffer(8, 5, 2))
> 1 = Some(ArrayBuffer(8, 5, 2))
> 2 = Some(ArrayBuffer(8, 5, 2))
> 3 = Some(ArrayBuffer(7, 4, 1))
> 4 = Some(ArrayBuffer(7, 4, 1))
> 5 = Some(ArrayBuffer(7, 4, 1))
> 6 = Some(ArrayBuffer(6, 3, 0))
> 7 = Some(ArrayBuffer(6, 3, 0))
> 8 = Some(ArrayBuffer(6, 3, 0))

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
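The overlap probabilities quoted in the description can be roughly re-derived with a short Monte Carlo sketch (illustrative Python; this is not the reporter's simulation code, and exact numbers will differ slightly from the table): for a pair of brokers, check whether they host at least one common partition under random placement versus clump-style placement.

```python
import random
from itertools import combinations

# Rough re-check of the simulation in the description: probability that
# two randomly chosen brokers host at least one common partition.
# Illustrative sketch; not the original simulation code.

def random_assignment(num_brokers, num_partitions, rf, rng):
    """Each partition gets rf distinct brokers chosen at random."""
    return [rng.sample(range(num_brokers), rf) for _ in range(num_partitions)]

def clump_assignment(num_brokers, num_partitions, rf):
    """Partitions round-robin over fixed clumps of rf brokers."""
    clumps = [list(range(i, i + rf)) for i in range(0, num_brokers, rf)]
    return [clumps[p % len(clumps)] for p in range(num_partitions)]

def overlap_probability(assignment, num_brokers):
    """Fraction of broker pairs that share at least one partition."""
    hosted = {b: set() for b in range(num_brokers)}
    for p, replicas in enumerate(assignment):
        for b in replicas:
            hosted[b].add(p)
    pairs = list(combinations(range(num_brokers), 2))
    overlapping = sum(1 for x, y in pairs if hosted[x] & hosted[y])
    return overlapping / len(pairs)

rng = random.Random(1)
trials = 200
p_random = sum(overlap_probability(random_assignment(12, 54, 3, rng), 12)
               for _ in range(trials)) / trials
p_clump = overlap_probability(clump_assignment(12, 54, 3), 12)
print(f"12 brokers, 54 partitions, RF=3: random ~{p_random:.3f}, clump {p_clump:.3f}")
```

Under the clump scheme the value is deterministic: two brokers overlap only when they are in the same clump, so with 12 brokers in 4 clumps of 3 it is 12/66, roughly 0.18, in the same ballpark as the .168 in the table, while random placement stays near 0.9.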