[ https://issues.apache.org/jira/browse/KAFKA-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191318#comment-14191318 ]
Kyle Banker commented on KAFKA-1736:
------------------------------------

Thanks for all your notes on this, [~jkreps], [~nehanarkhede], and [~gwenshap]. It sounds like there's general agreement that, if we can implement a good general solution, we should. I'm going to try to summarize the key requirements as I currently understand them, along with the open questions, based on what everyone has written.

Requirements

Each time a topic is created, the partition placement algorithm should do all of the following:
1. First, determine the clump size. The size of each clump must be at least the replication factor, but the clump size should probably be larger when possible. This further spreads the load of reassigned leader partitions when a node within the clump fails.
2. If the brokers include their rack IDs, choose each clump's members by lining up the racks and selecting a broker from each rack in round-robin fashion. Otherwise, simply sort brokers by ID and choose members from this set in round-robin fashion.
3. Once we have a set of clumps, assign partitions and their replicas.
4. Start with the first clump and the first partition. Assign each replica of that partition to successive members of the current clump.
5. Once finished with that assignment, shift the members of the current clump. This ensures that the next time we assign a partition to this clump, a different clump member will be the preferred replica.
6. Select the next clump and partition, then repeat steps 4-5 until all replicas have been assigned.
7. This algorithm should ensure that all topics are evenly spread across all brokers.
8. The algorithm should be deterministic: if we create a second topic with the same number of partitions and the same replication factor, its replicas will be laid out in the same way as the first's.
9. The hardest cases to optimize will be those where the multiples of brokers to racks to replicas don't work out evenly, e.g., 5 brokers, 3 racks, and a replication factor of 2.

Outstanding Questions

1. How do we determine the ideal clump size? I plan to run some simulations to see if I can figure out a heuristic, but if anyone has insights on this, please share them. Obviously, with a 3-node Kafka cluster and a replication factor of 3, the clump size must be 3. When the cluster size is 6 and the replication factor is 3, I don't have a good intuition for an arrangement better than 2 clumps of 3. Definitely interested in hearing ideas.
2. [~jkreps] I really like the idea of intelligently choosing a clump size larger than the replication factor in order to further spread leader load on failure, but I'd also like to better understand the scenarios where this is truly problematic. With brokers=3 and replication-factor=3, we don't have a choice; in larger clusters, we do. But if all clients are reading from the head of the log, the increased load doesn't seem too great, since all reads will be served from cache. Network saturation is possible but not too likely if the network is correctly provisioned for the workload. So the only case I can think of where this may be a big problem is where clients continually replay entire logs from beginning to end, thus frequently having to go to disk. Given that any failure implies a degraded cluster, I'm wondering, based on your experience, how critical the larger clump size is and whether the trade-offs it implies are worth it.
3. [~nehanarkhede] and [~gwenshap]: since there's already a patch for replica-aware placement, what do you think would be the best way to align this work with that patch?

More Ideas

1.
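The requirement steps above can be sketched in code. This is a hypothetical illustration only, not the attached Partitioner.scala: the function names `make_clumps` and `assign` are my own, and it assumes partitions are handed to clumps round-robin (step 6), with each clump rotated after every assignment so preferred replicas vary (step 5).

```python
from collections import defaultdict

def make_clumps(brokers, clump_size, racks=None):
    """Group brokers into clumps of clump_size (step 1).

    With rack IDs (step 2), interleave brokers across racks so each clump
    spans as many racks as possible; otherwise use sorted broker IDs.
    """
    brokers = list(brokers)
    if racks:
        by_rack = defaultdict(list)
        for b in sorted(brokers):
            by_rack[racks[b]].append(b)
        rack_lists = list(by_rack.values())
        ordered, i = [], 0
        # pick rack0[0], rack1[0], ..., then rack0[1], rack1[1], ...
        while len(ordered) < len(brokers):
            for lst in rack_lists:
                if i < len(lst):
                    ordered.append(lst[i])
            i += 1
    else:
        ordered = sorted(brokers)
    return [ordered[i:i + clump_size] for i in range(0, len(ordered), clump_size)]

def assign(partitions, replication, clumps):
    """Assign each partition's replicas to successive members of a clump
    (step 4), rotating the clump afterwards (step 5) so that the preferred
    replica differs for the next partition placed on the same clump."""
    assignment = {}
    for p in range(partitions):
        clump = clumps[p % len(clumps)]      # step 6: next clump, next partition
        assignment[p] = clump[:replication]  # first `replication` members
        clump.append(clump.pop(0))           # step 5: shift the clump
    return assignment
```

With 9 brokers, a clump size of 3, and 9 partitions at replication factor 3, this yields 3 clumps of 3, each clump hosting 3 partitions, and every broker serving as the preferred replica of exactly one partition, which matches the even-spread and determinism requirements (steps 7-8).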
If we can come up with a good algorithm, it will be really important to write documentation that helps users make choices with respect to the number of brokers, their rack placement, replication factor, etc., so that the partition placement algorithm can optimize for the best case.
2. It'd be really cool to have some sort of tool that could audit an overall configuration and detect anomalous cases (e.g., a six-node cluster with 5 nodes on the same rack) or cases that are hard to optimize for.

> Improve partition-broker assignment strategy for better availability in
> majority durability modes
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-1736
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1736
>             Project: Kafka
>          Issue Type: Improvement
>    Affects Versions: 0.8.1.1
>            Reporter: Kyle Banker
>            Priority: Minor
>         Attachments: Partitioner.scala
>
>
> The current random strategy of partition-to-broker distribution, combined
> with a fairly typical use of min.isr and request.acks, results in a
> suboptimal level of availability.
> Specifically, if all of your topics have a replication factor of 3, and you
> use min.isr=2 and required.acks=all, then regardless of the number of
> brokers in the cluster, you can safely lose only 1 node. Losing more than 1
> node will, 95% of the time, result in the inability to write to at least one
> partition, thus rendering the cluster unavailable. As the total number of
> partitions increases, so does this probability.
> On the other hand, if partitions are distributed so that brokers are
> effectively replicas of each other, then the probability of unavailability
> when two nodes are lost is significantly decreased. This probability
> continues to decrease as the size of the cluster increases and, more
> significantly, it is constant with respect to the total number of
> partitions.
> The only requirement for getting these numbers with this strategy is that
> the number of brokers be a multiple of the replication factor.
> Here are the results of some simulations I've run:
> With Random Partition Assignment
> (Number of Brokers / Number of Partitions / Replication Factor / Probability
> that two randomly selected nodes will contain at least 1 of the same
> partitions)
> 6 / 54 / 3 / .999
> 9 / 54 / 3 / .986
> 12 / 54 / 3 / .894
> With Broker-Replica-Style Partitioning
> (same columns as above)
> 6 / 54 / 3 / .424
> 9 / 54 / 3 / .228
> 12 / 54 / 3 / .168
> Adopting this strategy will greatly increase availability for users wanting
> majority-style durability and should not change current behavior as long as
> leader partitions are assigned evenly. I don't know of any negative impact
> for other use cases, as the distribution will still be effectively random.
> Let me know if you'd like to see the simulation code and whether a patch
> would be welcome.
> EDIT: Just to clarify, here's how the current partition assigner would
> assign 9 partitions with 3 replicas each to a 9-node cluster (partition
> number -> set of replicas):
> 0 = Some(List(2, 3, 4))
> 1 = Some(List(3, 4, 5))
> 2 = Some(List(4, 5, 6))
> 3 = Some(List(5, 6, 7))
> 4 = Some(List(6, 7, 8))
> 5 = Some(List(7, 8, 9))
> 6 = Some(List(8, 9, 1))
> 7 = Some(List(9, 1, 2))
> 8 = Some(List(1, 2, 3))
> Here's how I'm proposing they be assigned:
> 0 = Some(ArrayBuffer(8, 5, 2))
> 1 = Some(ArrayBuffer(8, 5, 2))
> 2 = Some(ArrayBuffer(8, 5, 2))
> 3 = Some(ArrayBuffer(7, 4, 1))
> 4 = Some(ArrayBuffer(7, 4, 1))
> 5 = Some(ArrayBuffer(7, 4, 1))
> 6 = Some(ArrayBuffer(6, 3, 0))
> 7 = Some(ArrayBuffer(6, 3, 0))
> 8 = Some(ArrayBuffer(6, 3, 0))
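A rough Monte Carlo sketch in the spirit of the simulation described above, comparing random placement against broker-replica-style placement. The function names and the exact clumping are my assumptions, not the author's simulation code, so the resulting probabilities will only approximate the tables, not match them exactly.

```python
import random
from itertools import combinations

def random_assignment(brokers, partitions, rf):
    """Each partition's replicas land on rf brokers chosen at random."""
    return {p: random.sample(range(brokers), rf) for p in range(partitions)}

def clump_assignment(brokers, partitions, rf):
    """Broker-replica style: brokers (a multiple of rf) split into clumps
    of exactly rf, and each partition's replicas all live on one clump."""
    clumps = [list(range(i, i + rf)) for i in range(0, brokers, rf)]
    return {p: clumps[p % len(clumps)] for p in range(partitions)}

def share_probability(assignment, brokers):
    """Fraction of broker pairs that co-host at least one partition,
    i.e. pairs whose simultaneous loss can make a partition unwritable."""
    hosts = {b: set() for b in range(brokers)}
    for p, replicas in assignment.items():
        for b in replicas:
            hosts[b].add(p)
    pairs = list(combinations(range(brokers), 2))
    overlapping = sum(1 for a, b in pairs if hosts[a] & hosts[b])
    return overlapping / len(pairs)
```

With 6 brokers, 54 partitions, and rf=3, the clumped layout overlaps only on pairs within the same clump (6 of the 15 pairs, i.e. 0.4), while random placement with that many partitions makes nearly every pair overlap, mirroring the gap shown in the tables.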