[ 
https://issues.apache.org/jira/browse/KAFKA-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191318#comment-14191318
 ] 

Kyle Banker commented on KAFKA-1736:
------------------------------------

Thanks for all your notes on this, [~jkreps], [~nehanarkhede], and [~gwenshap]. 
It sounds like there's general agreement that, if we can implement a 
good general solution, we should. I'm going to try to summarize the key 
requirements as I currently understand them and any open questions based upon 
what everyone has written.

Requirements

Each time a topic is created, the partition placement algorithm should do all 
of the following:

1. First, determine the clump size. Each clump must be at least as large as the 
replication factor, but it should probably be larger when possible, since a 
larger clump further spreads the load of reassigned leader partitions when a 
node within the clump fails.
2. If the brokers include their rack IDs, then choose each clump's members by 
lining up the racks and then selecting a broker from each rack in a round-robin 
fashion. Otherwise, simply sort brokers by ID, and choose members from this set 
in a round-robin fashion.
3. Once we have a set of clumps, it's time to assign partitions and their 
replicas.
4. Start with the first clump and the first partition. Assign each replica of 
that partition to successive members of the current clump.
5. Once finished with the first assignment, shift the members of the current 
clump. This will ensure that the next time we assign a partition to this clump, 
a different clump member will be the preferred replica.
6. Select the next clump and partition, then repeat steps 4-5 until all 
replicas have been assigned (a rough sketch of this loop follows this list).
7. This algorithm should ensure that all topics are evenly spread across all 
brokers.
8. The algorithm should be deterministic. If we create a second topic with the 
same number of partitions and the same replication factor, its replicas will be 
laid out in the same way as the first topic's.
9. The hardest cases to optimize will be those where the broker, rack, and 
replica counts don't divide evenly, e.g., 5 brokers, 3 racks, and a replication 
factor of 2.
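
To make steps 1-6 concrete, here is a rough, illustrative Scala sketch. It 
assumes integer broker IDs, takes the clump size to be exactly the replication 
factor, and skips the rack-aware clump selection of step 2; the ClumpAssigner 
name and the exact layout it produces are only for illustration and may differ 
from the attached Partitioner.scala.

import scala.collection.mutable

// Rough sketch of the clump-based assignment in steps 1-6 above.
// Clump size is taken to be the replication factor and rack awareness
// is omitted for brevity.
object ClumpAssigner {
  def assign(brokerIds: Seq[Int],
             numPartitions: Int,
             replicationFactor: Int): Map[Int, Seq[Int]] = {
    require(brokerIds.size >= replicationFactor, "not enough brokers")
    val clumpSize = replicationFactor                      // step 1 (simplified)
    val clumps: Vector[mutable.Buffer[Int]] =              // step 2, no rack IDs
      brokerIds.sorted.grouped(clumpSize)
        .filter(_.size == clumpSize).map(_.toBuffer).toVector

    (0 until numPartitions).map { partition =>
      val clump = clumps(partition % clumps.size)          // step 6: next clump
      val replicas = clump.take(replicationFactor).toList  // step 4: successive members
      val head = clump.remove(0)                           // step 5: rotate so the next
      clump.append(head)                                   // partition gets a new leader
      partition -> replicas
    }.toMap
  }
}

With broker IDs 1-9, 9 partitions, and a replication factor of 3, this yields 
three clumps of three brokers each, and the preferred (first) replica rotates 
within each clump across the successive partitions assigned to it.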

Outstanding Questions

1. How do we determine the ideal clump size? I plan to run some simulations to 
see if I can figure out some heuristic, but if anyone has insights on this, 
please share them. Obviously, with a 3-node Kafka cluster and a replication 
factor of 3, the clump size must be 3. When cluster size is 6 and the 
replication factor is 3, I don't have a good intuition for an arrangement 
better than 2 clumps of 3. Definitely interested in hearing ideas (a rough 
sketch of such a simulation follows this list).
2. [~jkreps] I really like the idea of coming up with a way to intelligently 
choose a clump size larger than replication factor in order to further spread 
leader load on failure, but I'd also like to better understand the scenarios 
where this is truly problematic. With brokers=3 and replication-factor=3, we 
don't have a choice. In larger clusters, we do. But if all clients are reading 
from the head of the log, then the increased load doesn't seem too great since 
all reads will be from cache. The possibility of network saturation is there 
but not too likely if the network is correctly provisioned for the workload. So 
the only case I can think of where this may be a big problem is where clients 
continually replay entire logs from beginning to end, thus frequently having to 
go to disk, etc. Given that any failure will imply a degraded cluster, I'm just 
wondering, based on your experience, how critical the larger clump size is and 
whether the trade-offs it implies are worth it.
3. [~nehanarkhede] and [~gwenshap]: since there's already a patch for replica 
aware placement, what do you think would be the best way to align this work 
with that patch?
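
Regarding the simulation in question 1: one way to compare clump sizes would be 
to generate an assignment for each candidate size and estimate, for each, the 
probability that two randomly failed brokers share a partition (the same metric 
used in this ticket's description). A rough Monte Carlo sketch of that estimate 
follows; the AvailabilitySim name and argument shapes are made up for 
illustration.

import scala.util.Random

// Estimate the probability that two randomly failed brokers host replicas
// of the same partition, given a map of partition -> replica broker IDs.
// With replication factor 3, min.isr=2 and acks=all, such a failure blocks
// writes to that partition.
object AvailabilitySim {
  def pSharedPartition(assignment: Map[Int, Seq[Int]],
                       brokerIds: Seq[Int],
                       trials: Int = 100000,
                       rng: Random = new Random(0)): Double = {
    val hits = (1 to trials).count { _ =>
      val failed = rng.shuffle(brokerIds.toVector).take(2).toSet
      assignment.values.exists(_.count(failed.contains) >= 2)
    }
    hits.toDouble / trials
  }
}

Comparing this number across assignments generated with different clump sizes 
(and cluster/rack shapes) would give a rough empirical basis for a clump-size 
heuristic.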

More Ideas

1. If we can come up with a good algorithm, it will be really important to 
write documentation that helps users make choices w/r/t number of brokers, 
their rack placement, replication factor, etc., that allow the partition 
placement algorithm to optimize for the best case.
2. It'd be really cool to have some sort of tool that could audit an overall 
configuration and possibly detect anomalous cases (e.g., a six node cluster 
with 5 nodes on the same rack) or cases that are hard to optimize for.
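
As a strawman for idea 2, the audit could start as a couple of sanity checks 
over broker and rack metadata. The BrokerInfo shape below is hypothetical and 
only meant to illustrate the kind of checks involved.

// Hypothetical audit sketch for idea 2: flag configurations that are hard to
// optimize, such as broker counts that don't divide evenly into clumps, or a
// rack holding so many brokers that replicas cannot be spread across racks.
case class BrokerInfo(id: Int, rack: Option[String])

object ClusterAudit {
  def warnings(brokers: Seq[BrokerInfo], replicationFactor: Int): Seq[String] = {
    var findings = Vector.empty[String]
    if (brokers.size % replicationFactor != 0)
      findings :+= s"${brokers.size} brokers is not a multiple of replication " +
        s"factor $replicationFactor; clumps will be uneven."
    val racks = brokers.flatMap(_.rack)
    if (racks.nonEmpty) {
      val perRack = racks.groupBy(identity).map { case (r, bs) => r -> bs.size }
      val (rack, count) = perRack.maxBy(_._2)
      // If fewer than replicationFactor - 1 brokers sit outside the biggest
      // rack, some replica set must place two or more replicas on that rack.
      if (brokers.size - count < replicationFactor - 1)
        findings :+= s"Rack $rack holds $count of ${brokers.size} brokers; " +
          "replicas cannot always be spread across racks."
    }
    findings
  }
}

For the six-node, five-on-one-rack example above with a replication factor of 
3, this produces the rack warning.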

> Improve partition-broker assignment strategy for better availability in 
> majority durability modes
> -----------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-1736
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1736
>             Project: Kafka
>          Issue Type: Improvement
>    Affects Versions: 0.8.1.1
>            Reporter: Kyle Banker
>            Priority: Minor
>         Attachments: Partitioner.scala
>
>
> The current random strategy of partition-to-broker distribution combined with 
> a fairly typical use of min.isr and request.acks results in a suboptimal 
> level of availability.
> Specifically, if all of your topics have a replication factor of 3, and you 
> use min.isr=2 and required.acks=all, then regardless of the number of 
> brokers in the cluster, you can safely lose only 1 node. Losing more than 1 
> node will, 95% of the time, result in the inability to write to at least one 
> partition, thus rendering the cluster unavailable. As the total number of 
> partitions increases, so does this probability.
> On the other hand, if partitions are distributed so that brokers are 
> effectively replicas of each other, then the probability of unavailability 
> when two nodes are lost is significantly decreased. This probability 
> continues to decrease as the size of the cluster increases and, more 
> significantly, this probability is constant with respect to the total number 
> of partitions. The only requirement for getting these numbers with this 
> strategy is that the number of brokers be a multiple of the replication 
> factor.
> Here are the results of some simulations I've run:
> With Random Partition Assignment
> Number of Brokers / Number of Partitions / Replication Factor / Probability 
> that two randomly selected nodes will contain at least 1 of the same 
> partitions
> 6  / 54 / 3 / .999
> 9  / 54 / 3 / .986
> 12 / 54 / 3 / .894
> Broker-Replica-Style Partitioning
> Number of Brokers / Number of Partitions / Replication Factor / Probability 
> that two randomly selected nodes will contain at least 1 of the same 
> partitions
> 6  / 54 / 3 / .424
> 9  / 54 / 3 / .228
> 12 / 54 / 3 / .168
> Adopting this strategy will greatly increase availability for users wanting 
> majority-style durability and should not change current behavior as long as 
> leader partitions are assigned evenly. I don't know of any negative impact 
> for other use cases, as in these cases, the distribution will still be 
> effectively random.
> Let me know if you'd like to see simulation code and whether a patch would be 
> welcome.
> EDIT: Just to clarify, here's how the current partition assigner would assign 
> 9 partitions with 3 replicas each to a 9-node cluster (partition number -> set 
> of replicas).
> 0 = Some(List(2, 3, 4))
> 1 = Some(List(3, 4, 5))
> 2 = Some(List(4, 5, 6))
> 3 = Some(List(5, 6, 7))
> 4 = Some(List(6, 7, 8))
> 5 = Some(List(7, 8, 9))
> 6 = Some(List(8, 9, 1))
> 7 = Some(List(9, 1, 2))
> 8 = Some(List(1, 2, 3))
> Here's how I'm proposing they be assigned:
> 0 = Some(ArrayBuffer(8, 5, 2))
> 1 = Some(ArrayBuffer(8, 5, 2))
> 2 = Some(ArrayBuffer(8, 5, 2))
> 3 = Some(ArrayBuffer(7, 4, 1))
> 4 = Some(ArrayBuffer(7, 4, 1))
> 5 = Some(ArrayBuffer(7, 4, 1))
> 6 = Some(ArrayBuffer(6, 3, 0))
> 7 = Some(ArrayBuffer(6, 3, 0))
> 8 = Some(ArrayBuffer(6, 3, 0))



