[ https://issues.apache.org/jira/browse/KAFKA-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189451#comment-14189451 ]

Jay Kreps commented on KAFKA-1736:
----------------------------------

[~kbanker]:
1. I do think this is a real problem and it would be nice to address, though 
there are some complexities.
2. Another aspect I didn't mention before is load balancing in the case of 
failure. The leader does a bit more work than the followers, since it serves 
all the consumer reads as well as the replication reads for the followers. 
Reads are cheaper than writes but not free. In the current setup, if a server 
fails, leadership is randomly dispersed and should spread fairly evenly across 
the cluster, so in a large cluster each server takes on only a very small 
percentage of additional load from the extra leadership responsibility. 
However, in the C=3 case you described, when a server fails, 50% of its 
leadership load will fall on each of the other two machines in that clump. 
This will be a nasty surprise, as you may well then run out of capacity in the 
rest of the clump. This argues for a largish clump size.
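
To make the failure arithmetic concrete, here is a tiny illustrative Scala 
calculation (the 1/(C-1) framing is my paraphrase of the point above, not 
code from any patch):

  // Extra leadership load each surviving replica-holder absorbs when one
  // broker fails, assuming the failed broker's leaders split evenly among
  // the brokers that host copies of its partitions.
  def extraLoadFraction(groupSize: Int): Double = 1.0 / (groupSize - 1)

  extraLoadFraction(3)   // C=3 clump: 0.5, a 50% jump for each clump-mate
  extraLoadFraction(10)  // C=10 clump: ~0.11, a much gentler bump

With fully dispersed leadership, groupSize is effectively the whole cluster, 
which is why larger clumps behave more like the current random dispersal.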
3. I agree with Gwen that asking people to specify C will likely be a bit 
confusing--99% of people will just take the default so really making the 
default good is the important thing. I was actually thinking we would just 
choose C for you in some reasonable way (e.g. just set it to 5 or 10, we should 
be able to analyze this problem far better than most people will). I was 
actually just using the clumping thing to show that you don't necessarily need 
exact replicas to get the benefit.
4. I chatted with Neha and we could make the strategy pluggable. This somewhat 
reduces risk, since we can then default to the current approach, but it also 
reduces the benefit, because ~0% of people will choose anything but the 
default. I do think this is a real problem, so if we are going to make a 
change here we should make it in the default case (or ideally just have one 
case). I actually don't think there is a real use case for choosing the 
current strategy if you had a well-tested version of the other approach. This 
does mean the bar for testing will be higher if you undertake this.
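
Purely to illustrate what "pluggable" could look like, here is a hypothetical 
interface sketch (the trait and names are mine, not from any existing patch):

  // Hypothetical interface; names are illustrative only.
  trait ReplicaAssignmentStrategy {
    // For each partition id, return the ordered broker list; the first
    // entry is the preferred replica (initial leader).
    def assign(brokerIds: Seq[Int],
               numPartitions: Int,
               replicationFactor: Int): Map[Int, Seq[Int]]
  }

The default implementation would keep today's behavior, and the clumped 
approach would just be another implementation of the same trait.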
5. Once we include rack awareness, the algorithm will be a bit complex, as we 
will first need to choose clumps in a way that maximally spans racks (which I 
think is the point of the rack-aware patch). Then you assign a partition by 
first choosing a clump and then choosing a starting position within that 
clump. The ordering of the replica list is important, as Neha points out, 
since we use the first entry as the "preferred replica" and attempt to keep 
leadership there when that node is available. You don't want to end up in a 
scenario where you have a clump of nodes (1,2,3) sharing all the same 
partitions but all leadership is on one node. One approach would be to 
randomize the starting position; however, what the current code does is 
actually a little better: it explicitly goes round-robin in its assignment, 
which is slightly more fair than random. That places one partition; you would 
then round-robin over clumps to place the additional partitions. Spreading a 
topic over multiple clumps is important because, in practice, we have observed 
a power-law distribution in topic size, and you really don't want to place a 
bunch of partitions from a really high-volume topic all on the same machine 
(you want to spread them as widely as possible).
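
A rough sketch of the placement half of this idea (it assumes the clumps have 
already been formed, ignores the rack-aware clump selection step, and all 
names are mine):

  // Sketch only: round-robin over clumps, rotating the starting position
  // within each clump so the preferred replica (first entry) is not pinned
  // to a single node.
  def assignWithClumps(clumps: Seq[Seq[Int]],
                       numPartitions: Int,
                       replicationFactor: Int): Map[Int, Seq[Int]] =
    (0 until numPartitions).map { p =>
      val clump = clumps(p % clumps.size)          // spread topic over clumps
      val start = (p / clumps.size) % clump.size   // rotate leadership
      p -> (0 until replicationFactor).map(i => clump((start + i) % clump.size))
    }.toMap

For clumps (1,2,3), (4,5,6), (7,8,9) and 9 partitions, the first entries cycle 
through every member of every clump, avoiding the all-leadership-on-one-node 
scenario described above.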
6. [~gwenshap] this would totally just be a change in the replica placement 
algorithm, nothing major in Kafka (unlike the rack placement patch, I don't 
think we need any additional metadata for this), so although the code is 
tricky it should be very contained, I think.

> Improve partition-broker assignment strategy for better availability in 
> majority durability modes
> -----------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-1736
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1736
>             Project: Kafka
>          Issue Type: Improvement
>    Affects Versions: 0.8.1.1
>            Reporter: Kyle Banker
>            Priority: Minor
>         Attachments: Partitioner.scala
>
>
> The current random strategy of partition-to-broker distribution, combined 
> with a fairly typical use of min.isr and request.acks, results in a 
> suboptimal level of availability.
> Specifically, if all of your topics have a replication factor of 3, and you 
> use min.isr=2 and required.acks=all, then regardless of the number of 
> brokers in the cluster, you can safely lose only 1 node. Losing more than 1 
> node will, 95% of the time, result in the inability to write to at least one 
> partition, thus rendering the cluster unavailable. As the total number of 
> partitions increases, so does this probability.
> On the other hand, if partitions are distributed so that brokers are 
> effectively replicas of each other, then the probability of unavailability 
> when two nodes are lost is significantly decreased. This probability 
> continues to decrease as the size of the cluster increases and, more 
> significantly, this probability is constant with respect to the total number 
> of partitions. The only requirement for getting these numbers with this 
> strategy is that the number of brokers be a multiple of the replication 
> factor.
> Here are the results of some simulations I've run:
> With Random Partition Assignment
> Number of Brokers / Number of Partitions / Replication Factor / Probability 
> that two randomly selected nodes share at least one partition
> 6  / 54 / 3 / .999
> 9  / 54 / 3 / .986
> 12 / 54 / 3 / .894
> With Broker-Replica-Style Partitioning
> Number of Brokers / Number of Partitions / Replication Factor / Probability 
> that two randomly selected nodes share at least one partition
> 6  / 54 / 3 / .424
> 9  / 54 / 3 / .228
> 12 / 54 / 3 / .168
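> For reference, a minimal Monte Carlo sketch of this kind of simulation (an 
> illustration with names of my choosing, not the attached Partitioner.scala):
>
>   import scala.util.Random
>
>   // Estimate P(two distinct random brokers co-host at least one partition)
>   // for a given partition -> replica-list assignment.
>   def pShared(assignment: Map[Int, Seq[Int]], numBrokers: Int,
>               trials: Int = 100000): Double = {
>     val rng = new Random(0)
>     val hits = (1 to trials).count { _ =>
>       val a = rng.nextInt(numBrokers)
>       val b = (a + 1 + rng.nextInt(numBrokers - 1)) % numBrokers // b != a
>       assignment.values.exists(r => r.contains(a) && r.contains(b))
>     }
>     hits.toDouble / trials
>   }
>
>   // Random placement for comparison: each partition picks rf distinct brokers.
>   def randomAssignment(numBrokers: Int, numPartitions: Int, rf: Int,
>                        rng: Random): Map[Int, Seq[Int]] =
>     (0 until numPartitions)
>       .map(p => p -> rng.shuffle((0 until numBrokers).toList).take(rf))
>       .toMap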
> Adopting this strategy will greatly increase availability for users wanting 
> majority-style durability and should not change current behavior as long as 
> leader partitions are assigned evenly. I don't know of any negative impact 
> for other use cases, as in these cases, the distribution will still be 
> effectively random.
> Let me know if you'd like to see simulation code and whether a patch would be 
> welcome.
> EDIT: Just to clarify, here's how the current partition assigner would 
> assign 9 partitions with 3 replicas each to a 9-node cluster (partition 
> number -> replica list).
> 0 = Some(List(2, 3, 4))
> 1 = Some(List(3, 4, 5))
> 2 = Some(List(4, 5, 6))
> 3 = Some(List(5, 6, 7))
> 4 = Some(List(6, 7, 8))
> 5 = Some(List(7, 8, 9))
> 6 = Some(List(8, 9, 1))
> 7 = Some(List(9, 1, 2))
> 8 = Some(List(1, 2, 3))
> Here's how I'm proposing they be assigned:
> 0 = Some(ArrayBuffer(8, 5, 2))
> 1 = Some(ArrayBuffer(8, 5, 2))
> 2 = Some(ArrayBuffer(8, 5, 2))
> 3 = Some(ArrayBuffer(7, 4, 1))
> 4 = Some(ArrayBuffer(7, 4, 1))
> 5 = Some(ArrayBuffer(7, 4, 1))
> 6 = Some(ArrayBuffer(6, 3, 0))
> 7 = Some(ArrayBuffer(6, 3, 0))
> 8 = Some(ArrayBuffer(6, 3, 0))


