David Buckley created KAFKA-14150:
-------------------------------------

             Summary: Allocation of initial partitions is deterministic and 
produces a leader bias when a broker is offline
                 Key: KAFKA-14150
                 URL: https://issues.apache.org/jira/browse/KAFKA-14150
             Project: Kafka
          Issue Type: Improvement
            Reporter: David Buckley


Observation of our current cluster suggests that with N brokers, the first N 
partitions are always allocated in a round-robin fashion with a random offset. 
The preferred leader is always the first in a given replica list (and hence is 
allocated round-robin, too). Subsequent partitions are allocated using some 
shuffle on the list, again in a round-robin, which I think is fine and doesn't 
show the bias I detail below. Suppose every topic has as many partitions as 
there are brokers and a replication factor of 3. Then every partition has 
replicas {{k, k+1, k+2}} for some broker {{k}}, except where this wraps. Example:

* Topic A: 3 partitions, replicas {{012}}, {{120}}, {{201}}, leaders 0, 1, 2
* Topic B: 3 partitions, replicas {{120}}, {{201}}, {{012}}, leaders 1, 2, 0
* Topic C: 3 partitions, replicas {{201}}, {{012}}, {{120}}, leaders 2, 0, 1
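
The layout above can be reproduced with a minimal sketch. It assumes that partition {{p}} gets replicas {{(start + p + j) mod N}} for {{j}} in 0..RF-1, which matches the observed pattern but is not taken from Kafka's source; {{assign_replicas}} and {{start}} are hypothetical names used only for illustration.

```python
def assign_replicas(num_brokers, num_partitions, replication_factor, start):
    """Round-robin replica lists; the first entry is the preferred leader.

    Assumption: this models the observed layout only, not Kafka's actual code.
    """
    return [
        [(start + p + j) % num_brokers for j in range(replication_factor)]
        for p in range(num_partitions)
    ]


# Topics A, B and C from the example correspond to offsets 0, 1 and 2.
for name, start in zip("ABC", range(3)):
    replicas = assign_replicas(3, 3, 3, start)
    leaders = [r[0] for r in replicas]
    print(name, replicas, leaders)
```

Only the random offset differs between topics, so the successor of any given broker in a replica list is always the same broker.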

This means that if broker {{x}} goes down, every partition that had {{x}} as 
its preferred leader now elects {{x+1}} as its leader. For example, if broker 1 
were offline, the allocation would look like:

* Topic A: 3 partitions, replicas {{02}}, {{20}}, {{20}}, leaders 0, 2, 2
* Topic B: 3 partitions, replicas {{20}}, {{20}}, {{02}}, leaders 2, 2, 0
* Topic C: 3 partitions, replicas {{20}}, {{02}}, {{20}}, leaders 2, 0, 2

We see that broker 2 becomes leader of 100% of the failed-over partitions, and 
is now leader of 2x as many partitions as broker 0.
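
The count can be checked with a small simulation. This is a sketch that assumes the new leader is simply the first replica in the list that is still online (ignoring ISR state); {{leaders_after_failure}} is a hypothetical helper, not a Kafka API.

```python
from collections import Counter


def leaders_after_failure(replica_lists, offline):
    """Leader per partition: the first replica in the list that is still online."""
    leaders = []
    for replicas in replica_lists:
        online = [b for b in replicas if b not in offline]
        leaders.append(online[0] if online else None)
    return leaders


# Replica lists for topics A, B and C from the example above.
topics = {
    "A": [[0, 1, 2], [1, 2, 0], [2, 0, 1]],
    "B": [[1, 2, 0], [2, 0, 1], [0, 1, 2]],
    "C": [[2, 0, 1], [0, 1, 2], [1, 2, 0]],
}

counts = Counter()
for replica_lists in topics.values():
    counts.update(leaders_after_failure(replica_lists, offline={1}))

# Broker 0 leads 3 partitions and broker 2 leads 6.
print(dict(counts))
```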

If there were 6 brokers and broker 1 went offline, we'd see that replica sets 
{{02}}, {{23}} and {{50}} would have reduced replication (and broker 4 isn't 
providing any redundancy for partitions replicated on broker 1) in addition to 
broker 2 leading 2x as many partitions as any other broker. Brokers 0 and 2 are 
now more critical than 3 and 5, which are in turn more critical than broker 4.
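
The 6-broker case can be sketched the same way. The code assumes one topic with 6 partitions, replication factor 3, offset 0 and broker 1 offline; {{extra_load}} is an illustrative name for how many under-replicated partitions each surviving broker holds.

```python
from collections import Counter

NUM_BROKERS, RF, OFFLINE = 6, 3, 1

# Assumed round-robin layout: partition p has replicas p, p+1, p+2 (mod 6).
replica_lists = [
    [(p + j) % NUM_BROKERS for j in range(RF)] for p in range(NUM_BROKERS)
]

leader_counts = Counter()
extra_load = Counter()  # surviving replicas of under-replicated partitions
reduced = []
for replicas in replica_lists:
    online = [b for b in replicas if b != OFFLINE]
    leader_counts[online[0]] += 1  # first online replica becomes leader
    if len(online) < RF:
        reduced.append(online)
        extra_load.update(online)

print(reduced)              # replica sets left with only two members
print(dict(leader_counts))  # partitions led per broker
print(dict(extra_load))     # under-replicated partitions held per broker
```

Under these assumptions brokers 0 and 2 each hold two under-replicated partitions, brokers 3 and 5 hold one each, and broker 4 holds none.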

I'm unclear whether there are any undesirable side-effects of this, but my 
expectation is that the behaviour isn't really intended, because subsequent 
partitions don't just replicate the round-robin of the first batch. Should the 
allocation of the initial partitions be completely random to avoid this bias, 
or is it inconsequential?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
