David Buckley created KAFKA-14150:
-------------------------------------

    Summary: Allocation of initial partitions is deterministic and produces a leader bias when a broker is offline
        Key: KAFKA-14150
        URL: https://issues.apache.org/jira/browse/KAFKA-14150
    Project: Kafka
 Issue Type: Improvement
   Reporter: David Buckley

Observation of our current cluster suggests that, with N brokers, the first N partitions are always allocated round-robin, starting from a random offset. The preferred leader is always the first broker in a given replica list, and hence is allocated round-robin too. Subsequent partitions are allocated using some shuffle of the broker list, again round-robin, which I think is fine and doesn't show the bias I detail below.

Suppose every topic has as many partitions as there are brokers and a replication factor of 3. Then every partition has a replica list of the form {{x, x+1, x+2}}, except where this wraps around.

Example with 3 brokers:

* Topic A: 3 partitions, replicas {{012}}, {{120}}, {{201}}, leaders 0, 1, 2
* Topic B: 3 partitions, replicas {{120}}, {{201}}, {{012}}, leaders 1, 2, 0
* Topic C: 3 partitions, replicas {{201}}, {{012}}, {{120}}, leaders 2, 0, 1

This means that if broker {{x}} goes down, every partition that had {{x}} as its preferred leader now elects {{x+1}} as its leader. With broker 1 offline, the leader allocation looks like:

* Topic A: 3 partitions, replicas {{02}}, {{20}}, {{20}}, leaders 0, 2, 2
* Topic B: 3 partitions, replicas {{20}}, {{20}}, {{02}}, leaders 2, 2, 0
* Topic C: 3 partitions, replicas {{20}}, {{02}}, {{20}}, leaders 2, 0, 2

We see that broker 2 becomes leader of 100% of the failed-over partitions, and now leads 2x as many partitions as broker 0.

If there were 6 brokers, replica sets {{02}}, {{23}} and {{50}} (originally {{012}}, {{123}} and {{501}}) would have reduced replication, and broker 4 would provide no redundancy for the partitions replicated on broker 1, in addition to broker 2 leading 2x as many partitions as any other broker. Brokers 0 and 2 are now more critical than brokers 3 and 5, which are in turn more critical than broker 4.

I'm unsure whether this has any undesirable side effects, but my expectation is that the behaviour isn't really intended, because subsequent partitions don't simply repeat the round-robin of the first batch.

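The skew described above can be reproduced with a small simulation. This is only a sketch of the pattern as observed in this report, not Kafka's actual assignment code: the function names ({{assign_round_robin}}, {{elect_leaders}}, {{leader_counts}}) are hypothetical, and it assumes the leader is simply the first live broker in each replica list.

```python
from collections import Counter

def assign_round_robin(num_brokers, replication_factor, start_offset):
    """Replica lists for the first num_brokers partitions of one topic.

    Simplified model of the pattern observed in this report (assumption):
    partition p gets brokers start_offset+p, +1, +2, ... modulo num_brokers.
    """
    return [
        [(start_offset + p + k) % num_brokers for k in range(replication_factor)]
        for p in range(num_brokers)
    ]

def elect_leaders(assignment, offline):
    """Assume the leader is the first live broker in each replica list."""
    leaders = []
    for replicas in assignment:
        live = [b for b in replicas if b not in offline]
        leaders.append(live[0] if live else None)
    return leaders

def leader_counts(num_brokers, replication_factor, offsets, offline=frozenset()):
    """Count led partitions per broker across one topic per start offset."""
    counts = Counter()
    for offset in offsets:
        assignment = assign_round_robin(num_brokers, replication_factor, offset)
        counts.update(elect_leaders(assignment, offline))
    return counts

# Topics A, B and C from the example correspond to start offsets 0, 1 and 2.
healthy = leader_counts(3, 3, offsets=[0, 1, 2])
degraded = leader_counts(3, 3, offsets=[0, 1, 2], offline={1})
print(dict(healthy))   # {0: 3, 1: 3, 2: 3}
print(dict(degraded))  # {0: 3, 2: 6}
```

Calling {{leader_counts(6, 3, [0], offline={1})}} models the 6-broker case for a single topic: broker 2 leads two partitions while every other live broker leads one.
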
Should the allocation of the initial partitions be completely random to avoid this bias, or is it inconsequential?