Tom Crayford created KAFKA-4084:
-----------------------------------

             Summary: automated leader rebalance causes replication downtime 
for clusters with too many partitions
                 Key: KAFKA-4084
                 URL: https://issues.apache.org/jira/browse/KAFKA-4084
             Project: Kafka
          Issue Type: Bug
          Components: controller
    Affects Versions: 0.10.0.1, 0.10.0.0, 0.9.0.1, 0.8.2.2, 0.9.0.0
            Reporter: Tom Crayford
             Fix For: 0.10.1.0


If {{auto.leader.rebalance.enable}} is enabled (which it is by default) on a 
cluster with many partitions, replication stalls for an extended period 
following a broker restart. This causes {{UnderReplicatedPartitions}} alerts to 
fire while replication is paused.

This is because the current automated leader rebalance mechanism changes 
leaders for *all* imbalanced partitions at once, instead of doing it gradually. 
This effectively stops all replica fetchers in the cluster (assuming there are 
enough imbalanced partitions), and restarts them. This can take minutes on busy 
clusters, during which no replication is happening and user data is at risk. 
Clients with {{acks=-1}} also see issues at this time, because replication is 
effectively stalled.

To quote Todd Palino from the mailing list:


bq. There is an admin CLI command to trigger the preferred replica election 
manually. There is also a broker configuration “auto.leader.rebalance.enable” 
which you can set to have the broker automatically perform the PLE when needed. 
DO NOT USE THIS OPTION. There are serious performance issues when doing so, 
especially on larger clusters. It needs some development work that has not been 
fully identified yet.

This setting is extremely useful for smaller clusters, but on clusters with 
high partition counts it causes the severe issues described above.

One potential fix would be to add a new configuration that caps the number of 
partitions undergoing automated leader rebalancing at once, and to *stop* 
starting new leader rebalances once that many are in flight, until they 
complete. There may be better mechanisms, and I'd love to hear if anybody has 
any ideas.
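
As a rough illustration of the capped approach described above, here is a 
minimal Python sketch. It is not Kafka controller code: the class 
{{ThrottledLeaderRebalancer}}, the {{max_in_flight}} parameter, and the 
{{start_election}} callback are all hypothetical names chosen for this 
example, and no such configuration exists in the affected versions. It only 
shows the scheduling idea: start at most {{max_in_flight}} leader changes, and 
start another only when one completes.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Set


class ThrottledLeaderRebalancer:
    """Hypothetical sketch: cap the number of in-flight leader rebalances."""

    def __init__(
        self,
        imbalanced: Iterable[str],
        max_in_flight: int,
        start_election: Callable[[str], None],
    ) -> None:
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        # Partitions whose current leader is not the preferred replica.
        self._pending: Deque[str] = deque(imbalanced)
        # Partitions whose leader change has started but not yet finished.
        self._in_flight: Set[str] = set()
        self._max_in_flight = max_in_flight
        # Assumed callback that kicks off one preferred-leader election.
        self._start_election = start_election

    def schedule(self) -> None:
        """Start elections until the cap is reached or nothing is pending."""
        while self._pending and len(self._in_flight) < self._max_in_flight:
            partition = self._pending.popleft()
            self._in_flight.add(partition)
            self._start_election(partition)

    def on_election_complete(self, partition: str) -> None:
        """Free the slot held by a finished election, then refill it."""
        self._in_flight.discard(partition)
        self.schedule()

    @property
    def in_flight(self) -> Set[str]:
        """Return a copy of the partitions currently being rebalanced."""
        return set(self._in_flight)

    @property
    def done(self) -> bool:
        """True once every imbalanced partition has been processed."""
        return not self._pending and not self._in_flight
```

With a cap of 2 and five imbalanced partitions, the first call to 
{{schedule()}} starts only two elections; the remaining three start one at a 
time as earlier ones report completion, so most replica fetchers keep running 
throughout.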



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
