[ https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17042297#comment-17042297 ]
James Cheng commented on KAFKA-4084:
------------------------------------

Hi. Just wanted to chime in with some of our experience on this front. We've also encountered issues like this. We run our brokers with roughly 10k partitions per broker, and we have auto.leader.rebalance.enable=true. It sometimes happens to us that a normal restart of a broker will put our cluster into a bad state for 1-2 hours. During that time, what we've seen is:
* Auto leader rebalance triggers and moves a large number of leaders to/from the various brokers.
* The broker that is receiving the leaders takes a long time to assume leadership. If you do a metadata request to it, it will say it is not the leader for those partitions. (A small AdminClient sketch of this metadata check is a bit further down in this comment.)
* The brokers that are giving up leadership give it up quickly. If you do a metadata request to them, they will also say that they are not the leader for those partitions.
* Clients receive notification that leadership has transitioned. But since clients get their metadata from an arbitrary broker in the cluster, they will receive different metadata depending on which broker they contact. Regardless, they will trust that metadata and attempt to fetch from the broker they believe is the leader. But at this point, all brokers are claiming that they are not the leader for these partitions. So clients attempt a fetch, get an error, refetch metadata, and try again. You get a thundering herd problem, where all clients are hammering the cluster.
* Broker request handlers are busy attempting to assume leadership and do replication, as well as answering and rejecting all of these requests.
* During this time, each broker reports UnderReplicatedPartitions=0. The reason is that a broker only reports a partition as under-replicated if it believes it is the leader for that partition and some of the followers are not caught up. In this case, the brokers do not believe they are leaders for these partitions, so they do not report them as under-replicated. I believe OfflinePartitions=0 as well. But from a user's point of view, these partitions are effectively inaccessible, because no one will serve traffic for them.

The metrics we've seen during this are:
* ControllerQueueSize goes to its maximum
* RequestHandlerIdlePercent drops to 0%
* (maybe) RequestQueueSize goes to its maximum
* TotalTimeMs for Produce requests goes into the 20000ms range
* TotalTimeMs for FetchFollower goes into the 1000ms range (which is odd, because the default fetch wait is 500ms)
* TotalTimeMs for FetchConsumer goes into the 1000ms range (also odd, for the same reason)

We've also seen similar behavior when we replace the hard drive on a broker and it has to re-replicate its entire contents. We don't definitively understand that one yet.
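In case it helps anyone confirm this state themselves, here is a minimal sketch of the metadata check described above, using the Java AdminClient. The broker address and topic name are placeholders, and the client may route the request to another broker it discovers from the bootstrap server, so point it at the broker you want to interrogate and treat the answer as approximate:
{code:java}
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class LeaderCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder: point this at the broker whose view of the metadata you want to sample.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1.example.com:9092");

        try (Admin admin = Admin.create(props)) {
            // Ordinary metadata request for one topic ("my-topic" is a placeholder).
            Map<String, TopicDescription> topics =
                admin.describeTopics(Collections.singletonList("my-topic")).all().get();

            for (TopicPartitionInfo p : topics.get("my-topic").partitions()) {
                // leader() is whoever this metadata response claims is the leader. During the
                // bad window described above, different brokers give conflicting answers.
                System.out.printf("partition %d leader=%s isr=%s%n",
                    p.partition(), p.leader(), p.isr());
            }
        }
    }
}
{code}
Running this against the broker that is receiving leadership, and then against one of the others, is how you would see the conflicting answers during the bad window.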
We've worked with Confluent on this. The advice from Confluent seems to be that our brokers are simply underpowered for our situation. From the metrics point of view, the evidence that points to this is:
* Normal host CPU utilization is 45-60%
* During these times, CPU utilization rises to 60-70%
* High ResponseSendTimeMs during these times
* NetworkProcessorIdlePercent drops to 0%

From our side, we're not sure *what* specifically is the thing that we're underpowered *for*. Is it the number of partitions? The number of clients plus the number of retries? The amount of network data? All The Things? We're not sure.

One idea on how to mitigate it is to set
{code:java}
auto.leader.rebalance.enable=false
{code}
and then use kafka-reassign-partitions.sh to move leaders just a few at a time, until leadership is rebalanced. (There is a rough AdminClient sketch of this batched approach right after the cluster specs below.)

You can do something similar when re-populating a broker from scratch. You can remove the (empty) broker from all partitions, and then use kafka-reassign-partitions.sh to add the broker as a follower to a small number of partitions at a time. That lets you control how many partitions move at a time. kafka-reassign-partitions.sh also lets you specify a throttle, so you can limit the replication bandwidth as well. So at that point, you are in control of how *many* partitions are moving, as well as how much *network* they use. And lastly, you can control not just when replicas become followers, but also when they become leaders, and how quickly. (A sketch of driving this from the AdminClient is at the very bottom of this message.)

All of this is unfortunately more operational work, but it at least puts you in control of things.

Our cluster specs are:
* 4 vCPU, 32 GiB memory, st1-type EBS volumes
* 10k partitions per broker
* unknown number of clients
* low network bytes in/out (most partitions do not receive active traffic)
* num.network.threads=8, num.io.threads=16
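For what it's worth, here is a rough sketch of the "move leaders a few at a time" idea, using the AdminClient's preferred-leader election API (requires Kafka 2.4+ clients and brokers). The partition list, batch size, and pause are placeholders you would tune for your own cluster; this is meant to illustrate the approach, not to be a drop-in tool:
{code:java}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.ElectionType;
import org.apache.kafka.common.TopicPartition;

public class GradualPreferredElection {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1.example.com:9092");

        // Placeholder: the partitions whose preferred leader is the broker that just came
        // back. In practice you would build this list from a describeTopics() call first.
        List<TopicPartition> toRebalance = new ArrayList<>();
        toRebalance.add(new TopicPartition("my-topic", 0));
        toRebalance.add(new TopicPartition("my-topic", 1));

        int batchSize = 50;      // leaders to move per round (placeholder, tune for your cluster)
        long pauseMs = 30_000L;  // breathing room between rounds (placeholder)

        try (Admin admin = Admin.create(props)) {
            for (int i = 0; i < toRebalance.size(); i += batchSize) {
                Set<TopicPartition> batch = new HashSet<>(
                    toRebalance.subList(i, Math.min(i + batchSize, toRebalance.size())));

                // Trigger a preferred replica election for just this batch, instead of letting
                // auto.leader.rebalance.enable move every imbalanced partition at once.
                Map<TopicPartition, Optional<Throwable>> results =
                    admin.electLeaders(ElectionType.PREFERRED, batch).partitions().get();

                // Partitions that are already on their preferred leader come back with an
                // "election not needed" error; anything else is worth a look.
                results.forEach((tp, err) ->
                    err.ifPresent(e -> System.out.println(tp + ": " + e)));

                Thread.sleep(pauseMs);
            }
        }
    }
}
{code}
The point is just that each election covers a bounded batch and the controller gets to finish it before the next batch starts, instead of every imbalanced partition moving at once.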
> automated leader rebalance causes replication downtime for clusters with too many partitions
> ---------------------------------------------------------------------------------------------
>
>              Key: KAFKA-4084
>              URL: https://issues.apache.org/jira/browse/KAFKA-4084
>          Project: Kafka
>       Issue Type: Bug
>       Components: controller
> Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
>         Reporter: Tom Crayford
>         Priority: Major
>           Labels: reliability
>          Fix For: 1.1.0
>
> If you enable {{auto.leader.rebalance.enable}} (which is on by default), and you have a cluster with many partitions, there is a severe amount of replication downtime following a restart. This causes `UnderReplicatedPartitions` to fire, and replication is paused.
> This is because the current automated leader rebalance mechanism changes leaders for *all* imbalanced partitions at once, instead of doing it gradually. This effectively stops all replica fetchers in the cluster (assuming there are enough imbalanced partitions), and restarts them. This can take minutes on busy clusters, during which no replication is happening and user data is at risk. Clients with {{acks=-1}} also see issues at this time, because replication is effectively stalled.
> To quote Todd Palino from the mailing list:
> bq. There is an admin CLI command to trigger the preferred replica election manually. There is also a broker configuration “auto.leader.rebalance.enable” which you can set to have the broker automatically perform the PLE when needed. DO NOT USE THIS OPTION. There are serious performance issues when doing so, especially on larger clusters. It needs some development work that has not been fully identified yet.
> This setting is extremely useful for smaller clusters, but with high partition counts causes the huge issues stated above.
> One potential fix could be adding a new configuration for the number of partitions to do automated leader rebalancing for at once, and *stop* once that number of leader rebalances are in flight, until they're done. There may be better mechanisms, and I'd love to hear if anybody has any ideas.
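Coming back to the re-populating-a-broker workaround from the comment above: newer AdminClients (Kafka 2.4+) expose the same reassignment mechanism that kafka-reassign-partitions.sh drives, so the "small number of partitions at a time" idea can be scripted directly on the client side (this is not the broker-side config proposed in the issue). Topic names, replica lists, the broker id, batch size, and poll interval below are all placeholders; treat this as a sketch of the shape of the workaround:
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.TopicPartition;

public class GradualRefill {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1.example.com:9092");

        // Placeholder: partition -> desired replica list with the rebuilt broker (id 4 here)
        // appended as a follower. In practice you would build this map from describeTopics().
        Map<TopicPartition, Optional<NewPartitionReassignment>> wanted = new HashMap<>();
        wanted.put(new TopicPartition("my-topic", 0),
            Optional.of(new NewPartitionReassignment(Arrays.asList(1, 2, 4))));
        wanted.put(new TopicPartition("my-topic", 1),
            Optional.of(new NewPartitionReassignment(Arrays.asList(2, 3, 4))));

        int batchSize = 20;     // partitions to move per round (placeholder)
        long pollMs = 10_000L;  // how often to check for completion (placeholder)

        try (Admin admin = Admin.create(props)) {
            List<Map.Entry<TopicPartition, Optional<NewPartitionReassignment>>> entries =
                new ArrayList<>(wanted.entrySet());

            for (int i = 0; i < entries.size(); i += batchSize) {
                Map<TopicPartition, Optional<NewPartitionReassignment>> batch = new HashMap<>();
                entries.subList(i, Math.min(i + batchSize, entries.size()))
                       .forEach(e -> batch.put(e.getKey(), e.getValue()));

                // Kick off this batch of reassignments. This is the same mechanism that
                // kafka-reassign-partitions.sh drives, so the usual replication throttle
                // configs still apply if you want to cap bandwidth as well.
                admin.alterPartitionReassignments(batch).all().get();

                // Wait until the controller reports no reassignments in flight before
                // starting the next batch.
                while (!admin.listPartitionReassignments().reassignments().get().isEmpty()) {
                    Thread.sleep(pollMs);
                }
            }
        }
    }
}
{code}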