[ https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17042297#comment-17042297 ]

James Cheng commented on KAFKA-4084:
------------------------------------

Hi
 
Just wanted to chime in with some of our experience on this front.
 
We've also encountered issues like this.
We run our brokers with roughly 10k partitions per broker.
 
We have auto.leader.rebalance.enable=true. It sometimes happens to us that a 
normal restart of a broker will cause our cluster to get into a bad state for 
1-2 hours.
 
During that time, what we've seen is:
 * Auto-leader rebalance triggers and moves a large number of leaders to/from the various brokers.
 * The broker that is receiving the leaders takes a long time to assume leadership. If you send it a metadata request, it will say it is not the leader for those partitions.
 * The brokers that are giving up leadership do so quickly. If you send them a metadata request, they will say they are not the leader for those partitions.
 * Clients receive notification that leadership has transitioned. But since clients get their metadata from an arbitrary broker in the cluster, they receive different metadata depending on which broker they contact. Regardless, they trust that metadata and attempt to fetch from the broker they believe is the leader. At this point, though, all brokers are claiming that they are not the leader for those partitions, so clients attempt a fetch, get an error, refetch metadata, and try again. You get a thundering herd problem, where all clients hammer the cluster.
 * Broker request queues will be busy attempting to assume leadership and handle replication, as well as answering and rejecting all of these requests.
 * During this time, each broker will report UnderReplicatedPartitions=0. The reason is that a broker only reports a partition as under-replicated if it believes it is the leader for that partition and some followers are missing. In this case, the brokers do not believe they are leaders for these partitions, so they will not report them as under-replicated. Similarly, I believe OfflinePartitions=0 as well. But from a user's point of view, these partitions are effectively inaccessible, because no one will serve traffic for them.

 
The metrics we've seen during this are:
 * ControllerQueueSize goes to its maximum
 * RequestHandlerIdlePercent drops to 0%
 * (maybe) RequestQueueSize goes to its maximum
 * TotalTimeMs for Produce requests goes into the 20000ms range
 * TotalTimeMs for FetchFollower goes into the 1000ms range (which is odd, because the default setting is 500ms)
 * TotalTimeMs for FetchConsumer goes into the 1000ms range (which is odd, because the default setting is 500ms)

 
We've also seen similar behavior when we replace the hard drive on a broker and it has to re-replicate its entire contents. We don't yet definitively understand that one.
 
We've worked with Confluent on this. The advice from Confluent seems to be that our brokers are simply underpowered for our situation. From a metrics point of view, the evidence that points to this is:
 * Normal host CPU utilization is 45-60%
 * During these times, CPU utilization climbs to 60-70%
 * High ResponseSendTimeMs during these times
 * NetworkProcessorIdlePercent drops to 0%

From our side, we're not sure *what* specifically we're underpowered *for*. Is it the number of partitions? The number of clients plus the number of retries? The amount of network data? All The Things? We're not sure.
 
One idea on how to mitigate this is to set
{code:java}
auto.leader.rebalance.enable=false{code}
and then use kafka-reassign-partitions.sh to move leaders just a few at a time, until leadership is rebalanced.
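To make the idea concrete, here is a rough sketch of what one small batch might look like. The topic name, broker ids, and ZooKeeper connect string are placeholders, and the exact tool flags depend on your Kafka version (newer releases replace kafka-preferred-replica-election.sh with kafka-leader-election.sh):
{code:bash}
# Placeholder names throughout: "my-topic", broker ids 1-3, and zk1:2181.
# Step 1: reorder the replica lists so the broker that should lead each
# partition is listed first (making it the preferred replica). The replica
# sets themselves do not change, so no data has to move.
cat > leader-batch-1.json <<'EOF'
{
  "version": 1,
  "partitions": [
    {"topic": "my-topic", "partition": 0, "replicas": [2, 1, 3]},
    {"topic": "my-topic", "partition": 1, "replicas": [3, 1, 2]}
  ]
}
EOF
bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
  --reassignment-json-file leader-batch-1.json --execute

# Step 2: trigger a preferred-replica election for only these partitions,
# so leadership moves for just this small batch.
cat > leader-batch-1-election.json <<'EOF'
{
  "partitions": [
    {"topic": "my-topic", "partition": 0},
    {"topic": "my-topic", "partition": 1}
  ]
}
EOF
bin/kafka-preferred-replica-election.sh --zookeeper zk1:2181 \
  --path-to-json-file leader-batch-1-election.json
{code}
Repeat with the next small batch once the previous one has settled and the cluster metrics look healthy again.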
 
You can do something similar when re-populating a broker from scratch. You can remove the (empty) broker from all partitions, and then use kafka-reassign-partitions.sh to add the broker as a follower to a small number of partitions at a time. That lets you control how many partitions move at once. kafka-reassign-partitions.sh also lets you specify a replication throttle, so you can limit the network bandwidth the moves consume. At that point you are in control of how *many* partitions are moving, as well as how much *network* they use. And lastly, you can control not just when replicas become followers, but also when they become leaders, and how quickly. A rough sketch of a throttled, small-batch reassignment is below.
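For example, a throttled batch that re-adds a replaced broker (here broker 4) to a couple of partitions might look something like this; again, the topic name, broker ids, throttle value, and connect string are placeholders:
{code:bash}
# Placeholder names throughout: "my-topic", broker ids 1-4, and zk1:2181.
# Add broker 4 back as a follower for a small batch of partitions, with a
# replication throttle (bytes/sec) so the copy doesn't saturate the network.
cat > add-broker-4-batch-1.json <<'EOF'
{
  "version": 1,
  "partitions": [
    {"topic": "my-topic", "partition": 0, "replicas": [1, 2, 4]},
    {"topic": "my-topic", "partition": 1, "replicas": [2, 3, 4]}
  ]
}
EOF
bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
  --reassignment-json-file add-broker-4-batch-1.json \
  --execute --throttle 50000000

# When the batch has caught up, --verify confirms completion and removes the
# throttle; then submit the next batch.
bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
  --reassignment-json-file add-broker-4-batch-1.json --verify
{code}
The same pattern of small batches plus a throttle keeps the recovery rate under your control.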
 
All of this is unfortunately more operational work, but it at least puts you in control of things.
 
Our cluster specs are:
 * 4 vCPU, 32 GiB memory, st1-type EBS volumes
 * 10k partitions per broker
 * unknown number of clients
 * low network bytes in/out (most partitions not receiving active traffic)
 * num.network.threads=8
 * num.io.threads=16

 

> automated leader rebalance causes replication downtime for clusters with too 
> many partitions
> --------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-4084
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4084
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
>            Reporter: Tom Crayford
>            Priority: Major
>              Labels: reliability
>             Fix For: 1.1.0
>
>
> If you enable {{auto.leader.rebalance.enable}} (which is on by default), and 
> you have a cluster with many partitions, there is a severe amount of 
> replication downtime following a restart. This causes 
> `UnderReplicatedPartitions` to fire, and replication is paused.
> This is because the current automated leader rebalance mechanism changes 
> leaders for *all* imbalanced partitions at once, instead of doing it 
> gradually. This effectively stops all replica fetchers in the cluster 
> (assuming there are enough imbalanced partitions), and restarts them. This 
> can take minutes on busy clusters, during which no replication is happening 
> and user data is at risk. Clients with {{acks=-1}} also see issues at this 
> time, because replication is effectively stalled.
> To quote Todd Palino from the mailing list:
> bq. There is an admin CLI command to trigger the preferred replica election 
> manually. There is also a broker configuration “auto.leader.rebalance.enable” 
> which you can set to have the broker automatically perform the PLE when 
> needed. DO NOT USE THIS OPTION. There are serious performance issues when 
> doing so, especially on larger clusters. It needs some development work that 
> has not been fully identified yet.
> This setting is extremely useful for smaller clusters, but with high 
> partition counts causes the huge issues stated above.
> One potential fix could be adding a new configuration for the number of 
> partitions to do automated leader rebalancing for at once, and *stop* once 
> that number of leader rebalances are in flight, until they're done. There may 
> be better mechanisms, and I'd love to hear if anybody has any ideas.


