[ https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031916#comment-17031916 ]
Evan Williams edited comment on KAFKA-4084 at 2/6/20 9:04 PM:
--------------------------------------------------------------

[~junrao] I've implemented throttling now. Even with quite a high throttle and num.replica.fetchers=1, the fetcher threads still seem to be saturating my CPU (4 vCPU / 32 GB RAM / 6 GB Java heap). Any ideas why that may be? There are approximately 1,500 partitions for this broker to replicate, but I still wonder why the throttle is not restricting the load. Unfortunately it's still causing issues for clients, as I can see the total incoming message rate drop quite a bit.

num.network.threads=20
num.io.threads=40 (average request handler idle percentage is around 20%)

  PID USER  PR NI  VIRT  RES   SHR S %CPU %MEM   TIME+ COMMAND
 7134 kafka 20  0 38.9g 6.0g 79628 R 99.9 20.0 4:53.34 ReplicaFetcherT
 7149 kafka 20  0 38.9g 6.0g 79628 R 93.8 20.0 4:54.10 ReplicaFetcherT
 7145 kafka 20  0 38.9g 6.0g 79628 R 50.0 20.0 4:03.96 ReplicaFetcherT
 7135 kafka 20  0 38.9g 6.0g 79628 R 43.8 20.0 4:50.38 ReplicaFetcherT

After a second look, it's looking more like this:

  PID USER  PR NI  VIRT  RES   SHR S %CPU %MEM   TIME+ COMMAND
15031 kafka 20  0 52.0g 1.8g 44940 R 13.3  6.0 0:24.47 data-plane-kafk
15037 kafka 20  0 52.0g 1.8g 44940 S 13.3  6.0 0:24.53 data-plane-kafk
15038 kafka 20  0 52.0g 1.8g 44940 R 13.3  6.0 0:24.34 data-plane-kafk
15044 kafka 20  0 52.0g 1.8g 44940 R 13.3  6.0 0:24.23 data-plane-kafk
15047 kafka 20  0 52.0g 1.8g 44940 R 13.3  6.0 0:24.73 data-plane-kafk
15050 kafka 20  0 52.0g 1.8g 44940 R 13.3  6.0 0:24.13 data-plane-kafk
15052 kafka 20  0 52.0g 1.8g 44940 R 13.3  6.0 0:24.38 data-plane-kafk
15054 kafka 20  0 52.0g 1.8g 44940 R 13.3  6.0 0:23.72 data-plane-kafk
15060 kafka 20  0 52.0g 1.8g 44940 R 13.3  6.0 0:24.62 data-plane-kafk
15063 kafka 20  0 52.0g 1.8g 44940 R 13.3  6.0 0:24.04 data-plane-kafk
15075 kafka 20  0 52.0g 1.8g 44940 R 13.3  6.0 0:38.08 data-plane-kafk
15076 kafka 20  0 52.0g 1.8g 44940 R 13.3  6.0 0:41.41 data-plane-kafk
15077 kafka 20  0 52.0g 1.8g 44940 R 13.3  6.0 0:37.81 data-plane-kafk
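(For context on the throttle mentioned above: replication throttles are dynamic broker configs, which can be applied without a restart, e.g. through the Java Admin client as in the minimal sketch below. The broker id, bootstrap address, and 10 MB/s rate are illustrative placeholders. Note that the rate configs only apply to replicas matched by the topic-level {{leader.replication.throttled.replicas}} / {{follower.replication.throttled.replicas}} configs, so those must also be set, typically to {{*}}, for the throttle to take effect.)

{code:java}
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;

public class SetReplicationThrottle {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // Dynamic per-broker config; "1" is a placeholder broker id.
            ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "1");
            String bytesPerSec = "10485760"; // 10 MB/s, an illustrative rate

            Collection<AlterConfigOp> ops = Arrays.asList(
                new AlterConfigOp(new ConfigEntry("leader.replication.throttled.rate", bytesPerSec),
                                  AlterConfigOp.OpType.SET),
                new AlterConfigOp(new ConfigEntry("follower.replication.throttled.rate", bytesPerSec),
                                  AlterConfigOp.OpType.SET));

            // Apply both throttle rates to the broker and wait for the update.
            admin.incrementalAlterConfigs(Collections.singletonMap(broker, ops)).all().get();
        }
    }
}
{code}

(The same configs can also be set from the kafka-configs.sh CLI against the broker entity.)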
> automated leader rebalance causes replication downtime for clusters with too
> many partitions
> -----------------------------------------------------------------------------
>
>                 Key: KAFKA-4084
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4084
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
>            Reporter: Tom Crayford
>            Priority: Major
>              Labels: reliability
>             Fix For: 1.1.0
>
> If you enable {{auto.leader.rebalance.enable}} (which is on by default) and
> you have a cluster with many partitions, there is a severe amount of
> replication downtime following a restart. This causes
> {{UnderReplicatedPartitions}} to fire, and replication is paused.
>
> This is because the current automated leader rebalance mechanism changes
> leaders for *all* imbalanced partitions at once, instead of doing it
> gradually. This effectively stops all replica fetchers in the cluster
> (assuming there are enough imbalanced partitions) and restarts them. This
> can take minutes on busy clusters, during which no replication is happening
> and user data is at risk. Clients with {{acks=-1}} also see issues at this
> time, because replication is effectively stalled.
>
> To quote Todd Palino from the mailing list:
>
> bq. There is an admin CLI command to trigger the preferred replica election
> manually. There is also a broker configuration "auto.leader.rebalance.enable"
> which you can set to have the broker automatically perform the PLE when
> needed. DO NOT USE THIS OPTION. There are serious performance issues when
> doing so, especially on larger clusters. It needs some development work that
> has not been fully identified yet.
>
> This setting is extremely useful for smaller clusters, but with high
> partition counts it causes the huge issues stated above.
>
> One potential fix could be adding a new configuration for the number of
> partitions to do automated leader rebalancing for at once, and *stopping*
> once that number of leader rebalances are in flight, until they're done.
> There may be better mechanisms, and I'd love to hear if anybody has any
> ideas.
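(A rough client-side sketch of the batching idea in the last paragraph, assuming the {{Admin#electLeaders}} API available since Kafka 2.4: with {{auto.leader.rebalance.enable=false}}, trigger preferred leader election for a bounded batch of partitions and wait for each batch to finish before starting the next. The topic name, partition count, and batch size below are illustrative placeholders, not anything prescribed by this ticket.)

{code:java}
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.ElectionType;
import org.apache.kafka.common.TopicPartition;

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Properties;
import java.util.Set;

public class BatchedPreferredLeaderElection {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        // Placeholder partition list; in practice this would be the set of
        // partitions whose current leader differs from the preferred
        // (first-listed) replica.
        List<TopicPartition> imbalanced = new ArrayList<>();
        for (int p = 0; p < 1500; p++) {
            imbalanced.add(new TopicPartition("example-topic", p));
        }

        int batchSize = 50; // illustrative cap on in-flight leader changes

        try (Admin admin = Admin.create(props)) {
            for (int i = 0; i < imbalanced.size(); i += batchSize) {
                Set<TopicPartition> batch = new HashSet<>(
                        imbalanced.subList(i, Math.min(i + batchSize, imbalanced.size())));
                // Move at most batchSize leaders, then block until this
                // election completes before touching the next batch.
                admin.electLeaders(ElectionType.PREFERRED, batch).all().get();
            }
        }
    }
}
{code}

(Whether done inside the controller or by an external tool, the key property is the same: an upper bound on concurrent leadership moves, so that the cluster's replica fetchers are never all restarted at once.)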