Thomas Heinze created KAFKA-13367:
-------------------------------------

             Summary: Performance Degradation during introducing Network Delay
                 Key: KAFKA-13367
                 URL: https://issues.apache.org/jira/browse/KAFKA-13367
             Project: Kafka
          Issue Type: Bug
         Environment: We are running Kafka 2.5 on m4.xlarge VMs on AWS.
            Reporter: Thomas Heinze

Hi Kafka community,

We are running a few chaos experiments to simulate Kafka's behaviour during issues in the data center. To simulate a slow network, we run the following command on two out of six brokers (the brokers are spread across 3 AZs on AWS; we run the command on two brokers in the same AZ):

{code:java}
tc qdisc add dev eth0 root netem delay x ms
{code}

At the same time, we are running some Kafka producers that insert roughly 4k messages per second into a Kafka topic with 10 partitions and 3 replicas, using min-isr=2.

What we observe is the following:
* *Introducing a 1000 ms delay*: The producers see significant response time delays, and the throughput drops to 2k messages per second.
* *Introducing a 2000 ms delay*: The producer delay increases further, and the throughput drops to 300 messages per second.
* *Introducing a 5000 ms delay*: The Kafka cluster removes the slow brokers from the list of active (in-sync) replicas, and the incoming message rate on the remaining brokers increases. This is the expected behaviour imho.

What parameters would influence this behaviour? How can I make Kafka behave as it does with the 5-second delay even for smaller delays? We would like to guarantee roughly a certain throughput, even if one AZ is very slow.

I already tried setting "replica.lag.time.max.ms" to very small values, but all I observe is that Kafka constantly adds the replicas on the slow nodes to the ISR and removes them again.
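
For anyone reproducing this, the delay steps described above could be scripted roughly as follows. This is a minimal dry-run sketch, not our exact tooling: the helper names netem_add_cmd and netem_del_cmd are hypothetical, eth0 is taken from the command above, and the script only prints the tc commands instead of running them (running them requires root on the broker host). The delay is written as a single token such as 1000ms, which is the form tc examples normally use.

```shell
#!/bin/sh
# Dry-run sketch: prints the netem commands for each delay step.
# The helper names and the choice of eth0 are assumptions for illustration.

# Print the command that adds a fixed delay (in ms) on an interface.
netem_add_cmd() {
    dev="$1"
    delay_ms="$2"
    echo "tc qdisc add dev ${dev} root netem delay ${delay_ms}ms"
}

# Print the command that removes the root qdisc again, restoring normal latency.
netem_del_cmd() {
    dev="$1"
    echo "tc qdisc del dev ${dev} root"
}

# The three delays from the observations above: 1000, 2000 and 5000 ms.
for delay in 1000 2000 5000; do
    netem_add_cmd eth0 "$delay"
    netem_del_cmd eth0
done
```

Printing the commands first makes it easy to review them before applying them to the two brokers in the affected AZ, and the delete step keeps one run's delay from leaking into the next.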