Thomas Heinze created KAFKA-13367:
-------------------------------------

             Summary: Performance Degradation during introducing Network Delay
                 Key: KAFKA-13367
                 URL: https://issues.apache.org/jira/browse/KAFKA-13367
             Project: Kafka
          Issue Type: Bug
         Environment: We are running Kafka 2.5 on m4.xlarge VMs on AWS.
            Reporter: Thomas Heinze


Hi Kafka community,

We are running a few chaos experiments to see how Kafka behaves during 
problems in the data center. To simulate a slow network, we run the following 
command on two of our six brokers. The brokers are spread across 3 AZs on AWS, 
and both affected brokers are in the same AZ:
{code:java}
tc qdisc add dev eth0 root netem delay x ms 
 {code}
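
For anyone reproducing this, the following is a minimal sketch of how the delay could be applied, inspected, and removed again. The function names and the TC override variable are hypothetical helpers, not part of the original setup, and the 1000 ms value is just one of the delays tested in this report. Note that a netem qdisc on the root of an interface delays outgoing packets only.

```shell
#!/bin/sh
# Sketch only: add_delay, show_delay, remove_delay and the TC variable are
# hypothetical helpers. The real commands must be run as root on the broker.

# Add a fixed delay (in milliseconds) to packets leaving the interface.
add_delay() {
    "${TC:-tc}" qdisc add dev "$1" root netem delay "${2}ms"
}

# Show the queueing disciplines currently attached to the interface.
show_delay() {
    "${TC:-tc}" qdisc show dev "$1"
}

# Delete the root qdisc to end the experiment and restore the default.
remove_delay() {
    "${TC:-tc}" qdisc del dev "$1" root
}

# Example (as root):
#   add_delay eth0 1000
#   show_delay eth0
#   remove_delay eth0
```

Setting TC=echo prints the arguments instead of executing tc, which is handy for checking the command before touching a live broker.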
 
 At the same time we are running several Kafka producers that write roughly 4k 
messages per second to a topic with 10 partitions, 3 replicas, and 
min.insync.replicas=2. We observe the following:
 * *1000 ms delay*: The producers see significantly higher response times, and 
throughput drops to 2k messages per second.
 * *2000 ms delay*: Producer latency increases further, and throughput drops to 
300 messages per second.
 * *5000 ms delay*: The Kafka cluster removes the slow brokers from the set of 
in-sync replicas (ISR), and the incoming message rate on the remaining 
brokers increases. This is the expected behaviour imho.
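
The topic layout described above could be recreated roughly as follows. This is a sketch under assumptions: the function name, the KAFKA_TOPICS override variable, the bootstrap address, and the topic name are all hypothetical; only the partition count, replication factor, and minimum ISR come from this report.

```shell
#!/bin/sh
# Sketch only: create_test_topic and KAFKA_TOPICS are hypothetical helpers.
# Arguments: $1 = bootstrap server (host:port), $2 = topic name.
create_test_topic() {
    "${KAFKA_TOPICS:-kafka-topics.sh}" --bootstrap-server "$1" \
        --create --topic "$2" \
        --partitions 10 --replication-factor 3 \
        --config min.insync.replicas=2
}

# Example:
#   create_test_topic broker1:9092 chaos-test
```

Keep in mind that min.insync.replicas only gates writes from producers that use acks=all; the acks setting used in these experiments is not stated here.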

Which parameters influence this behaviour? How can I make Kafka behave as it 
does with the 5000 ms delay even for smaller delays? We would like to 
guarantee roughly a certain throughput, even if one AZ is very slow.

I already tried setting "replica.lag.time.max.ms" to very small values, but the 
only effect I observe is that Kafka constantly removes the replicas on the 
slow nodes from the ISR and adds them back.
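
The ISR changes can be watched from the command line while the delay is active. The sketch below is one way to do that; the function names, the KAFKA_TOPICS override variable, and the example address and topic name are hypothetical.

```shell
#!/bin/sh
# Sketch only: describe_isr, under_replicated and KAFKA_TOPICS are
# hypothetical helpers around the stock kafka-topics.sh tool.

# Print leader, replicas and current ISR for every partition of a topic.
# Arguments: $1 = bootstrap server (host:port), $2 = topic name.
describe_isr() {
    "${KAFKA_TOPICS:-kafka-topics.sh}" --bootstrap-server "$1" \
        --describe --topic "$2"
}

# List only partitions whose ISR is currently smaller than the replica set.
# Argument: $1 = bootstrap server (host:port).
under_replicated() {
    "${KAFKA_TOPICS:-kafka-topics.sh}" --bootstrap-server "$1" \
        --describe --under-replicated-partitions
}

# Example, polling every 5 seconds:
#   while true; do under_replicated broker1:9092; sleep 5; done
```

Running the poll loop while changing "replica.lag.time.max.ms" should make the repeated shrinking and expanding of the ISR visible per partition.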


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
