Nicholas Feinberg created KAFKA-9227:
----------------------------------------

             Summary: Broker restart/snapshot times increase after upgrade from 1.1.0 to 2.3.1
                 Key: KAFKA-9227
                 URL: https://issues.apache.org/jira/browse/KAFKA-9227
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 2.3.1
         Environment: Ubuntu 18, EC2, d2.8xlarge
            Reporter: Nicholas Feinberg


I've been looking at upgrading my cluster from 1.1.0 to 2.3.1. While testing, 
I've noticed that shutting brokers down seems to take consistently longer on 
2.3.1. Specifically, the process of 'creating snapshots' seems to take several 
times longer than it did on 1.1.0. On a small testing setup, the time needed to 
create snapshots and shut down goes from ~20s to ~120s; with production-scale 
data, it goes from ~2min to ~30min.

Each test host carries about 384 partition replicas (7 topics * 128 partitions 
* 3x replication / 7 brokers). The largest prod cluster has about 1344 
partitions/broker; the smallest, which is also the slowest, has 2560.
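
For anyone checking the test-host figure, here is a minimal sketch of the 
arithmetic. It assumes replicas are spread evenly across brokers; the function 
name is just for illustration.

```python
def replicas_per_broker(topics, partitions_per_topic, replication_factor, brokers):
    """Average number of partition replicas hosted by each broker.

    Assumes replicas are distributed evenly across all brokers.
    """
    total_replicas = topics * partitions_per_topic * replication_factor
    return total_replicas / brokers


# Test setup from this report: 7 topics, 128 partitions each, 3x replication, 7 brokers.
print(replicas_per_broker(7, 128, 3, 7))  # 384.0
```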

In our largest prod cluster (16 d2.8xlarge brokers, 200k msg/s, 300 MB/s), a 
restart cycle takes about 3 minutes on 1.1.0 (counting ISR-rejoin time) and 
about 30 minutes on 2.3.1. The only other change we made between versions was 
increasing the heap size from 8G to 16G.
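
To put rough numbers on the slowdown, here is a small sketch that computes the 
ratio for each case mentioned above. The inputs are the approximate timings 
quoted in this report, converted to seconds; they are rounded estimates, not 
precise measurements, and the labels are only for illustration.

```python
def slowdown(before_s, after_s):
    """Return how many times longer the 2.3.1 timing is than the 1.1.0 timing."""
    return after_s / before_s


# (1.1.0 seconds, 2.3.1 seconds), taken from the approximate figures in this report.
timings = {
    "test setup, snapshot + shutdown": (20, 120),
    "production-scale data, snapshot + shutdown": (2 * 60, 30 * 60),
    "largest prod cluster, full restart cycle": (3 * 60, 30 * 60),
}

for label, (before_s, after_s) in timings.items():
    print(f"{label}: ~{slowdown(before_s, after_s):.0f}x slower")
```

So the regression appears to grow with data size, from roughly 6x on the test 
setup to roughly 10-15x at production scale.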

To allow myself to roll back, I'm still using the 1.1 versions of the 
inter-broker protocol and the message format - is it possible that those could 
slow things down in 2.3.1? If not, any ideas what else could be at fault, or 
what I could do to narrow down the issue further?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
