For Kafka with ZooKeeper, the recovery time is proportional to the number of partitions in the cluster, so the behaviour you describe is consistent with what one would expect: it will take time. KRaft-based Kafka clusters (since Kafka v3.3) cope much better with a large number of partitions, such as yours. That is one thing you should consider: upgrading to a newer KRaft-based cluster. On the current setup, you can try tuning ZooKeeper for better throughput and latency.
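To make the "proportional to the number of partitions" point concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a purely linear model. The function name, the partition count and the per-partition cost below are all hypothetical numbers chosen for illustration, not measurements from your cluster or figures from the Kafka documentation.

```python
def estimate_election_seconds(partitions_to_move: int, per_partition_ms: float) -> float:
    """Linear model: total time grows in proportion to the partition count.

    Both arguments are assumptions supplied by the caller; nothing here
    is read from a real Kafka or ZooKeeper cluster.
    """
    if partitions_to_move < 0 or per_partition_ms < 0:
        raise ValueError("inputs must be non-negative")
    return partitions_to_move * per_partition_ms / 1000.0


# Hypothetical example: 12,000 partitions needing a new leader,
# with an assumed cost of 10 ms per partition.
print(estimate_election_seconds(12_000, 10.0))
```

Under this model, halving either the number of partitions per broker or the per-partition cost (for example through faster ZooKeeper writes) would halve the estimate, which is why both directions are worth looking at.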

On Fri, Aug 23, 2024 at 12:40 PM Akash Jain <akashjain0...@gmail.com> wrote:

> HI Atul you use the word 'leader'. You mean the 'controller'? Or you
> referring to the leader for each of the partitions?
>
> On Fri, Aug 23, 2024 at 7:44 AM Atul Sharma
> <atul.sharma.ma...@itbhu.ac.in.invalid> wrote:
>
>> Hi,
>> We are currently facing a prolonged leader election time, approx 2 mins, in
>> a Kafka cluster (version 2.8.2) that is configured with Zookeeper. This
>> cluster has large number of topic partitions.
>>
>> The issue arises during the rolling restarts of the servers in the Kafka
>> cluster.
>>
>> This extended leader election time is causing communication issues and
>> unavailability for producers and consumers as they are unable to connect to
>> Kafka within this timeframe. Any recommendations on reducing the leader
>> election time?
>>
>> Issue is occurring on Kafka 2.8.2 with Zookeeper