Hi, I have a problem with the latency of my kafka producer under some circumstances. We are running three kafka brokers on version 0.10.2.0 and three zookeepers on version 3.4.8. Our server.properties is at the end of this mail; the main producer property that we change is that we require acks=all, so at least 2 replicas must acknowledge each producer request, since we have min.insync.replicas=2. It all runs on our own servers, but in an OpenShift environment. The zookeeper pods write to local storage, but the kafka broker pods write to Ceph storage in such a way that a kafka broker's data is kept and re-assigned to the same broker on restart.
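For completeness, here is a minimal sketch of how our producer is set up (the broker list, topic name, serializers and the Java wiring here are illustrative placeholders rather than our exact client code; the one setting we deliberately change is acks=all):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            // placeholder broker list; our advertised listeners are of the form kafka-N:9092
            props.put("bootstrap.servers", "kafka-0:9092,kafka-1:9092,kafka-2:9092");
            // the setting we require: wait for acknowledgement from all in-sync replicas
            // (with min.insync.replicas=2 on the brokers, at least 2 must acknowledge)
            props.put("acks", "all");
            // illustrative serializers, not necessarily the ones we really use
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // "test-topic" is a placeholder topic name
                producer.send(new ProducerRecord<>("test-topic", "key", "value"));
            }
        }
    }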
I am including a link to kafka producer metrics that highlights the problem (the link is only valid for the next 7 days):

https://snapshot.raintank.io/dashboard/snapshot/fjfBMC09aBQcWWzj54uCunqMYhNt4ggO

This link has quite a lot of metrics, but the top two are about latency: request latency and queue time (I assume that the request latency does not include the time spent in the queue).

@09:29:20 a kafka pod was restarted; this pod was the one which was the overall zookeeper leader elector. This caused very large latency for our messages. The average is high, but we are particularly interested in the max latency; there was also very high queue time, which is just as important to us.

@09:31:00 I had to restart the test client that generates the load on the producer, as all 14 of its threads had stopped after waiting more than 5 seconds for a producer send.

@09:34:40 I ran a manual rebalance (a preferred replica election; see the command after the config below). This hardly caused a blip in the latency.

@09:38:20 a kafka pod was restarted, but this time not the one which was the overall zookeeper leader elector. This also caused large request latency and queue time.

@09:40:30 I ran another manual rebalance; again it hardly caused a blip.

What I find strange about this is that the rebalance itself seems fine. With a controlled shutdown, the broker is supposed to migrate its leaderships away before shutting down, so I would have thought everything would be off the closing broker and the latency of a controlled shutdown would be no worse than when I do a manual rebalance.

Please can someone help.

Tom

Our server.properties is:

broker.id=-1
listeners=PLAINTEXT://:9092
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=/mnt/data/logs
num.partitions=20
num.recovery.threads.per.data.dir=1
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=zookeeper-0:2181,zookeeper-1:2181,zookeeper-2:2181
zookeeper.connection.timeout.ms=6000
advertised.listeners=PLAINTEXT://kafka-0:9092
default.replication.factor=3
compression.type=gzip
delete.topic.enable=true
offsets.retention.minutes=10080
unclean.leader.election.enable=false
min.insync.replicas=2
auto.leader.rebalance.enable=false
leader.imbalance.check.interval.seconds=300
leader.imbalance.per.broker.percentage=10
inter.broker.protocol.version=0.10.2
log.message.format.version=0.10.2
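To be concrete about the manual rebalance mentioned above: since we have auto.leader.rebalance.enable=false, I trigger a preferred replica election by hand with the standard tool, roughly like this (zookeeper connect string taken from our server.properties):

    bin/kafka-preferred-replica-election.sh --zookeeper zookeeper-0:2181,zookeeper-1:2181,zookeeper-2:2181

Run with no --path-to-json-file argument this acts on all partitions, which is what I do in the steps above.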