Hi,
I have a problem with the latency of my kafka producer under some
circumstances. We are running three kafka brokers on version 0.10.2.0 and
three zookeepers on version 3.4.8. The server properties are below; the main
producer property we change is acks=all, so with min.insync.replicas=2 at
least two replicas must acknowledge each producer request. It all runs on our
own servers, but in an OpenShift environment. The zookeeper pods write to
local storage, but the kafka broker pods write to Ceph storage in such a way
that a kafka broker's data is kept and re-assigned to the same broker on
restart. I am including a link to kafka producer metrics that highlights the
problem (the link is only valid for the next 7 days):


https://snapshot.raintank.io/dashboard/snapshot/fjfBMC09aBQcWWzj54uCunqMYhNt4ggO
 
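
For context, a producer configured the way I describe above (acks=all against
min.insync.replicas=2 on the brokers) looks roughly like the sketch below. The
bootstrap servers, topic and serializers here are placeholders rather than our
exact client code:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // placeholder bootstrap servers - ours are the advertised listeners of the three brokers
        props.put("bootstrap.servers", "kafka-0:9092,kafka-1:9092,kafka-2:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // wait for all in-sync replicas; with min.insync.replicas=2 that means
        // at least two brokers must acknowledge every request
        props.put("acks", "all");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.send(new ProducerRecord<>("test-topic", "key", "value")); // placeholder topic
        producer.close();
    }
}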

This link has quite a lot of metrics, but the top two are about latency:
request latency and queue time (I assume that the request latency does not
include the time spent in the producer's queue).
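
As far as I know these two panels map to the standard producer metrics
request-latency-avg/max and record-queue-time-avg/max in the producer-metrics
group. If it is easier than the snapshot, here is a rough sketch of reading the
same values straight from the client (only the metric names matter here):

import java.util.Map;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public class LatencyMetricsDump {
    // 'producer' would be the producer from the sketch earlier in this mail.
    static void dumpLatencyMetrics(KafkaProducer<String, String> producer) {
        for (Map.Entry<MetricName, ? extends Metric> entry : producer.metrics().entrySet()) {
            String name = entry.getKey().name();
            if (name.startsWith("request-latency") || name.startsWith("record-queue-time")) {
                System.out.println(entry.getKey().group() + "/" + name + " = " + entry.getValue().value());
            }
        }
    }
}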
@09:29:20 a kafka pod was restarted; this pod was the broker that was the
overall leader elected via zookeeper (i.e. the controller). This caused very
large latency times for our messages - the average is high, but we are
particularly interested in the max latency. There was also a very high queue
time, which is just as important to us.
@09:31:00 I had to restart our test client (the one generating the load on the
producer) because all 14 of its threads had stopped, each having waited more
than 5 seconds for a producer send to complete (see the send sketch after this
timeline).
@09:34:40 I ran a manual rebalance - this hardly caused a blip in the latency.
@09:38:20 a kafka pod was restarted, but this time not the one acting as the
controller. This again caused large request latency and queue time.
@09:40:30 I ran a manual rebalance - again it hardly caused a blip.
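
To be clear about the 5 second wait mentioned above: each of the 14 test
client threads does, in effect, a synchronous send with a 5 second timeout,
roughly like this sketch (the 5 seconds is the real limit, the topic and
payload are placeholders):

import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class TestClientSendSketch {
    static void sendOne(KafkaProducer<String, String> producer) throws Exception {
        ProducerRecord<String, String> record =
                new ProducerRecord<>("test-topic", "key", "value"); // placeholder topic/payload
        try {
            // block until the broker acknowledges, or give up after 5 seconds
            RecordMetadata meta = producer.send(record).get(5, TimeUnit.SECONDS);
            System.out.println("acked at offset " + meta.offset());
        } catch (TimeoutException e) {
            // this is where our threads gave up during the broker restart
            System.out.println("send not acknowledged within 5 seconds");
        }
    }
}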


What I find strange about this is that the rebalance itself seems fine. With a
controlled shutdown, the broker is supposed to move its partition leaderships
to the other brokers before shutting down, so I would have thought everything
would already be off the closing broker and the latency of a controlled
shutdown would be no worse than when I do a manual rebalance.
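
For what it is worth, we have not overridden any of the controlled shutdown
settings in the server.properties below, so as far as I know the brokers are
running with the defaults, i.e. roughly:

controlled.shutdown.enable=true
controlled.shutdown.max.retries=3
controlled.shutdown.retry.backoff.ms=5000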


Please can someone help?

Tom


Our server.properties is:

broker.id=-1
listeners=PLAINTEXT://:9092
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=/mnt/data/logs
num.partitions=20
num.recovery.threads.per.data.dir=1
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=zookeeper-0:2181,zookeeper-1:2181,zookeeper-2:2181
zookeeper.connection.timeout.ms=6000
advertised.listeners=PLAINTEXT://kafka-0:9092
default.replication.factor=3
compression.type=gzip
delete.topic.enable=true
offsets.retention.minutes=10080
unclean.leader.election.enable=false
min.insync.replicas=2
auto.leader.rebalance.enable=false
leader.imbalance.check.interval.seconds=300
leader.imbalance.per.broker.percentage=10
inter.broker.protocol.version=0.10.2
log.message.format.version=0.10.2
