We are running a 10-node Kafka cluster in a test setup with a replication factor of 3
and topics with min.insync.replicas set to 2.
Recently I noticed that a few nodes halted on restart after a multiple-node failure,
with this FATAL message:

"Halting because log truncation is not allowed for topic 1613_spam, Current 
leader 2003's latest offset 20 is less than replica 2004's latest offset 21 
(kafka.server.ReplicaFetcherThread)"
My understanding is that this can happen if a slow replica in the ISR does not
have the latest committed message and high watermark. Since
min.insync.replicas is 2, a write is committed once it completes on the leader
and one follower. Since replica.lag.time.max.ms is 10000, a slow replica can
stay in the ISR for up to 10 seconds without fetching any messages. If the
leader goes down within that interval and the slow follower is elected leader,
the new leader ends up with a latest offset lower than that of the other
follower. Is this explanation correct, or am I missing something? What is the
best way to recover the committed message in such a situation?
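
To make the scenario concrete, here is a minimal Python sketch of the check that the FATAL message appears to describe. It is not Kafka code: the Replica class and the check_follower_on_restart function are hypothetical names, the broker ids and offsets are taken from the log message above, and the "truncate" branch is an assumption about what happens when unclean leader election is enabled.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    broker_id: int
    log_end_offset: int  # latest offset in this replica's local log


def check_follower_on_restart(leader, follower, unclean_election_enabled):
    """Hypothetical model of the decision a restarting follower makes.

    Mirrors the wording of the FATAL message only; it is not the real
    ReplicaFetcherThread logic.
    """
    if leader.log_end_offset < follower.log_end_offset:
        if not unclean_election_enabled:
            # Truncating would discard a message the follower already has,
            # so the broker halts instead.
            return "halt"
        # Assumption: with unclean election enabled, the follower truncates
        # its log to the leader's latest offset.
        return "truncate"
    # Leader is at or ahead of the follower: normal fetching resumes.
    return "fetch"


# Values from the log message: leader 2003 at offset 20, replica 2004 at 21.
new_leader = Replica(broker_id=2003, log_end_offset=20)   # the slow replica
restarted = Replica(broker_id=2004, log_end_offset=21)    # has the extra message

action = check_follower_on_restart(new_leader, restarted,
                                   unclean_election_enabled=False)
print(f"replica {restarted.broker_id} vs leader {new_leader.broker_id}: {action}")
```

With unclean.leader.election.enable set to false, as in our setup, the sketch reaches the "halt" branch, which matches what we saw on restart.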
 
We are running the cluster with the following settings:
- replication factor: 3
- min.insync.replicas: 2
- request.required.acks: -1
- unclean.leader.election.enable: false
- replica.lag.time.max.ms: 10000
- replica.high.watermark.checkpoint.interval.ms: 1000


Thanks 
Avanish
