We are running a 10-node Kafka cluster in a test setup, with a replication factor of 3 and topics with min.insync.replicas set to 2. Recently I noticed that a few nodes halted on restart after multiple node failures, with this FATAL message:

"Halting because log truncation is not allowed for topic 1613_spam, Current leader 2003's latest offset 20 is less than replica 2004's latest offset 21 (kafka.server.ReplicaFetcherThread)"

My understanding is that this can happen if there is a slow replica in the ISR that does not have the latest committed message and high watermark. Since min.insync.replicas is 2, a write is committed once it completes on the leader and one follower. Since replica.lag.time.max.ms is 10000, a slow replica can remain in the ISR for up to 10 seconds without fetching any messages. If the leader goes down within that interval and the slow follower is elected leader, the new leader will have a lower offset than the other follower.

Is this explanation correct, or am I missing something? What is the best way to recover the committed message in such a situation?

We are running the cluster with the following settings:

- replication factor is 3
- min.insync.replicas is set to 2
- request.required.acks is -1
- unclean.leader.election.enable is set to false
- replica.lag.time.max.ms is 10000
- replica.high.watermark.checkpoint.interval.ms is 1000

Thanks,
Avanish

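To make the suspected sequence concrete, here is a toy model of it in Python. It is only a sketch of the scenario described above, not Kafka's actual code: the ISR rule is reduced to a single time check, broker ID 2001 for the failed leader and the fetch timings are hypothetical, and the election of 2003 is assumed rather than modelled.

```python
from dataclasses import dataclass

REPLICA_LAG_TIME_MAX_MS = 10000  # replica.lag.time.max.ms from the settings above


@dataclass
class Replica:
    broker_id: int
    log_end_offset: int       # latest offset held by this replica
    ms_since_last_fetch: int  # hypothetical: time since it last fetched from the leader


def in_isr(replica: Replica) -> bool:
    """Simplified ISR rule: a replica stays in the ISR until it lags past the limit."""
    return replica.ms_since_last_fetch <= REPLICA_LAG_TIME_MAX_MS


def needs_truncation(new_leader: Replica, follower: Replica) -> bool:
    """True when the follower holds offsets beyond the new leader's log end."""
    return follower.log_end_offset > new_leader.log_end_offset


# State just before the failure. Broker ID 2001 is a hypothetical old leader;
# 2003 and 2004 and their offsets come from the FATAL message.
old_leader = Replica(broker_id=2001, log_end_offset=21, ms_since_last_fetch=0)
fast_follower = Replica(broker_id=2004, log_end_offset=21, ms_since_last_fetch=50)
slow_follower = Replica(broker_id=2003, log_end_offset=20, ms_since_last_fetch=4000)

# The slow follower is behind but still inside the lag window, so it is electable.
assert in_isr(slow_follower)

# Assume the old leader fails and 2003 is elected (not modelled here).
new_leader = slow_follower
if needs_truncation(new_leader, fast_follower):
    print(
        f"replica {fast_follower.broker_id} is ahead of leader {new_leader.broker_id}: "
        f"{fast_follower.log_end_offset} > {new_leader.log_end_offset}"
    )
```

In this model replica 2004 would have to truncate its log to follow the new leader, which is the condition the FATAL message reports.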