Hi All,

In our 3-node test cluster running Kafka 0.10.0, we hit this error:

FATAL [2017-07-06 07:30:42,962] kafka.server.ReplicaFetcherThread:[Logging$class:fatal:110] - [ReplicaFetcherThread-0-0] - [ReplicaFetcherThread-0-0], Halting because log truncation is not allowed for topic Topic3, Current leader 0's latest offset 41170020 is less than replica 3's latest offset 41170083

The Kafka cluster is configured with replication_factor: 3, min_isr: 2, and unclean_leader_election: disabled.

There were some machine issues where node 1 crashed and rejoined after 30 seconds or so. Ideally, since min_isr is set to 2, another node should have taken over, but for some reason the ISR for some of the topic partitions consisted of only node 1 just before node 1 crashed.

This appears similar to the issues described in:
https://issues.apache.org/jira/browse/KAFKA-3861
https://issues.apache.org/jira/browse/KAFKA-3410

What I wanted to know is:

(a) How should such errors be handled? The ISR size is determined dynamically, and it is quite possible that in times of trouble (for example, a network disruption before crashing) the troubled node will shrink its ISR to just itself.

(b) Is this issue addressed in any way in later Kafka versions such as 0.11.0? Will https://issues.apache.org/jira/browse/KAFKA-1211 prevent this situation?

-- thanks, gaurav
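
To make the offset gap in the FATAL message above concrete, here is a minimal sketch that pulls the leader and replica offsets out of that log line and computes how many messages the replica is ahead by. The regular expression is written against the wording of this one message only, and the names LOG_LINE, PATTERN and offset_gap are hypothetical helpers for illustration, not part of Kafka.

```python
import re

# The relevant part of the FATAL message quoted in the post.
LOG_LINE = (
    "Halting because log truncation is not allowed for topic Topic3, "
    "Current leader 0's latest offset 41170020 is less than "
    "replica 3's latest offset 41170083"
)

# Assumption: the message wording matches the line above exactly.
PATTERN = re.compile(
    r"topic (?P<topic>[^,\s]+), "
    r"Current leader (?P<leader>\d+)'s latest offset (?P<leader_offset>\d+) "
    r"is less than replica (?P<replica>\d+)'s latest offset (?P<replica_offset>\d+)"
)


def offset_gap(line):
    """Return topic, broker ids, offsets and the gap, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    leader_offset = int(match.group("leader_offset"))
    replica_offset = int(match.group("replica_offset"))
    return {
        "topic": match.group("topic"),
        "leader": int(match.group("leader")),
        "replica": int(match.group("replica")),
        "leader_offset": leader_offset,
        "replica_offset": replica_offset,
        # Number of messages the replica holds beyond the leader's log end.
        "gap": replica_offset - leader_offset,
    }


if __name__ == "__main__":
    print(offset_gap(LOG_LINE))
```

For the line in this post the gap is 63 messages. That is how much the replica would have to truncate to follow the current leader, which it refuses to do because unclean leader election is disabled.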