Hi Sabit,

Thanks for reporting the issue! This is the "last replica standing" problem, which is being addressed by KIP-966 (Eligible Leader Replicas). You can go through the blog post below to understand it in detail; I have also put a small producer-side sketch at the bottom of this mail, below the quoted thread.
https://jack-vanlightly.com/blog/2023/8/17/kafka-kip-966-fixing-the-last-replica-standing-issue#:~:text=Rule%20number%20one%20of%20leader,behind%20they%20also%20get%20removed

On Mon, Aug 12, 2024 at 2:03 AM Sabit Nepal <gta0...@gmail.com> wrote:

> Hello,
>
> We experienced a network partition in our Kafka cluster which left one
> broker unable to be reached by the other brokers, although it could still
> reach our ZooKeeper cluster. When this occurred, a number of
> topic-partitions shrank their ISR to just the impaired broker itself,
> halting progress on those partitions. As we had to take the broker
> instance offline and provision a replacement, the partitions were
> unavailable until the replacement instance came back up and resumed
> acting as the broker.
>
> However, reviewing our broker and producer settings, I'm not sure why it
> was possible for the leader to have accepted some writes that could not
> be replicated to the followers. Our topics use min.insync.replicas=2 and
> our producers use the acks=all configuration. In this scenario, with the
> changes not being replicated to the other followers, I'd expect the
> records to have failed to be written. We are however on an older version
> of Kafka - 2.6.1 - so I'm curious whether future versions have improved
> the behavior here?
>
> Some relevant logs:
>
> [Partition my-topic-one-120 broker=7] Shrinking ISR from 7,8,9 to 7. Leader: (highWatermark: 82153383556, endOffset: 82153383565). Out of sync replicas: (brokerId: 8, endOffset: 82153383556) (brokerId: 9, endOffset: 82153383561).
> [Partition my-topic-one-120 broker=7] ISR updated to [7] and zkVersion updated to [1367]
> [ReplicaFetcher replicaId=9, leaderId=7, fetcherId=5] Error in response for fetch request (type=FetchRequest, replicaId=9, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={
>   my-topic-one-120=(fetchOffset=75987953095, logStartOffset=75983970457, maxBytes=1048576, currentLeaderEpoch=Optional[772]),
>   my-topic-one-84=(fetchOffset=87734453342, logStartOffset=87730882175, maxBytes=1048576, currentLeaderEpoch=Optional[776]),
>   my-topic-one-108=(fetchOffset=72037212609, logStartOffset=72034727231, maxBytes=1048576, currentLeaderEpoch=Optional[776]),
>   my-topic-one-72=(fetchOffset=83006080094, logStartOffset=83002240584, maxBytes=1048576, currentLeaderEpoch=Optional[768]),
>   my-topic-one-96=(fetchOffset=79250375295, logStartOffset=79246320254, maxBytes=1048576, currentLeaderEpoch=Optional[763])
> }, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=965270777, epoch=725379656), rackId=)
> [Controller id=13 epoch=611] Controller 13 epoch 611 failed to change state for partition my-topic-one-120 from OnlinePartition to OnlinePartition
> kafka.common.StateChangeFailedException: Failed to elect leader for partition my-topic-one-120 under strategy ControlledShutdownPartitionLeaderElectionStrategy
> (later)
> kafka.common.StateChangeFailedException: Failed to elect leader for partition my-topic-one-120 under strategy OfflinePartitionLeaderElectionStrategy(false)
>
> Configuration for this topic:
>
> Topic: my-topic-one  PartitionCount: 250  ReplicationFactor: 3  Configs: min.insync.replicas=2,segment.bytes=536870912,retention.ms=1800000,unclean.leader.election.enable=false
>
> Outside of this topic, we also had a topic with a replication factor of 5
> impacted, as well as the __consumer_offsets topic, which we set to an RF
> of 5.
>
> [Partition my-topic-two-204 broker=7] Shrinking ISR from 10,9,7,11,8 to 7. Leader: (highWatermark: 86218167, endOffset: 86218170). Out of sync replicas: (brokerId: 10, endOffset: 86218167) (brokerId: 9, endOffset: 86218167) (brokerId: 11, endOffset: 86218167) (brokerId: 8, endOffset: 86218167).
>
> Configuration:
> Topic: my-topic-two  PartitionCount: 500  ReplicationFactor: 5  Configs: min.insync.replicas=2,segment.jitter.ms=3600000,cleanup.policy=compact,segment.bytes=1048576,max.compaction.lag.ms=9000000,min.compaction.lag.ms=4500000,unclean.leader.election.enable=false,delete.retention.ms=86400000,segment.ms=21600000
>
> [Partition __consumer_offsets-18 broker=7] Shrinking ISR from 10,9,7,11,8 to 7. Leader: (highWatermark: 4387657484, endOffset: 4387657485). Out of sync replicas: (brokerId: 9, endOffset: 4387657484) (brokerId: 8, endOffset: 4387657484) (brokerId: 10, endOffset: 4387657484) (brokerId: 11, endOffset: 4387657484).
>
> Configuration:
> Topic: __consumer_offsets  PartitionCount: 50  ReplicationFactor: 5  Configs: compression.type=producer,min.insync.replicas=2,cleanup.policy=compact,segment.bytes=104857600,unclean.leader.election.enable=false
>
> Other configurations:
> zookeeper.connection.timeout.ms=6000
> replica.lag.time.max.ms=8000
> zookeeper.session.timeout.ms=6000
> Producer request.timeout.ms=8500
> Producer linger.ms=10
> Producer delivery.timeout.ms=38510
>
> I saw a similar issue described in KAFKA-8702
> <https://issues.apache.org/jira/browse/KAFKA-8702> however I did not see
> a resolution there. Any help with this would be appreciated, thank you!
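
To illustrate what acks=all together with min.insync.replicas=2 actually guarantees on the producer side, here is a minimal sketch (not your exact setup: the broker address, topic, record contents, and the AcksAllSketch class are placeholders, and the timeouts are just the values quoted above). With these settings the leader can still append a record to its local log, which is why endOffset runs ahead of highWatermark in your shrink logs, but the record is only acknowledged to the producer once every ISR member has replicated it; if the ISR drops below min.insync.replicas, the send surfaces one of the errors handled below instead of being acknowledged:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.NotEnoughReplicasAfterAppendException;
import org.apache.kafka.common.errors.NotEnoughReplicasException;
import org.apache.kafka.common.errors.TimeoutException;
import org.apache.kafka.common.serialization.StringSerializer;

public class AcksAllSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");                 // wait for the full ISR
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, "8500");  // values from your mail
        props.put(ProducerConfig.LINGER_MS_CONFIG, "10");
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "38510");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic-one", "key", "value"), (metadata, exception) -> {
                if (exception == null) {
                    // Acknowledged: every current ISR member has the record and the
                    // high watermark has advanced past it.
                    System.out.println("committed at offset " + metadata.offset());
                } else if (exception instanceof NotEnoughReplicasException) {
                    // ISR is already smaller than min.insync.replicas: the leader
                    // rejects the append outright.
                    System.err.println("not enough in-sync replicas: " + exception.getMessage());
                } else if (exception instanceof NotEnoughReplicasAfterAppendException) {
                    // The leader appended the record locally but the ISR shrank before
                    // it could be committed; the producer retries until delivery.timeout.ms.
                    System.err.println("ISR shrank after append: " + exception.getMessage());
                } else if (exception instanceof TimeoutException) {
                    // Delivery timed out, e.g. the partition leader was unreachable.
                    System.err.println("delivery timed out: " + exception.getMessage());
                } else {
                    System.err.println("send failed: " + exception);
                }
            });
        } // close() blocks until in-flight sends complete, so the callback runs before exit
    }
}

So the writes that pushed the leader's endOffset past the highWatermark should have shown up on your producers as failed sends rather than silent data loss. What acks=all cannot prevent is the availability gap you hit: once the impaired broker was the only member of the ISR, no other replica was eligible for leadership with unclean.leader.election.enable=false, and that is exactly the behavior KIP-966 changes.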