Hi Sabit,

Thanks for reporting the issue! This is the "last replica standing" problem,
which is being addressed by KIP-966 (Eligible Leader Replicas).
The blog post below explains it in detail:

https://jack-vanlightly.com/blog/2023/8/17/kafka-kip-966-fixing-the-last-replica-standing-issue
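
In short, the failure mode is exactly what you saw: the leader shrinks the ISR
down to just itself, and if that broker is then lost, no remaining replica is
eligible for clean election. One way to catch partitions in that state early is
to compare each partition's ISR size against min.insync.replicas. A minimal
sketch using the Java AdminClient (the bootstrap address, topic names, and
class name are placeholders, not from your cluster):

import java.util.Arrays;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class UnderMinIsrCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            Map<String, TopicDescription> topics =
                admin.describeTopics(Arrays.asList("my-topic-one", "my-topic-two")).all().get();
            for (TopicDescription td : topics.values()) {
                for (TopicPartitionInfo p : td.partitions()) {
                    // Fewer in-sync replicas than min.insync.replicas (2 here) means
                    // acks=all producers are blocked, and losing the remaining replica
                    // makes the partition unavailable.
                    if (p.isr().size() < 2) {
                        System.out.printf("%s-%d isr=%s leader=%s%n",
                            td.name(), p.partition(), p.isr(), p.leader());
                    }
                }
            }
        }
    }
}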

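On your question about acks=all: the leader appends records to its local log
before they are replicated (which is why they appear in its end offset), but
with acks=all a record is only acknowledged once every in-sync replica has it.
So writes that never reached the followers should have surfaced to your
producers as errors or retries rather than successful sends. A rough
producer-side sketch of the settings involved (broker address and topic are
placeholders; the exact exception seen depends on timing and version):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AcksAllExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, "8500");
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "38510");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic-one", "key", "value"),
                (metadata, exception) -> {
                    if (exception != null) {
                        // Not committed: the record was never acknowledged by the
                        // full ISR, so the failure is surfaced to the application.
                        System.err.println("send failed: " + exception);
                    } else {
                        // Committed: replicated to all in-sync replicas.
                        System.out.println("acked at offset " + metadata.offset());
                    }
                });
        }
    }
}
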
On Mon, Aug 12, 2024 at 2:03 AM Sabit Nepal <gta0...@gmail.com> wrote:

> Hello,
>
> We experienced a network partition in our Kafka cluster which left one
> broker unreachable from the other brokers, though it could still reach our
> ZooKeeper cluster. When this occurred, a number of topic-partitions shrank
> their ISR to just the impaired broker itself, halting progress on those
> partitions. As we had to take the broker instance offline and provision a
> replacement, the partitions were unavailable until the replacement instance
> came back up and resumed acting as the broker.
>
> However, reviewing our broker and producer settings, I'm not sure why the
> leader was able to accept writes that could not be replicated to the
> followers. Our topics use min.insync.replicas=2 and our producers use
> acks=all. In this scenario, with the changes not being replicated to the
> other followers, I'd expect the records to have failed to be written. We
> are, however, on an older version of Kafka (2.6.1), so I'm curious whether
> newer versions have improved the behavior here?
>
> Some relevant logs:
>
> [Partition my-topic-one-120 broker=7] Shrinking ISR from 7,8,9 to 7.
> Leader: (highWatermark: 82153383556, endOffset: 82153383565). Out of sync
> replicas: (brokerId: 8, endOffset: 82153383556) (brokerId: 9, endOffset:
> 82153383561).
> [Partition my-topic-one-120 broker=7] ISR updated to [7] and zkVersion
> updated to [1367]
> [ReplicaFetcher replicaId=9, leaderId=7, fetcherId=5] Error in response for
> fetch request (type=FetchRequest, replicaId=9, maxWait=500, minBytes=1,
> maxBytes=10485760, fetchData={my-topic-one-120=(fetchOffset=75987953095,
> logStartOffset=75983970457, maxBytes=1048576,
> currentLeaderEpoch=Optional[772]),
> my-topic-one-84=(fetchOffset=87734453342,
> logStartOffset=87730882175, maxBytes=1048576,
> currentLeaderEpoch=Optional[776]),
> my-topic-one-108=(fetchOffset=72037212609,
> logStartOffset=72034727231, maxBytes=1048576,
> currentLeaderEpoch=Optional[776]),
> my-topic-one-72=(fetchOffset=83006080094,
> logStartOffset=83002240584, maxBytes=1048576,
> currentLeaderEpoch=Optional[768]),
> my-topic-one-96=(fetchOffset=79250375295,
> logStartOffset=79246320254, maxBytes=1048576,
> currentLeaderEpoch=Optional[763])}, isolationLevel=READ_UNCOMMITTED,
> toForget=, metadata=(sessionId=965270777, epoch=725379656), rackId=)
> [Controller id=13 epoch=611] Controller 13 epoch 611 failed to change state
> for partition my-topic-one-120 from OnlinePartition to OnlinePartition
> kafka.common.StateChangeFailedException: Failed to elect leader for
> partition my-topic-one-120 under strategy
> ControlledShutdownPartitionLeaderElectionStrategy
> (later)
> kafka.common.StateChangeFailedException: Failed to elect leader for
> partition my-topic-one-120 under strategy
> OfflinePartitionLeaderElectionStrategy(false)
>
> Configuration for this topic:
>
> Topic: my-topic-one PartitionCount: 250 ReplicationFactor: 3 Configs:
> min.insync.replicas=2,segment.bytes=536870912,retention.ms=1800000,unclean.leader.election.enable=false
>
> Outside of this topic, we also had a topic with a replication factor of 5
> impacted, as well as the __consumer_offsets topic, which we set to an RF of 5.
>
> [Partition my-topic-two-204 broker=7] Shrinking ISR from 10,9,7,11,8 to 7.
> Leader: (highWatermark: 86218167, endOffset: 86218170). Out of sync
> replicas: (brokerId: 10, endOffset: 86218167) (brokerId: 9, endOffset:
> 86218167) (brokerId: 11, endOffset: 86218167) (brokerId: 8, endOffset:
> 86218167).
> Configuration:
> Topic: my-topic-two PartitionCount: 500 ReplicationFactor: 5 Configs:
> min.insync.replicas=2,segment.jitter.ms=3600000,cleanup.policy=compact,segment.bytes=1048576,max.compaction.lag.ms=9000000,min.compaction.lag.ms=4500000,unclean.leader.election.enable=false,delete.retention.ms=86400000,segment.ms=21600000
>
> [Partition __consumer_offsets-18 broker=7] Shrinking ISR from 10,9,7,11,8
> to 7. Leader: (highWatermark: 4387657484, endOffset: 4387657485). Out of
> sync replicas: (brokerId: 9, endOffset: 4387657484) (brokerId: 8,
> endOffset: 4387657484) (brokerId: 10, endOffset: 4387657484) (brokerId: 11,
> endOffset: 4387657484).
> Configuration:
> Topic: __consumer_offsets PartitionCount: 50 ReplicationFactor: 5 Configs:
> compression.type=producer,min.insync.replicas=2,cleanup.policy=compact,segment.bytes=104857600,unclean.leader.election.enable=false
>
> Other configurations:
> zookeeper.connection.timeout.ms=6000
> replica.lag.time.max.ms=8000
> zookeeper.session.timeout.ms=6000
> Producer request.timeout.ms=8500
> Producer linger.ms=10
> Producer delivery.timeout.ms=38510
>
> I saw a similar issue described in KAFKA-8702
> <https://issues.apache.org/jira/browse/KAFKA-8702>, but I did not see a
> resolution there. Any help with this would be appreciated, thank you!
>
