Hi,

I have an unusual situation: a cluster running Kafka 3.5.1 on Strimzi where 4 of the
__consumer_offsets partitions have dropped below min ISR.
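
For anyone wanting to reproduce the check, something along these lines should list the affected partitions (localhost:9092 is just a placeholder for the broker listener):

bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-min-isr-partitions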

Everything else appears to be working fine.
Upon investigating, I've found that the partition followers appear to be
out of sync with the leader in terms of leader epoch.

For example, the leader-epoch-checkpoint file on the partition leader contains:
0
4
0 0
1 4
4 6
27 10

while the followers' files contain:
0
5
0 0
1 4
4 6
5 7
6 9

This looks to me like the followers are 2 elections ahead of the leader,
and I'm not sure how they got into this situation.
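
In case it helps, my reading of the file layout (please correct me if I have this wrong) is:

0      <- format version
5      <- number of entries
0 0    <- epoch 0 started at offset 0
1 4    <- epoch 1 started at offset 4
4 6    <- epoch 4 started at offset 6
5 7    <- epoch 5 started at offset 7
6 9    <- epoch 6 started at offset 9

so the leader thinks the latest epoch is 27 (starting at offset 10) while the followers think it is 6 (starting at offset 9).
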
I've attempted to force a new leader election via the kafka-leader-election.sh tool,
but it refused for both PREFERRED and UNCLEAN election types.
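
The attempt was along these lines (bootstrap address is a placeholder for the internal listener), once per affected partition:

bin/kafka-leader-election.sh --bootstrap-server localhost:9092 \
  --election-type PREFERRED --topic __consumer_offsets --partition 18

and the same again with --election-type UNCLEAN.
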
I've also tried a manual partition reassignment to move the leader to another
broker, but it won't do it.
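
That reassignment was roughly the following; the replica list is illustrative only (the real file lists my actual broker IDs, with the intended new leader first):

cat > reassign.json <<'EOF'
{"version":1,"partitions":[
  {"topic":"__consumer_offsets","partition":18,"replicas":[2,0,1]}
]}
EOF
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassign.json --execute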

What is even stranger is that if I watch the leader-epoch-checkpoint
file on one of the followers, I can see it constantly changing as it tries
to sort itself out:
[kafka@internal-001-kafka-0 __consumer_offsets-18]$ cat leader-epoch-checkpoint
0
3
0 0
1 4
4 6
[kafka@internal-001-kafka-0 __consumer_offsets-18]$ cat leader-epoch-checkpoint
0
5
0 0
1 4
4 6
5 7
6 9

I have tried manually removing the follower's partition files on disk in an
attempt to get it to re-sync from the leader, but it keeps returning to the
inconsistent state.
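
By "remove the partition files" I mean roughly this on the follower, per affected partition, with the broker stopped and then restarted so the ReplicaFetcher recreates the replica:

rm -rf /var/lib/kafka/data-0/kafka-log2/__consumer_offsets-18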

Restarting the broker with the partition leader on it doesn't seem to move
leadership either.

The follower keeps constantly logging the following:
2024-03-19 09:23:11,169 INFO [ReplicaFetcher replicaId=2, leaderId=1,
fetcherId=0] Truncating partition __consumer_offsets-18 with
TruncationState(offset=7, completed=true) due to leader epoch and offset
EpochEndOffset(errorCode=0, partition=18, leaderEpoch=4, endOffset=10)
(kafka.server.ReplicaFetcherThread) [ReplicaFetcherThread-0-1]
2024-03-19 09:23:11,169 INFO [UnifiedLog partition=__consumer_offsets-18,
dir=/var/lib/kafka/data-0/kafka-log2] Truncating to offset 7
(kafka.log.UnifiedLog) [ReplicaFetcherThread-0-1]
2024-03-19 09:23:11,174 INFO [UnifiedLog partition=__consumer_offsets-18,
dir=/var/lib/kafka/data-0/kafka-log2] Loading producer state till offset 7
with message format version 2 (kafka.log.UnifiedLog$)
[ReplicaFetcherThread-0-1]
2024-03-19 09:23:11,174 INFO [UnifiedLog partition=__consumer_offsets-18,
dir=/var/lib/kafka/data-0/kafka-log2] Reloading from producer snapshot and
rebuilding producer state from offset 7 (kafka.log.UnifiedLog$)
[ReplicaFetcherThread-0-1]
2024-03-19 09:23:11,174 INFO [ProducerStateManager
partition=__consumer_offsets-18]Loading producer state from snapshot file
'SnapshotFile(offset=7,
file=/var/lib/kafka/data-0/kafka-log2/__consumer_offsets-18/00000000000000000007.snapshot)'
(org.apache.kafka.storage.internals.log.ProducerStateManager)
[ReplicaFetcherThread-0-1]
2024-03-19 09:23:11,175 INFO [UnifiedLog partition=__consumer_offsets-18,
dir=/var/lib/kafka/data-0/kafka-log2] Producer state recovery took 1ms for
snapshot load and 0ms for segment recovery from offset 7
(kafka.log.UnifiedLog$) [ReplicaFetcherThread-0-1]
2024-03-19 09:23:11,175 WARN [UnifiedLog partition=__consumer_offsets-18,
dir=/var/lib/kafka/data-0/kafka-log2] Non-monotonic update of high
watermark from (offset=10segment=[0:4083]) to (offset=7segment=[0:3607])
(kafka.log.UnifiedLog) [ReplicaFetcherThread-0-1]

Any ideas on how to investigate this further?
Thanks
Karl
