[
https://issues.apache.org/jira/browse/KAFKA-19593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matej Pucihar resolved KAFKA-19593.
-----------------------------------
Fix Version/s: 4.1.1
Resolution: Fixed
> Stuck __consumer_offsets partition (kafka streams app)
> ------------------------------------------------------
>
> Key: KAFKA-19593
> URL: https://issues.apache.org/jira/browse/KAFKA-19593
> Project: Kafka
> Issue Type: Bug
> Components: consumer, streams
> Affects Versions: 4.0.0
> Reporter: Matej Pucihar
> Priority: Major
> Labels: kafka-streams
> Fix For: 4.1.1
>
>
> h3. Problem Summary
> My Kafka Streams application cannot move its {{state_store}} from
> {{STARTING}} to {{RUNNING}}.
> I'm using a *Strimzi Kafka cluster* with:
> * 3 *controller nodes*
> * 4 *broker nodes*
> h3. Observations
> h4. Partition {{__consumer_offsets-35}} is *stuck*.
> From AKHQ, partition details:
> * *Broker 10* is the *leader* of {{__consumer_offsets-35}}
> * There are *no interesting logs* on broker 10
> * However, broker 11 (a *replica*) is *logging the following every 10 ms*:
> 2025-08-11 04:05:50 INFO [TxnMarkerSenderThread-11]
> TransactionMarkerRequestCompletionHandler:66
> [Transaction Marker Request Completion Handler 10]: Sending
> irm_r_sbm2_web-backend-web-1-6cad35c7-2be9-4ed2-9849-9c059cc8c409-4's
> transaction marker for partition __consumer_offsets-35 has failed with error
> org.apache.kafka.common.errors.NotLeaderOrFollowerException, retrying with
> current coordinator epoch 38
> h4. Brokers 20 and 21 (neither leaders nor replicas) are also logging the same
> error:
> *Broker 20:*
> 2025-08-11 04:39:45 INFO [TxnMarkerSenderThread-20]
> TransactionMarkerRequestCompletionHandler:66
> Sending irm_r_sbm2_web-backend-web-1-6cad35c7-2be9-4ed2-9849-9c059cc8c409-3's
> transaction marker for partition __consumer_offsets-35 has failed with error
> org.apache.kafka.common.errors.NotLeaderOrFollowerException, retrying with
> current coordinator epoch 54
>
> *Broker 21:*
> 2025-08-11 04:39:58 INFO [TxnMarkerSenderThread-21]
> TransactionMarkerRequestCompletionHandler:66
> Sending irm_r_sbm2_web-backend-web-1-6cad35c7-2be9-4ed2-9849-9c059cc8c409-2's
> transaction marker for partition __consumer_offsets-35 has failed with error
> org.apache.kafka.common.errors.NotLeaderOrFollowerException, retrying with
> current coordinator epoch 28
>
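The three broker messages above share one shape, so a small parser can help compare the transactional id, partition, error, and coordinator epoch across brokers. The following is a minimal sketch, not part of the original report: the class name `TxnMarkerLogParser` and the record `MarkerFailure` are hypothetical, and the regular expression is only assumed to match the wording quoted in this issue (raw broker log text, without the "> " quote markers).

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts fields from the "transaction marker ... has failed"
// broker log message quoted in this issue. Not part of Kafka itself.
final class TxnMarkerLogParser {

    /** Fields pulled out of one failure message. */
    record MarkerFailure(String transactionalId, String partition,
                         String error, int coordinatorEpoch) {}

    // Assumes the message wording shown in the report; line wraps are
    // collapsed to single spaces before matching.
    private static final Pattern FAILURE = Pattern.compile(
        "Sending (\\S+)'s transaction marker for partition (\\S+) "
        + "has failed with error (\\S+), "
        + "retrying with current coordinator epoch (\\d+)");

    static Optional<MarkerFailure> parse(String logText) {
        String normalized = logText.replaceAll("\\s+", " ");
        Matcher m = FAILURE.matcher(normalized);
        if (!m.find()) {
            return Optional.empty(); // not a marker-failure message
        }
        return Optional.of(new MarkerFailure(
            m.group(1), m.group(2), m.group(3), Integer.parseInt(m.group(4))));
    }

    public static void main(String[] args) {
        String sample = "Sending some-app-1's transaction marker for partition "
            + "__consumer_offsets-35 has failed with error "
            + "org.apache.kafka.common.errors.NotLeaderOrFollowerException, "
            + "retrying with current coordinator epoch 54";
        parse(sample).ifPresent(System.out::println);
    }
}
```

Grouping the parsed results by partition and epoch would show whether every broker is retrying against the same partition with a different coordinator epoch, as the excerpts above suggest (epochs 38, 54, and 28).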
> ----
> h3. Kafka Streams App Behavior
> Logs from the Kafka Streams app (at debug level) repeat continuously. The
> {{state_store}} *never transitions* from {{STARTING}} to {{RUNNING}}.
> Key repeated logs (debug log level):
> * Polling main consumer repeatedly
> * SASL/SCRAM authentication succeeds
> * 0 records fetched
> * 0 records processed
> * Punctuators run, but nothing gets committed
> * Fails to commit due to *rebalance in progress*, retrying…
> ----
> h3. Workarounds Considered
> The *only thing that temporarily resolves the issue* is:
> * Physically deleting the partition files for {{__consumer_offsets-35}} from
> both the leader and replica brokers
> Other drastic options:
> * Deleting the entire {{__consumer_offsets}} topic
> * Re-creating the entire Kafka cluster
> ----
> h3. Additional Info
> * I cannot reproduce this in a *clean git project*
> * The issue is isolated to a *"corrupt" cluster*, which is still
> available for inspection
> * This problem has occurred *4 times* in the *past month*
> * It *started happening after upgrading from Strimzi 3.9 to 4.0*
> * I'm using Quarkus (kafka-streams version 4.0.0) with the default
> configuration; the only setting worth mentioning is the
> {{exactly_once_v2}} processing guarantee.
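For readers trying to reproduce the setup, the one non-default setting mentioned above can be sketched as plain Java properties. This is an illustration only: the reporter's actual Quarkus configuration is not shown in the issue, and the class name, application id, and bootstrap address below are hypothetical placeholders. The keys are the literal Kafka Streams configuration names (`StreamsConfig` exposes the same strings as constants).

```java
import java.util.Properties;

// Hypothetical sketch of the only non-default setting named in the report.
final class ExactlyOnceStreamsConfig {

    static Properties build(String applicationId, String bootstrapServers) {
        Properties props = new Properties();
        props.put("application.id", applicationId);       // placeholder value supplied by caller
        props.put("bootstrap.servers", bootstrapServers); // placeholder value supplied by caller
        // Transactional processing: offset commits go through transactions,
        // so transaction markers are written to __consumer_offsets partitions.
        props.put("processing.guarantee", "exactly_once_v2");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("example-app", "localhost:9092");
        props.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Because this guarantee makes the application commit offsets transactionally, it is plausibly why the transaction marker errors in the broker logs target a `__consumer_offsets` partition.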
> ----
> h3. Help Needed
> I'm hoping someone can *make sense of this issue*.
> Please feel free to *reach out.*
--
This message was sent by Atlassian Jira
(v8.20.10#820010)