Matej Pucihar created KAFKA-19593:
-------------------------------------
Summary: Stuck __consumer_offsets partition (kafka streams app)
Key: KAFKA-19593
URL: https://issues.apache.org/jira/browse/KAFKA-19593
Project: Kafka
Issue Type: Bug
Components: consumer, streams
Affects Versions: 4.0.0
Reporter: Matej Pucihar
h3. Problem Summary
My Kafka Streams application cannot move its {{state_store}} from {{STARTING}}
to {{RUNNING}}.
I'm using a *Strimzi Kafka cluster* with:
* 3 *controller nodes*
* 4 *broker nodes*
h3. Observations
h4. Partition {{__consumer_offsets-35}} is *stuck*.
From AKHQ, partition details:
* *Broker 10* is the *leader* of {{__consumer_offsets-35}}
* There are *no interesting logs* on broker 10
* However, broker 11 (a *replica*) logs the following message *every 10 ms*:
2025-08-11 04:05:50 INFO [TxnMarkerSenderThread-11]
TransactionMarkerRequestCompletionHandler:66
[Transaction Marker Request Completion Handler 10]: Sending
irm_r_sbm2_web-backend-web-1-6cad35c7-2be9-4ed2-9849-9c059cc8c409-4's
transaction marker for partition __consumer_offsets-35 has failed with error
org.apache.kafka.common.errors.NotLeaderOrFollowerException, retrying with
current coordinator epoch 38
h4. Brokers 20 and 21 (neither leaders nor replicas) are also spamming the same
error:
*Broker 20:*
2025-08-11 04:39:45 INFO [TxnMarkerSenderThread-20]
TransactionMarkerRequestCompletionHandler:66
Sending irm_r_sbm2_web-backend-web-1-6cad35c7-2be9-4ed2-9849-9c059cc8c409-3's
transaction marker for partition __consumer_offsets-35 has failed with error
org.apache.kafka.common.errors.NotLeaderOrFollowerException, retrying with
current coordinator epoch 54
*Broker 21:*
2025-08-11 04:39:58 INFO [TxnMarkerSenderThread-21]
TransactionMarkerRequestCompletionHandler:66
Sending irm_r_sbm2_web-backend-web-1-6cad35c7-2be9-4ed2-9849-9c059cc8c409-2's
transaction marker for partition __consumer_offsets-35 has failed with error
org.apache.kafka.common.errors.NotLeaderOrFollowerException, retrying with
current coordinator epoch 28
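To compare the excerpts above across brokers, the wrapped log records can be parsed into fields and tallied. The following is only a sketch in Python: the regular expression is derived from the three excerpts quoted here, the function and field names are made up for illustration, and reading the {{TxnMarkerSenderThread}} suffix as the ID of the logging broker is an assumption.

```python
import re
from collections import Counter

# Pattern derived from the three excerpts quoted in this report (an assumption:
# other Kafka versions may word the message differently).
MARKER_FAILURE = re.compile(
    r"\[TxnMarkerSenderThread-(?P<broker>\d+)\].*?"
    r"Sending (?P<txn_id>\S+)'s transaction marker for partition "
    r"(?P<partition>\S+) has failed with error (?P<error>[\w.]+), "
    r"retrying with current coordinator epoch (?P<epoch>\d+)"
)


def parse_marker_failure(record):
    """Return the fields of one failure record, or None if it does not match."""
    # Mail clients wrap long log lines, so collapse all whitespace first.
    flat = " ".join(record.split())
    match = MARKER_FAILURE.search(flat)
    if match is None:
        return None
    fields = match.groupdict()
    # "broker" is the thread-name suffix, assumed to be the logging broker's ID.
    fields["broker"] = int(fields["broker"])
    fields["epoch"] = int(fields["epoch"])
    return fields


def summarize(records):
    """Count failures per (broker, partition, error, coordinator epoch)."""
    counts = Counter()
    for record in records:
        fields = parse_marker_failure(record)
        if fields is not None:
            key = (fields["broker"], fields["partition"],
                   fields["error"], fields["epoch"])
            counts[key] += 1
    return counts
```

Feeding each broker's log records to {{summarize}} would show at a glance whether every broker retries against the same partition and whether the coordinator epoch it reports ever changes.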
----
h3. Kafka Streams App Behavior
Logs from the Kafka Streams app (at debug level) repeat continuously. The
{{state_store}} *never transitions* from {{STARTING}} to {{RUNNING}}.
Key repeated logs (debug log level):
* Polling main consumer repeatedly
* SASL/SCRAM authentication succeeds
* 0 records fetched
* 0 records processed
* Punctuators run, but nothing gets committed
* Fails to commit due to *rebalance in progress*, retrying…
----
h3. Workarounds Considered
The *only thing that temporarily resolves the issue* is:
* Physically deleting the partition files for {{__consumer_offsets-35}} from
both the leader and replica brokers
Other drastic options:
* Deleting the entire {{__consumer_offsets}} topic
* Re-creating the entire Kafka cluster
----
h3. Additional Info
* I cannot reproduce this in a *clean git project*
* The issue is isolated to a *"corrupt" cluster*, which is still available
for inspection
* This problem has occurred *4 times* in the *past month*
* It *started happening after upgrading from Strimzi 3.9 to 4.0*
* I'm using Quarkus (kafka-streams version 4.0.0) with the default
configuration. The only setting worth mentioning is the
{{exactly_once_v2}} processing guarantee.
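
Because the app runs with {{exactly_once_v2}}, the transactional IDs in the broker logs can be related back to the app. The sketch below assumes Kafka Streams names its EOS v2 producers {{<application.id>-<process UUID>-<thread index>}}; that format is my understanding of the client and is not stated anywhere in this report, and the function name is hypothetical.

```python
import re

# Assumed layout: <application.id>-<process UUID>-<stream thread index>
TXN_ID = re.compile(
    r"^(?P<application_id>.+)-"
    r"(?P<process_id>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})-"
    r"(?P<thread_index>\d+)$"
)


def split_transactional_id(txn_id):
    """Split a transactional ID into (application_id, process_id, thread_index).

    Returns None when the ID does not follow the assumed layout.
    """
    match = TXN_ID.match(txn_id)
    if match is None:
        return None
    return (
        match["application_id"],
        match["process_id"],
        int(match["thread_index"]),
    )
```

Under that assumption, the three IDs logged by brokers 11, 20 and 21 share one application ID and one process UUID and differ only in the thread index (4, 3 and 2), which would mean they all belong to stream threads of a single app instance.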
----
h3. Help Needed
I'm hoping someone can *make sense of this issue*.
Please feel free to *reach out.*
--
This message was sent by Atlassian Jira
(v8.20.10#820010)