[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969090#comment-16969090 ]

Tim Van Laer edited comment on KAFKA-8803 at 11/7/19 9:56 AM:
--------------------------------------------------------------

I ran into the same issue. 

One stream instance (the one dealing with partition 52) kept failing with:
{code}
org.apache.kafka.streams.errors.StreamsException: Exception caught in process. 
taskId=0_52, processor=KSTREAM-SOURCE-0000000000, topic=xyz.entries-internal.0, 
partition=52, offset=5151450, 
stacktrace=org.apache.kafka.common.errors.TimeoutException: Timeout expired 
after 60000milliseconds while awaiting InitProducerId

  at 
org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:380)
 ~[timeline-aligner.jar:?]
  at 
org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:199)
 ~[timeline-aligner.jar:?]
  at 
org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:425)
 ~[timeline-aligner.jar:?]
  at 
org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:912)
 ~[timeline-aligner.jar:?]
  at 
org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:819)
 ~[timeline-aligner.jar:?]
  at 
org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:788)
 ~[timeline-aligner.jar:?]
Caused by: org.apache.kafka.common.errors.TimeoutException: Timeout expired 
after 60000milliseconds while awaiting InitProducerId
{code}
It was automatically restarted every time, but it kept failing (even after 
stopping the whole group).

Yesterday, two brokers threw an UNKNOWN_LEADER_EPOCH error, and after that the 
client started running into trouble:
{code}
[2019-11-06 11:53:42,499] INFO [ReplicaFetcher replicaId=0, leaderId=2, 
fetcherId=3] Retrying leaderEpoch request for partition 
xyz.entries-internal.0-52 as the leader reported an error: UNKNOWN_LEADER_EPOCH 
(kafka.server.ReplicaFetcherThread)
{code}
{code}
[2019-11-06 10:06:56,652] INFO [ReplicaFetcher replicaId=1, leaderId=2, 
fetcherId=3] Retrying leaderEpoch request for partition 
xyz.entries-internal.0-52 as the leader reported an error: UNKNOWN_LEADER_EPOCH 
(kafka.server.ReplicaFetcherThread)
{code}

Meta:
* Kafka Streams 2.3.1
* Broker: 2.3.1, patched to exclude the KAFKA-8724 change (see KAFKA-9133)

I will give {{max.block.ms}} a shot, but we're first trying a rolling restart 
of the brokers.
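
For reference, this is roughly how we would override it, assuming the stock 
{{StreamsConfig.producerPrefix}} route and that exactly-once processing is what 
makes Streams wait on InitProducerId; the application id, bootstrap servers and 
the 180000 ms value below are only placeholders, not our real settings:
{code}
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsTimeoutConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");   // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");  // placeholder
        // Exactly-once is what makes each task initialize a transactional
        // producer and call InitProducerId on startup.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
        // Forward max.block.ms to the internally created producers; the default
        // is 60000 ms, the same value seen in the TimeoutException above.
        props.put(StreamsConfig.producerPrefix(ProducerConfig.MAX_BLOCK_MS_CONFIG), 180000);
        return props;
    }
}
{code}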


was (Author: timvanlaer):
I ran into the same issue. 

One stream instance (the one dealing with partition 52) kept failing with:
{code}
org.apache.kafka.streams.errors.StreamsException: Exception caught in process. 
taskId=0_52, processor=KSTREAM-SOURCE-0000000000, 
topic=galactica.timeline-aligner.entries-internal.0, partition=52, 
offset=5151450, stacktrace=org.apache.kafka.common.errors.TimeoutException: 
Timeout expired after 60000milliseconds while awaiting InitProducerId

  at 
org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:380)
 ~[timeline-aligner.jar:?]
  at 
org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:199)
 ~[timeline-aligner.jar:?]
  at 
org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:425)
 ~[timeline-aligner.jar:?]
  at 
org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:912)
 ~[timeline-aligner.jar:?]
  at 
org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:819)
 ~[timeline-aligner.jar:?]
  at 
org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:788)
 ~[timeline-aligner.jar:?]
Caused by: org.apache.kafka.common.errors.TimeoutException: Timeout expired 
after 60000milliseconds while awaiting InitProducerId
{code}
It was automatically restarted every time, but it kept failing (even after 
stopping the whole group).

Yesterday, two brokers threw an UNKNOWN_LEADER_EPOCH error, and after that the 
client started running into trouble:
{code}
[2019-11-06 11:53:42,499] INFO [ReplicaFetcher replicaId=0, leaderId=2, 
fetcherId=3] Retrying leaderEpoch request for partition 
xyz.entries-internal.0-52 as the leader reported an error: UNKNOWN_LEADER_EPOCH 
(kafka.server.ReplicaFetcherThread)
{code}
{code}
[2019-11-06 10:06:56,652] INFO [ReplicaFetcher replicaId=1, leaderId=2, 
fetcherId=3] Retrying leaderEpoch request for partition 
xyz.entries-internal.0-52 as the leader reported an error: UNKNOWN_LEADER_EPOCH 
(kafka.server.ReplicaFetcherThread)
{code}

Meta:
* Kafka Streams 2.3.1
* Broker: 2.3.1, patched to exclude the KAFKA-8724 change (see KAFKA-9133)

I will give {{max.block.ms}} a shot, but we're first trying a rolling restart 
of the brokers.

> Stream will not start due to TimeoutException: Timeout expired after 
> 60000milliseconds while awaiting InitProducerId
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-8803
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8803
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>            Reporter: Raman Gupta
>            Priority: Major
>         Attachments: logs.txt.gz, screenshot-1.png
>
>
> One streams app is consistently failing at startup with the following 
> exception:
> {code}
> 2019-08-14 17:02:29,568 ERROR --- [2ce1b-StreamThread-2] 
> org.apa.kaf.str.pro.int.StreamTask                : task [0_36] Timeout 
> exception caught when initializing transactions for task 0_36. This might 
> happen if the broker is slow to respond, if the network connection to the 
> broker was interrupted, or if similar circumstances arise. You can increase 
> producer parameter `max.block.ms` to increase this timeout.
> org.apache.kafka.common.errors.TimeoutException: Timeout expired after 
> 60000milliseconds while awaiting InitProducerId
> {code}
> These same brokers are used by many other streams without any issue, 
> including some running in the very same processes as the stream that 
> consistently throws this exception.
> *UPDATE 08/16:*
> The very first instance of this error was August 13th 2019, 17:03:36.754, and 
> it happened for 4 different streams. For 3 of these streams, the error only 
> happened once, and then the stream recovered. For the 4th stream, the error 
> has continued to happen, and continues to happen now.
> I looked up the broker logs for this time, and saw that at August 13th 2019, 
> 16:47:43, two of four brokers started reporting messages like this, for 
> multiple partitions:
> {code}
> [2019-08-13 20:47:43,658] INFO [ReplicaFetcher replicaId=3, leaderId=1, 
> fetcherId=0] Retrying leaderEpoch request for partition xxx-1 as the leader 
> reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
> {code}
> The UNKNOWN_LEADER_EPOCH messages continued for some time, and then stopped, 
> here is a view of the count of these messages over time:
>  !screenshot-1.png! 
> However, as noted, the stream task timeout error continues to happen.
> I use the static consumer group protocol with Kafka 2.3.0 clients and 2.3.0 
> broker. The broker has a patch for KAFKA-8773.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
