[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Raman Gupta updated KAFKA-8803:
-------------------------------
    Description:

One streams app is consistently failing at startup with the following exception:

{code}
2019-08-14 17:02:29,568 ERROR --- [2ce1b-StreamThread-2] org.apa.kaf.str.pro.int.StreamTask : task [0_36] Timeout exception caught when initializing transactions for task 0_36. This might happen if the broker is slow to respond, if the network connection to the broker was interrupted, or if similar circumstances arise. You can increase producer parameter `max.block.ms` to increase this timeout.
org.apache.kafka.common.errors.TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
{code}

The same brokers are used by many other streams without any issue, including some in the very same processes as the stream that consistently throws this exception.

*UPDATE 08/16:*

The very first instance of this error was August 13th 2019, 17:03:36.754, and it happened for 4 different streams. For 3 of those streams the error happened only once and the stream then recovered. For the 4th stream the error has continued to happen, and continues to happen now.

I looked up the broker logs for this time and see that at August 13th 2019, 16:47:43, two of the four brokers started reporting messages like the following, for multiple partitions:

{code}
[2019-08-13 20:47:43,658] INFO [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Retrying leaderEpoch request for partition xxx-1 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
{code}

One of these brokers reported only 2 such messages; the other reported many thousands. The UNKNOWN_LEADER_EPOCH messages continued for some time and then stopped. Here is a view of the count of these messages over time:

!screenshot-1.png!

I use static consumer group membership with Kafka 2.3.0 clients and 2.3.0 brokers. The brokers have a patch for KAFKA-8773.
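For reference, a minimal sketch of the workaround the error message itself suggests: raising `max.block.ms` on the producers Kafka Streams creates internally. This is only a sketch, not our actual config; the application id, bootstrap server, and the 180000 ms value are placeholders, not recommendations:

{code}
import java.util.Properties;

import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsTimeoutWorkaround {
    public static Properties buildConfig() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");   // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");  // placeholder
        // Streams forwards "producer."-prefixed settings to its internal producers;
        // max.block.ms defaults to 60000 ms, which matches the "60000milliseconds"
        // in the exception above.
        props.put(StreamsConfig.producerPrefix(ProducerConfig.MAX_BLOCK_MS_CONFIG), "180000");
        return props;
    }
}
{code}

Note this only lengthens how long InitProducerId is awaited; it does not address whatever is making the broker slow to answer in the first place.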
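Since static membership may be relevant to reproducing this, here is a sketch of how it is typically enabled in a Streams app (continuing the same Properties as above). The instance id is a placeholder and must be unique per application instance; the session timeout value is likewise just an example:

{code}
import org.apache.kafka.clients.consumer.ConsumerConfig;

// Static membership (KIP-345, new in 2.3): give each instance a stable id so
// restarts within session.timeout.ms do not trigger a rebalance.
props.put(StreamsConfig.consumerPrefix(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG), "app-instance-1"); // placeholder
props.put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), "30000");
{code}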
> Stream will not start due to TimeoutException: Timeout expired after
> 60000milliseconds while awaiting InitProducerId
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-8803
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8803
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Raman Gupta
>            Priority: Major
>         Attachments: screenshot-1.png