[ https://issues.apache.org/jira/browse/KAFKA-8477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16859041#comment-16859041 ]
Mike Mintz commented on KAFKA-8477: ----------------------------------- Hi [~mumrah], unfortunately I'm unable to reproduce this again today for some reason. And the last time I reproduced this, I only had INFO-level logging in the brokers, and there was nothing in the logs other than some logging from CruiseControlMetricsReporter and TransactionCoordinator. It's possible it depends on some traffic pattern to reproduce, so I will try again early next week to see if I can get it to happen again, and hopefully find something interesting in the TRACE logs for the brokers. > Cannot consume and request metadata for __consumer_offsets topic in Kafka 2.2 > ----------------------------------------------------------------------------- > > Key: KAFKA-8477 > URL: https://issues.apache.org/jira/browse/KAFKA-8477 > Project: Kafka > Issue Type: Bug > Components: clients > Affects Versions: 2.2.0 > Reporter: Mike Mintz > Assignee: David Arthur > Priority: Major > Attachments: kafka-2.2.0-consumer-offset-metadata-bug-master.zip, > logs.txt > > > We have an application that consumes from the __consumer_offsets topic to > report lag metrics. When we upgraded its kafka-clients dependency from 2.0.1 > to 2.2.0, it crashed with: > {noformat} > Exception in thread "main" org.apache.kafka.common.errors.TimeoutException: > Failed to get offsets by times in 30001ms > {noformat} > I created a minimal reproduction at > [https://github.com/mikemintz/kafka-2.2.0-consumer-offset-metadata-bug] and > I'm uploading a zip of this code for posterity. > In particular, the behavior happens when I call KafkaConsumer.assign(), then > poll(), then endOffsets(). This behavior only happens for the > __consumer_offsets topic. It also only happens on the Kafka cluster that we > run in production, which runs Kafka 2.2.0. The error does not occur on a > freshly created Kafka cluster, and I can't get it to reproduce with > EmbeddedKafka. > It works fine with both Kafka 2.0.1 and Kafka 2.1.1, so something broke > between 2.1.1. and 2.2.0. Based on the 2.2.0 changelog and the client log > messages (attached), it looks like it may have been introduced in KAFKA-7738 > (cc [~mumrah]). It gets in a loop, repeating the following block of log > messages: > {noformat} > 2019-06-03 23:24:15 DEBUG NetworkClient:1073 - [Consumer > clientId=test.mikemintz.lag-tracker-reproduce, > groupId=test.mikemintz.lag-tracker-reproduce] Sending metadata request > (type=MetadataRequest, topics=__consumer_offsets) to node REDACTED:9094 (id: > 2134 rack: us-west-2b) > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5862 > to 5862 for partition __consumer_offsets-0 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6040 > to 6040 for partition __consumer_offsets-10 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6008 > to 6008 for partition __consumer_offsets-20 > 2019-06-03 23:24:15 DEBUG Metadata:208 - Not replacing existing epoch 6153 > with new epoch 6152 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5652 > to 5652 for partition __consumer_offsets-30 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6081 > to 6081 for partition __consumer_offsets-39 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5629 > to 5629 for partition __consumer_offsets-9 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5983 > to 5983 for partition __consumer_offsets-11 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5896 > to 5896 for partition __consumer_offsets-31 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5278 > to 5278 for partition __consumer_offsets-13 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6026 > to 6026 for partition __consumer_offsets-18 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5608 > to 5608 for partition __consumer_offsets-22 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6025 > to 6025 for partition __consumer_offsets-32 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5685 > to 5685 for partition __consumer_offsets-8 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5730 > to 5730 for partition __consumer_offsets-43 > 2019-06-03 23:24:15 DEBUG Metadata:208 - Not replacing existing epoch 5957 > with new epoch 5956 > 2019-06-03 23:24:15 DEBUG Metadata:208 - Not replacing existing epoch 6047 > with new epoch 6046 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6090 > to 6090 for partition __consumer_offsets-1 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5821 > to 5821 for partition __consumer_offsets-6 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5909 > to 5909 for partition __consumer_offsets-41 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5805 > to 5805 for partition __consumer_offsets-27 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5670 > to 5670 for partition __consumer_offsets-48 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6220 > to 6220 for partition __consumer_offsets-5 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5596 > to 5596 for partition __consumer_offsets-15 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5896 > to 5896 for partition __consumer_offsets-35 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5858 > to 5858 for partition __consumer_offsets-25 > 2019-06-03 23:24:15 DEBUG Metadata:208 - Not replacing existing epoch 5927 > with new epoch 5926 > 2019-06-03 23:24:15 DEBUG Metadata:208 - Not replacing existing epoch 5938 > with new epoch 5937 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5792 > to 5792 for partition __consumer_offsets-36 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5984 > to 5984 for partition __consumer_offsets-44 > 2019-06-03 23:24:15 DEBUG Metadata:208 - Not replacing existing epoch 5932 > with new epoch 5931 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5979 > to 5979 for partition __consumer_offsets-16 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5804 > to 5804 for partition __consumer_offsets-17 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6179 > to 6179 for partition __consumer_offsets-3 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5953 > to 5953 for partition __consumer_offsets-45 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6276 > to 6276 for partition __consumer_offsets-24 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6119 > to 6119 for partition __consumer_offsets-38 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5784 > to 5784 for partition __consumer_offsets-33 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5665 > to 5665 for partition __consumer_offsets-23 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5812 > to 5812 for partition __consumer_offsets-28 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6014 > to 6014 for partition __consumer_offsets-2 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6199 > to 6199 for partition __consumer_offsets-12 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5908 > to 5908 for partition __consumer_offsets-19 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5697 > to 5697 for partition __consumer_offsets-14 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5673 > to 5673 for partition __consumer_offsets-4 > 2019-06-03 23:24:15 DEBUG Metadata:208 - Not replacing existing epoch 5921 > with new epoch 5920 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5940 > to 5940 for partition __consumer_offsets-49 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5836 > to 5836 for partition __consumer_offsets-42 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5827 > to 5827 for partition __consumer_offsets-7 > 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5670 > to 5670 for partition __consumer_offsets-21 > 2019-06-03 23:24:15 DEBUG Metadata:370 - Updated cluster metadata version 293 > to MetadataCache{cluster=Cluster(id = 5Cs2i3WbTzKHbrF2GzLwaQ, nodes = > [REDACTED:9094 (id: 2132 rack: us-west-2a), REDACTED:9094 (id: 2133 rack: > us-west-2a), REDACTED:9094 (id: 2131 rack: us-west-2c), REDACTED:9094 (id: > 2136 rack: us-west-2b), REDACTED:9094 (id: 2135 rack: us-west-2c), > REDACTED:9094 (id: 2134 rack: us-west-2b), REDACTED:9094 (id: 2137 rack: > us-west-2c)], partitions = [Partition(topic = __consumer_offsets, partition = > 13, leader = 2136, replicas = [2136,2131,2132], isr = [2132,2136,2131], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 46, > leader = 2134, replicas = [2134,2133,2135], isr = [2135,2134,2133], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 9, > leader = 2132, replicas = [2132,2134,2131], isr = [2132,2134,2131], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 42, > leader = 2133, replicas = [2133,2136,2135], isr = [2135,2136,2133], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 21, > leader = 2136, replicas = [2136,2131,2133], isr = [2131,2136,2133], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 17, > leader = 2133, replicas = [2133,2136,2131], isr = [2131,2136,2133], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 30, > leader = 2136, replicas = [2136,2133,2131], isr = [2136,2133,2131], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 26, > leader = 2132, replicas = [2132,2134,2135], isr = [2135,2132,2134], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 5, > leader = 2136, replicas = [2136,2131,2132], isr = [2132,2131,2136], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 38, > leader = 2134, replicas = [2134,2133,2135], isr = [2135,2134,2133], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 1, > leader = 2132, replicas = [2132,2134,2131], isr = [2132,2134,2131], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 34, > leader = 2132, replicas = [2132,2136,2135], isr = [2135,2132,2136], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 16, > leader = 2134, replicas = [2134,2132,2135], isr = [2132,2134,2135], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 45, > leader = 2133, replicas = [2133,2136,2135], isr = [2135,2136,2133], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 12, > leader = 2133, replicas = [2133,2134,2131], isr = [2134,2131,2133], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 41, > leader = 2136, replicas = [2136,2133,2135], isr = [2135,2136,2133], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 24, > leader = 2136, replicas = [2136,2133,2131], isr = [2131,2136,2133], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 20, > leader = 2132, replicas = [2132,2136,2135], isr = [2132,2136,2135], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 49, > leader = 2136, replicas = [2136,2132,2135], isr = [2135,2132,2136], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 0, > leader = 2134, replicas = [2134,2132,2135], isr = [2132,2134,2135], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 29, > leader = 2134, replicas = [2134,2132,2135], isr = [2135,2132,2134], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 25, > leader = 2134, replicas = [2134,2135,2133], isr = [2134,2133,2135], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 8, > leader = 2133, replicas = [2133,2136,2131], isr = [2131,2136,2133], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 37, > leader = 2133, replicas = [2133,2136,2135], isr = [2135,2136,2133], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 4, > leader = 2133, replicas = [2133,2134,2131], isr = [2134,2131,2133], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 33, > leader = 2136, replicas = [2136,2133,2135], isr = [2135,2136,2133], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 15, > leader = 2132, replicas = [2132,2136,2135], isr = [2132,2136,2135], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 48, > leader = 2132, replicas = [2132,2134,2135], isr = [2135,2132,2134], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 11, > leader = 2132, replicas = [2132,2135,2136], isr = [2132,2136,2135], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 44, > leader = 2133, replicas = [2133,2134,2135], isr = [2135,2134,2133], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 23, > leader = 2132, replicas = [2132,2136,2135], isr = [2132,2136,2135], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 19, > leader = 2136, replicas = [2136,2133,2131], isr = [2131,2136,2133], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 32, > leader = 2136, replicas = [2136,2131,2133], isr = [2131,2136,2133], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 28, > leader = 2136, replicas = [2136,2133,2135], isr = [2135,2136,2133], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 7, > leader = 2136, replicas = [2136,2132,2131], isr = [2132,2136,2131], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 40, > leader = 2134, replicas = [2134,2132,2135], isr = [2135,2132,2134], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 3, > leader = 2136, replicas = [2136,2133,2135], isr = [2136,2133,2135], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 36, > leader = 2134, replicas = [2134,2133,2135], isr = [2135,2134,2133], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 47, > leader = 2132, replicas = [2132,2134,2135], isr = [2135,2132,2134], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 14, > leader = 2134, replicas = [2134,2135,2132], isr = [2132,2134,2135], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 43, > leader = 2134, replicas = [2134,2132,2135], isr = [2135,2132,2134], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 10, > leader = 2132, replicas = [2132,2136,2135], isr = [2132,2136,2135], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 22, > leader = 2134, replicas = [2134,2135,2133], isr = [2134,2133,2135], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 18, > leader = 2136, replicas = [2136,2131,2132], isr = [2132,2131,2136], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 31, > leader = 2133, replicas = [2133,2134,2135], isr = [2135,2134,2133], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 27, > leader = 2134, replicas = [2134,2133,2131], isr = [2134,2131,2133], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 39, > leader = 2134, replicas = [2134,2132,2135], isr = [2135,2132,2134], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 6, > leader = 2133, replicas = [2133,2134,2131], isr = [2134,2131,2133], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 35, > leader = 2134, replicas = [2134,2132,2135], isr = [2135,2132,2134], > offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 2, > leader = 2134, replicas = [2134,2135,2132], isr = [2132,2134,2135], > offlineReplicas = [])], controller = REDACTED:9094 (id: 2132 rack: > us-west-2a))} > 2019-06-03 23:24:15 DEBUG Fetcher:700 - [Consumer > clientId=test.mikemintz.lag-tracker-reproduce, > groupId=test.mikemintz.lag-tracker-reproduce] Leader for partition > __consumer_offsets-46 is unknown for fetching offset > 2019-06-03 23:24:15 DEBUG Fetcher:700 - [Consumer > clientId=test.mikemintz.lag-tracker-reproduce, > groupId=test.mikemintz.lag-tracker-reproduce] Leader for partition > __consumer_offsets-26 is unknown for fetching offset > 2019-06-03 23:24:15 DEBUG Fetcher:700 - [Consumer > clientId=test.mikemintz.lag-tracker-reproduce, > groupId=test.mikemintz.lag-tracker-reproduce] Leader for partition > __consumer_offsets-34 is unknown for fetching offset > 2019-06-03 23:24:15 DEBUG Fetcher:700 - [Consumer > clientId=test.mikemintz.lag-tracker-reproduce, > groupId=test.mikemintz.lag-tracker-reproduce] Leader for partition > __consumer_offsets-29 is unknown for fetching offset > 2019-06-03 23:24:15 DEBUG Fetcher:700 - [Consumer > clientId=test.mikemintz.lag-tracker-reproduce, > groupId=test.mikemintz.lag-tracker-reproduce] Leader for partition > __consumer_offsets-37 is unknown for fetching offset > 2019-06-03 23:24:15 DEBUG Fetcher:700 - [Consumer > clientId=test.mikemintz.lag-tracker-reproduce, > groupId=test.mikemintz.lag-tracker-reproduce] Leader for partition > __consumer_offsets-40 is unknown for fetching offset > 2019-06-03 23:24:15 DEBUG Fetcher:700 - [Consumer > clientId=test.mikemintz.lag-tracker-reproduce, > groupId=test.mikemintz.lag-tracker-reproduce] Leader for partition > __consumer_offsets-47 is unknown for fetching offset > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)