Joseph Aliase created KAFKA-7457: ------------------------------------ Summary: AbstractCoordinator struck in discover Key: KAFKA-7457 URL: https://issues.apache.org/jira/browse/KAFKA-7457 Project: Kafka Issue Type: Bug Components: clients Affects Versions: 0.10.1.1 Environment: Linux Reporter: Joseph Aliase
AbstractCoordinator in kafka-client is stuck in discover and never rejoins the group. Post restart application is able to join the consumer group and consume from the topic. We see below logs every 10 minute. The sequence of events are: a) NetworkClient complains that connection is idle and closes the connection. b) Consumer client tries to determine co-ordinator by sending request to Node 2. c) Node 2 responds by saying Node 3 is group co-ordinator. d) Consumer client connects to group co-ordinator. e) There is radio silence for 10 minutes and the sequence gets repeated. 2018-09-28 16:35:59.771 TRACE org.apache.kafka.common.network.Selector [pool-4-thread-50] [active] [wc] About to close the idle connection from 2147483644 due to being idle for 540140 millis 2018-09-28 16:35:59.771 DEBUG org.apache.kafka.clients.NetworkClient [pool-4-thread-50] [active] [wc] Node 2147483644 disconnected. 2018-09-28 16:35:59.771 INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator [pool-4-thread-50] [active] [wc] Marking the coordinator kafka03-wc.net:9092 (id: 2147483644 rack: null) dead for group test 2018-09-28 16:35:59.771 DEBUG org.apache.kafka.clients.consumer.internals.AbstractCoordinator [pool-4-thread-50] [active] [wc] Sending coordinator request for group test to broker kafka02-wc.net:9092 (id: 2 rack: null) 2018-09-28 16:35:59.771 DEBUG org.apache.kafka.clients.NetworkClient [pool-4-thread-50] [active] [wc] Sending metadata request \{topics=[address]} to node 2 2018-09-28 16:35:59.796 DEBUG org.apache.kafka.clients.Metadata [pool-4-thread-50] [active] [wc] Updated cluster metadata version 401 to Cluster(id = oC0BPXfOT42WqN7-v6b5Gw, nodes = [kafka02-wc.net:9092 (id: 2 rack: null), kafka03-wc.net:9092 (id: 3 rack: null), kafka05-wc.net:9092 (id: 5 rack: null), kafka01-wc.net:9092 (id: 1 rack: null), kafka04-wc.net:9092 (id: 4 rack: null)], partitions = [Partition(topic = address, partition = 2, leader = 1, replicas = [1,5,4,], isr = [5,1,4,]), Partition(topic = address, partition = 1, leader = 5, replicas = [5,3,4,], isr = [5,3,4,]), Partition(topic = address, partition = 0, leader = 4, replicas = [2,3,4,], isr = [4,3,2,]), Partition(topic = address, partition = 6, leader = 5, replicas = [1,5,4,], isr = [5,1,4,]), Partition(topic = address, partition = 5, leader = 4, replicas = [5,3,4,], isr = [5,4,3,]), Partition(topic = address, partition = 4, leader = 3, replicas = [1,2,3,], isr = [1,3,2,]), Partition(topic = address, partition = 3, leader = 2, replicas = [1,5,2,], isr = [5,1,2,]), Partition(topic = address, partition = 16, leader = 5, replicas = [5,2,3,], isr = [5,3,2,]), Partition(topic = address, partition = 15, leader = 4, replicas = [1,2,4,], isr = [1,4,2,]), Partition(topic = address, partition = 10, leader = 4, replicas = [1,5,4,], isr = [5,1,4,]), Partition(topic = address, partition = 9, leader = 3, replicas = [2,3,4,], isr = [3,4,2,]), Partition(topic = address, partition = 8, leader = 2, replicas = [1,2,3,], isr = [1,3,2,]), Partition(topic = address, partition = 7, leader = 1, replicas = [1,5,2,], isr = [5,1,2,]), Partition(topic = address, partition = 14, leader = 3, replicas = [5,3,4,], isr = [5,4,3,]), Partition(topic = address, partition = 13, leader = 2, replicas = [2,3,4,], isr = [3,4,2,]), Partition(topic = address, partition = 12, leader = 1, replicas = [1,2,3,], isr = [1,3,2,]), Partition(topic = address, partition = 11, leader = 5, replicas = [1,5,2,], isr = [5,1,2,])]) 2018-09-28 16:35:59.797 DEBUG org.apache.kafka.clients.consumer.internals.AbstractCoordinator [pool-4-thread-50] [active] [wc] Received group coordinator response ClientResponse(receivedTimeMs=1538152559797, disconnected=false, request=ClientRequest(expectResponse=true, callback=org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler@2531ff5f, request=RequestSend(header=\{api_key=10,api_version=0,correlation_id=8354,client_id=active-wc-1-256}, body=\{group_id=test}), createdTimeMs=1538152559771, sendTimeMs=1538152559771), responseBody=\{error_code=0,coordinator={node_id=3,host=kafka03-wc.net,port=9092}}) 2018-09-28 16:35:59.797 INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator [pool-4-thread-50] [active] [wc] Discovered coordinator kafka03-wc.net:9092 (id: 2147483644 rack: null) for group test. 2018-09-28 16:35:59.797 INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator [pool-4-thread-50] [active] [wc] Marking the coordinator kafka03-wc.net:9092 (id: 2147483644 rack: null) dead for group test 2018-09-28 16:36:09.797 DEBUG org.apache.kafka.clients.consumer.internals.AbstractCoordinator [pool-4-thread-50] [active] [wc] Sending coordinator request for group test to broker kafka02-wc.net:9092 (id: 2 rack: null) 2018-09-28 16:36:09.822 DEBUG org.apache.kafka.clients.consumer.internals.AbstractCoordinator [pool-4-thread-50] [active] [wc] Received group coordinator response ClientResponse(receivedTimeMs=1538152569822, disconnected=false, request=ClientRequest(expectResponse=true, callback=org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler@773b5447, request=RequestSend(header=\{api_key=10,api_version=0,correlation_id=8356,client_id=active-wc-1-256}, body=\{group_id=test}), createdTimeMs=1538152569797, sendTimeMs=1538152569797), responseBody=\{error_code=0,coordinator={node_id=3,host=kafka03-wc.net,port=9092}}) 2018-09-28 16:36:09.822 INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator [pool-4-thread-50] [active] [wc] Discovered coordinator kafka03-wc.net:9092 (id: 2147483644 rack: null) for group test. 2018-09-28 16:36:09.822 DEBUG org.apache.kafka.clients.NetworkClient [pool-4-thread-50] [active] [wc] Initiating connection to node 2147483644 at kafka03-wc.net:9092. 2018-09-28 16:36:09.846 DEBUG org.apache.kafka.common.network.Selector [pool-4-thread-50] [active] [wc] Created socket with SO_RCVBUF = 65536, SO_SNDBUF = 131072, SO_TIMEOUT = 0 to node 2147483644 2018-09-28 16:36:09.846 DEBUG org.apache.kafka.clients.NetworkClient [pool-4-thread-50] [active] [wc] Completed connection to node 2147483644 -- This message was sent by Atlassian JIRA (v7.6.3#76005)