[jira] [Updated] (KAFKA-7457) AbstractCoordinator struck in Discover
[ https://issues.apache.org/jira/browse/KAFKA-7457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph Aliase updated KAFKA-7457:
---------------------------------
    Summary: AbstractCoordinator struck in Discover  (was: AbstractCoordinator struck in discover)

> AbstractCoordinator struck in Discover
> --------------------------------------
>
>                 Key: KAFKA-7457
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7457
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients
>    Affects Versions: 0.10.1.1
>         Environment: Linux
>            Reporter: Joseph Aliase
>            Priority: Minor
>
> AbstractCoordinator in kafka-client is stuck in discover and never rejoins the group. After a restart, the application is able to join the consumer group and consume from the topic.
> We see the logs below every 10 minutes. The sequence of events is:
> a) NetworkClient complains that the connection is idle and closes it.
> b) The consumer client tries to determine the coordinator by sending a request to Node 2.
> c) Node 2 responds that Node 3 is the group coordinator.
> d) The consumer client connects to the group coordinator.
> e) There is radio silence for 10 minutes, and the sequence repeats.
>
> 2018-09-28 16:35:59.771 TRACE org.apache.kafka.common.network.Selector [pool-4-thread-50] [active] [wc] About to close the idle connection from 2147483644 due to being idle for 540140 millis
> 2018-09-28 16:35:59.771 DEBUG org.apache.kafka.clients.NetworkClient [pool-4-thread-50] [active] [wc] Node 2147483644 disconnected.
> 2018-09-28 16:35:59.771 INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator [pool-4-thread-50] [active] [wc] Marking the coordinator kafka03-wc.net:9092 (id: 2147483644 rack: null) dead for group test
> 2018-09-28 16:35:59.771 DEBUG org.apache.kafka.clients.consumer.internals.AbstractCoordinator [pool-4-thread-50] [active] [wc] Sending coordinator request for group test to broker kafka02-wc.net:9092 (id: 2 rack: null)
> 2018-09-28 16:35:59.771 DEBUG org.apache.kafka.clients.NetworkClient [pool-4-thread-50] [active] [wc] Sending metadata request {topics=[address]} to node 2
> 2018-09-28 16:35:59.796 DEBUG org.apache.kafka.clients.Metadata [pool-4-thread-50] [active] [wc] Updated cluster metadata version 401 to Cluster(id = oC0BPXfOT42WqN7-v6b5Gw, nodes = [kafka02-wc.net:9092 (id: 2 rack: null), kafka03-wc.net:9092 (id: 3 rack: null), kafka05-wc.net:9092 (id: 5 rack: null), kafka01-wc.net:9092 (id: 1 rack: null), kafka04-wc.net:9092 (id: 4 rack: null)], partitions = [Partition(topic = address, partition = 2, leader = 1, replicas = [1,5,4,], isr = [5,1,4,]), Partition(topic = address, partition = 1, leader = 5, replicas = [5,3,4,], isr = [5,3,4,]), Partition(topic = address, partition = 0, leader = 4, replicas = [2,3,4,], isr = [4,3,2,]), Partition(topic = address, partition = 6, leader = 5, replicas = [1,5,4,], isr = [5,1,4,]), Partition(topic = address, partition = 5, leader = 4, replicas = [5,3,4,], isr = [5,4,3,]), Partition(topic = address, partition = 4, leader = 3, replicas = [1,2,3,], isr = [1,3,2,]), Partition(topic = address, partition = 3, leader = 2, replicas = [1,5,2,], isr = [5,1,2,]), Partition(topic = address, partition = 16, leader = 5, replicas = [5,2,3,], isr = [5,3,2,]), Partition(topic = address, partition = 15, leader = 4, replicas = [1,2,4,], isr = [1,4,2,]), Partition(topic = address, partition = 10, leader = 4, replicas = [1,5,4,], isr = [5,1,4,]), Partition(topic = address, partition = 9, leader = 3, replicas = [2,3,4,], isr = [3,4,2,]), Partition(topic = address, partition = 8, leader = 2, replicas = [1,2,3,], isr = [1,3,2,]), Partition(topic = address, partition = 7, leader = 1, replicas = [1,5,2,], isr = [5,1,2,]), Partition(topic = address, partition = 14, leader = 3, replicas = [5,3,4,], isr = [5,4,3,]), Partition(topic = address, partition = 13, leader = 2, replicas = [2,3,4,], isr = [3,4,2,]), Partition(topic = address, partition = 12, leader = 1, replicas = [1,2,3,], isr = [1,3,2,]), Partition(topic = address, partition = 11, leader = 5, replicas = [1,5,2,], isr = [5,1,2,])])
> 2018-09-28 16:35:59.797 DEBUG org.apache.kafka.clients.consumer.internals.AbstractCoordinator [pool-4-thread-50] [active] [wc] Received group coordinator response ClientResponse(receivedTimeMs=1538152559797, disconnected=false, request=ClientRequest(expectResponse=true, callback=org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler@2531ff5f, request=RequestSend(header={api_key=10,api_version=0,correlation_id=8354,client_id=active-wc-1-256}, body={group_id=test}), createdTimeMs=1538152559771, sendTimeMs=1538152559771),
[jira] [Created] (KAFKA-7457) AbstractCoordinator struck in discover
Joseph Aliase created KAFKA-7457:
------------------------------------

             Summary: AbstractCoordinator struck in discover
                 Key: KAFKA-7457
                 URL: https://issues.apache.org/jira/browse/KAFKA-7457
             Project: Kafka
          Issue Type: Bug
          Components: clients
    Affects Versions: 0.10.1.1
         Environment: Linux
            Reporter: Joseph Aliase

AbstractCoordinator in kafka-client is stuck in discover and never rejoins the group. After a restart, the application is able to join the consumer group and consume from the topic.

We see the logs below every 10 minutes. The sequence of events is:

a) NetworkClient complains that the connection is idle and closes it.
b) The consumer client tries to determine the coordinator by sending a request to Node 2.
c) Node 2 responds that Node 3 is the group coordinator.
d) The consumer client connects to the group coordinator.
e) There is radio silence for 10 minutes, and the sequence repeats.

{code}
2018-09-28 16:35:59.771 TRACE org.apache.kafka.common.network.Selector [pool-4-thread-50] [active] [wc] About to close the idle connection from 2147483644 due to being idle for 540140 millis
2018-09-28 16:35:59.771 DEBUG org.apache.kafka.clients.NetworkClient [pool-4-thread-50] [active] [wc] Node 2147483644 disconnected.
2018-09-28 16:35:59.771 INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator [pool-4-thread-50] [active] [wc] Marking the coordinator kafka03-wc.net:9092 (id: 2147483644 rack: null) dead for group test
2018-09-28 16:35:59.771 DEBUG org.apache.kafka.clients.consumer.internals.AbstractCoordinator [pool-4-thread-50] [active] [wc] Sending coordinator request for group test to broker kafka02-wc.net:9092 (id: 2 rack: null)
2018-09-28 16:35:59.771 DEBUG org.apache.kafka.clients.NetworkClient [pool-4-thread-50] [active] [wc] Sending metadata request {topics=[address]} to node 2
2018-09-28 16:35:59.796 DEBUG org.apache.kafka.clients.Metadata [pool-4-thread-50] [active] [wc] Updated cluster metadata version 401 to Cluster(id = oC0BPXfOT42WqN7-v6b5Gw, nodes = [kafka02-wc.net:9092 (id: 2 rack: null), kafka03-wc.net:9092 (id: 3 rack: null), kafka05-wc.net:9092 (id: 5 rack: null), kafka01-wc.net:9092 (id: 1 rack: null), kafka04-wc.net:9092 (id: 4 rack: null)], partitions = [Partition(topic = address, partition = 2, leader = 1, replicas = [1,5,4,], isr = [5,1,4,]), Partition(topic = address, partition = 1, leader = 5, replicas = [5,3,4,], isr = [5,3,4,]), Partition(topic = address, partition = 0, leader = 4, replicas = [2,3,4,], isr = [4,3,2,]), Partition(topic = address, partition = 6, leader = 5, replicas = [1,5,4,], isr = [5,1,4,]), Partition(topic = address, partition = 5, leader = 4, replicas = [5,3,4,], isr = [5,4,3,]), Partition(topic = address, partition = 4, leader = 3, replicas = [1,2,3,], isr = [1,3,2,]), Partition(topic = address, partition = 3, leader = 2, replicas = [1,5,2,], isr = [5,1,2,]), Partition(topic = address, partition = 16, leader = 5, replicas = [5,2,3,], isr = [5,3,2,]), Partition(topic = address, partition = 15, leader = 4, replicas = [1,2,4,], isr = [1,4,2,]), Partition(topic = address, partition = 10, leader = 4, replicas = [1,5,4,], isr = [5,1,4,]), Partition(topic = address, partition = 9, leader = 3, replicas = [2,3,4,], isr = [3,4,2,]), Partition(topic = address, partition = 8, leader = 2, replicas = [1,2,3,], isr = [1,3,2,]), Partition(topic = address, partition = 7, leader = 1, replicas = [1,5,2,], isr = [5,1,2,]), Partition(topic = address, partition = 14, leader = 3, replicas = [5,3,4,], isr = [5,4,3,]), Partition(topic = address, partition = 13, leader = 2, replicas = [2,3,4,], isr = [3,4,2,]), Partition(topic = address, partition = 12, leader = 1, replicas = [1,2,3,], isr = [1,3,2,]), Partition(topic = address, partition = 11, leader = 5, replicas = [1,5,2,], isr = [5,1,2,])])
2018-09-28 16:35:59.797 DEBUG org.apache.kafka.clients.consumer.internals.AbstractCoordinator [pool-4-thread-50] [active] [wc] Received group coordinator response ClientResponse(receivedTimeMs=1538152559797, disconnected=false, request=ClientRequest(expectResponse=true, callback=org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler@2531ff5f, request=RequestSend(header={api_key=10,api_version=0,correlation_id=8354,client_id=active-wc-1-256}, body={group_id=test}), createdTimeMs=1538152559771, sendTimeMs=1538152559771), responseBody={error_code=0,coordinator={node_id=3,host=kafka03-wc.net,port=9092}})
2018-09-28 16:35:59.797 INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator [pool-4-thread-50] [active] [wc] Discovered coordinator kafka03-wc.net:9092 (id: 2147483644 rack: null) for group test.
2018-09-28 16:35:59.797 INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator [pool-4-thread-50] [active] [wc] Marking the coordinator kafka03-wc.net:9092 (id:
{code}
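A note on the timing in the log above: the connection is closed "due to being idle for 540140 millis", which lines up with the Java client's default connections.max.idle.ms of 540000 ms (9 minutes). The sketch below is a minimal, illustrative model of that idle-expiry check, not the actual Selector internals; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the idle-connection expiry behind the
// "About to close the idle connection ... due to being idle for N millis"
// TRACE line. Names are illustrative; only the 540000 ms default for
// connections.max.idle.ms is taken from the client's documented config.
class IdleExpiryManager {
    private final long connectionsMaxIdleMs;
    private final Map<String, Long> lastActiveMs = new HashMap<>();

    IdleExpiryManager(long connectionsMaxIdleMs) {
        this.connectionsMaxIdleMs = connectionsMaxIdleMs;
    }

    // Record traffic on a connection (send or receive).
    void recordActivity(String connectionId, long nowMs) {
        lastActiveMs.put(connectionId, nowMs);
    }

    // Return the id of a connection whose idle time exceeds the limit,
    // or null if every connection is still within the limit.
    String findExpiredConnection(long nowMs) {
        for (Map.Entry<String, Long> e : lastActiveMs.entrySet()) {
            if (nowMs - e.getValue() > connectionsMaxIdleMs) {
                return e.getKey();
            }
        }
        return null;
    }
}
```

With the default limit, a coordinator connection that sees no traffic for just over nine minutes is closed, which is consistent with the ten-minute cadence of the rediscovery sequence reported above.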
[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak
[ https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276926#comment-16276926 ]

Joseph Aliase commented on KAFKA-5007:
--------------------------------------
[~huxi_2b] The only exception I saw in the Selector class was EOFException, which should be handled in the IOException block. We have not seen the issue lately, after we upgraded our NIC, but that doesn't solve the issue at the Kafka level.

> Kafka Replica Fetcher Thread- Resource Leak
> -------------------------------------------
>
>                 Key: KAFKA-5007
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5007
>             Project: Kafka
>          Issue Type: Bug
>          Components: core, network
>    Affects Versions: 0.10.0.0, 0.10.1.1, 0.10.2.0
>         Environment: Centos 7
>                      Java 8
>            Reporter: Joseph Aliase
>            Priority: Critical
>              Labels: reliability
>         Attachments: jstack-kafka.out, jstack-zoo.out, lsofkafka.txt, lsofzookeeper.txt
>
> Kafka runs out of open file descriptors when the system network interface is down.
> Issue description:
> We have a Kafka cluster of 5 nodes running version 0.10.1.1. The open file descriptor limit for the account running Kafka is set to 10.
> During an upgrade, the network interface went down. The outage continued for 12 hours, and eventually all the brokers crashed with a "java.io.IOException: Too many open files" error.
> We repeated the test in a lower environment and observed that the open socket count keeps increasing while the NIC is down.
> We have around 13 topics with a maximum partition count of 120, and the number of replica fetcher threads is set to 8.
> Using an internal monitoring tool, we observed that the open socket descriptor count for the broker pid continued to increase although the NIC was down, leading to the open file descriptor error.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
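The comment above points at the failure mode: if the network code catches only IOException, any throwable of another type escaping the read path leaves the socket, and its file descriptor, open. The sketch below illustrates that defensive pattern in plain java.nio; it is a hypothetical example, not the broker's Selector code.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

// Illustrative sketch of descriptor-safe error handling: close the channel
// on ANY failure, not just IOException, so an unexpected exception type
// (e.g. an unchecked NotYetConnectedException) cannot leak the descriptor.
class GuardedReader {
    // Attempt a read; on any failure, close the channel and report true.
    static boolean readOrClose(SocketChannel ch) {
        try {
            ch.read(ByteBuffer.allocate(1024));
            return false;
        } catch (Throwable t) {       // deliberately wider than IOException
            try {
                ch.close();           // release the file descriptor no matter what failed
            } catch (IOException ignored) {
                // best-effort close; the descriptor is released either way
            }
            return true;
        }
    }
}
```

Under a prolonged NIC outage like the one described, this pattern bounds the open-descriptor count instead of letting each failed fetch attempt strand another socket.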
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131341#comment-16131341 ]

Joseph Aliase commented on KAFKA-2729:
--------------------------------------
This has happened to us twice in Prod. A restart seems to be the only solution.

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-2729
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2729
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>            Reporter: Danil Serdyuchenko
>
> After a small network wobble where the zookeeper nodes couldn't reach each other, we started seeing a large number of under-replicated partitions. The zookeeper cluster recovered; however, we continued to see a large number of under-replicated partitions. Two brokers in the kafka cluster were showing this in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> This appeared for all of the topics on the affected brokers. Both brokers recovered only after a restart. Our own investigation yielded nothing; I was hoping you could shed some light on this issue. It may be related to https://issues.apache.org/jira/browse/KAFKA-1382, however we're using 0.8.2.1.
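The "Cached zkVersion [66] not equal to that in zookeeper, skip updating ISR" line reflects a versioned conditional update: ZooKeeper's setData takes an expected version, and a write with a stale version is rejected. The minimal model below shows why a broker holding a stale cached version fails every subsequent ISR update until the cache is refreshed; the class and method names are hypothetical, not broker or ZooKeeper client code.

```java
// Minimal model of a versioned compare-and-set store, in the spirit of
// ZooKeeper's setData(path, data, expectedVersion). Illustrative only.
class VersionedStore {
    private String data;
    private int version;

    VersionedStore(String data, int version) {
        this.data = data;
        this.version = version;
    }

    // The update succeeds only if the caller's cached version matches the
    // store's current version; otherwise it is skipped, mirroring the
    // "Cached zkVersion ... not equal to that in zookeeper" log line.
    boolean conditionalUpdate(String newData, int cachedVersion) {
        if (cachedVersion != version) {
            return false;   // stale cache: every retry with it will also fail
        }
        data = newData;
        version++;          // each successful write bumps the version
        return true;
    }

    int currentVersion() { return version; }
}
```

Once the broker's cached version falls behind (e.g. after a controller change during the network wobble), retrying with the same stale version can never succeed, which matches the observation that only a restart, which rebuilds the cache, recovers the broker.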
[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak
[ https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105048#comment-16105048 ]

Joseph Aliase commented on KAFKA-5007:
--------------------------------------
[~huxi_2b] [~junrao] Sorry for the delay. I'm working on this today and will update before EOD.

> Kafka Replica Fetcher Thread- Resource Leak
> -------------------------------------------
>
>                 Key: KAFKA-5007
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5007
>             Project: Kafka
>          Issue Type: Bug
>          Components: core, network
>    Affects Versions: 0.10.0.0, 0.10.1.1, 0.10.2.0
>         Environment: Centos 7
>                      Java 8
>            Reporter: Joseph Aliase
>            Priority: Critical
>              Labels: reliability
>         Attachments: jstack-kafka.out, jstack-zoo.out, lsofkafka.txt, lsofzookeeper.txt
[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak
[ https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16063456#comment-16063456 ]

Joseph Aliase commented on KAFKA-5007:
--------------------------------------
I believe that's the error we are seeing in the log. Let me reproduce the issue today and confirm. Thanks [~huxi_2b]

> Kafka Replica Fetcher Thread- Resource Leak
> -------------------------------------------
>
>                 Key: KAFKA-5007
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5007
>             Project: Kafka
>          Issue Type: Bug
>          Components: core, network
>    Affects Versions: 0.10.0.0, 0.10.1.1, 0.10.2.0
>         Environment: Centos 7
>                      Java 8
>            Reporter: Joseph Aliase
>            Priority: Critical
>              Labels: reliability
>         Attachments: jstack-kafka.out, jstack-zoo.out, lsofkafka.txt, lsofzookeeper.txt