[jira] [Updated] (KAFKA-7457) AbstractCoordinator stuck in Discover

2018-09-28 Thread Joseph Aliase (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Aliase updated KAFKA-7457:
-
Summary: AbstractCoordinator stuck in Discover  (was: AbstractCoordinator 
stuck in discover)

> AbstractCoordinator stuck in Discover
> --
>
> Key: KAFKA-7457
> URL: https://issues.apache.org/jira/browse/KAFKA-7457
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 0.10.1.1
> Environment: Linux
>Reporter: Joseph Aliase
>Priority: Minor
>
> AbstractCoordinator in kafka-client is stuck in discover and never rejoins 
> the group. After a restart, the application is able to join the consumer group 
> and consume from the topic.
> We see the logs below every 10 minutes. The sequence of events is:
> a) NetworkClient complains that the connection is idle and closes it.
> b) The consumer client tries to determine the coordinator by sending a request 
> to Node 2.
> c) Node 2 responds that Node 3 is the group coordinator.
> d) The consumer client connects to the group coordinator.
> e) There is radio silence for 10 minutes and the sequence repeats.

[jira] [Created] (KAFKA-7457) AbstractCoordinator stuck in discover

2018-09-28 Thread Joseph Aliase (JIRA)
Joseph Aliase created KAFKA-7457:


 Summary: AbstractCoordinator stuck in discover
 Key: KAFKA-7457
 URL: https://issues.apache.org/jira/browse/KAFKA-7457
 Project: Kafka
  Issue Type: Bug
  Components: clients
Affects Versions: 0.10.1.1
 Environment: Linux
Reporter: Joseph Aliase


AbstractCoordinator in kafka-client is stuck in discover and never rejoins the 
group. After a restart, the application is able to join the consumer group and 
consume from the topic.

We see the logs below every 10 minutes. The sequence of events is:

a) NetworkClient complains that the connection is idle and closes it.

b) The consumer client tries to determine the coordinator by sending a request to Node 2.

c) Node 2 responds that Node 3 is the group coordinator.

d) The consumer client connects to the group coordinator.

e) There is radio silence for 10 minutes and the sequence repeats.
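
For reference, the 540140 ms in the TRACE line below lines up with the consumer 
default connections.max.idle.ms of 540000 ms (9 minutes), so it is the idle-close 
of the coordinator connection that starts each rediscovery round. A minimal 
consumer sketch against this cluster (the poll loop and deserializers are 
assumptions; broker, topic, and group names are taken from the logs):

{code:java}
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CoordinatorIdleRepro {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka01-wc.net:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "test");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        // Client default is 540000 ms (9 minutes): connections idle longer than
        // this, including the coordinator connection, are closed -- matching the
        // "idle for 540140 millis" TRACE line in the report.
        props.put(ConsumerConfig.CONNECTIONS_MAX_IDLE_MS_CONFIG, "540000");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        try {
            consumer.subscribe(Collections.singletonList("address"));
            while (true) {
                // poll(long) is the 0.10.x-era API matching the affected version.
                ConsumerRecords<String, String> records = consumer.poll(1000L);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s-%d@%d: %s%n", record.topic(),
                            record.partition(), record.offset(), record.value());
                }
            }
        } finally {
            consumer.close();
        }
    }
}
{code}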

 

2018-09-28 16:35:59.771 TRACE org.apache.kafka.common.network.Selector 
[pool-4-thread-50] [active] [wc] About to close the idle connection from 
2147483644 due to being idle for 540140 millis
2018-09-28 16:35:59.771 DEBUG org.apache.kafka.clients.NetworkClient 
[pool-4-thread-50] [active] [wc] Node 2147483644 disconnected.
2018-09-28 16:35:59.771 INFO 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator 
[pool-4-thread-50] [active] [wc] Marking the coordinator kafka03-wc.net:9092 
(id: 2147483644 rack: null) dead for group test
2018-09-28 16:35:59.771 DEBUG 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator 
[pool-4-thread-50] [active] [wc] Sending coordinator request for group test to 
broker kafka02-wc.net:9092 (id: 2 rack: null)
2018-09-28 16:35:59.771 DEBUG org.apache.kafka.clients.NetworkClient 
[pool-4-thread-50] [active] [wc] Sending metadata request {topics=[address]} 
to node 2
2018-09-28 16:35:59.796 DEBUG org.apache.kafka.clients.Metadata 
[pool-4-thread-50] [active] [wc] Updated cluster metadata version 401 to 
Cluster(id = oC0BPXfOT42WqN7-v6b5Gw, nodes = [kafka02-wc.net:9092 (id: 2 rack: 
null), kafka03-wc.net:9092 (id: 3 rack: null), kafka05-wc.net:9092 (id: 5 rack: 
null), kafka01-wc.net:9092 (id: 1 rack: null), kafka04-wc.net:9092 (id: 4 rack: 
null)], partitions = [Partition(topic = address, partition = 2, leader = 1, 
replicas = [1,5,4,], isr = [5,1,4,]), Partition(topic = address, partition = 1, 
leader = 5, replicas = [5,3,4,], isr = [5,3,4,]), Partition(topic = address, 
partition = 0, leader = 4, replicas = [2,3,4,], isr = [4,3,2,]), 
Partition(topic = address, partition = 6, leader = 5, replicas = [1,5,4,], isr 
= [5,1,4,]), Partition(topic = address, partition = 5, leader = 4, replicas = 
[5,3,4,], isr = [5,4,3,]), Partition(topic = address, partition = 4, leader = 
3, replicas = [1,2,3,], isr = [1,3,2,]), Partition(topic = address, partition = 
3, leader = 2, replicas = [1,5,2,], isr = [5,1,2,]), Partition(topic = address, 
partition = 16, leader = 5, replicas = [5,2,3,], isr = [5,3,2,]), 
Partition(topic = address, partition = 15, leader = 4, replicas = [1,2,4,], isr 
= [1,4,2,]), Partition(topic = address, partition = 10, leader = 4, replicas = 
[1,5,4,], isr = [5,1,4,]), Partition(topic = address, partition = 9, leader = 
3, replicas = [2,3,4,], isr = [3,4,2,]), Partition(topic = address, partition = 
8, leader = 2, replicas = [1,2,3,], isr = [1,3,2,]), Partition(topic = address, 
partition = 7, leader = 1, replicas = [1,5,2,], isr = [5,1,2,]), 
Partition(topic = address, partition = 14, leader = 3, replicas = [5,3,4,], isr 
= [5,4,3,]), Partition(topic = address, partition = 13, leader = 2, replicas = 
[2,3,4,], isr = [3,4,2,]), Partition(topic = address, partition = 12, leader = 
1, replicas = [1,2,3,], isr = [1,3,2,]), Partition(topic = address, partition = 
11, leader = 5, replicas = [1,5,2,], isr = [5,1,2,])])
2018-09-28 16:35:59.797 DEBUG 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator 
[pool-4-thread-50] [active] [wc] Received group coordinator response 
ClientResponse(receivedTimeMs=1538152559797, disconnected=false, 
request=ClientRequest(expectResponse=true, 
callback=org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler@2531ff5f,
 
request=RequestSend(header={api_key=10,api_version=0,correlation_id=8354,client_id=active-wc-1-256},
 body={group_id=test}), createdTimeMs=1538152559771, 
sendTimeMs=1538152559771), 
responseBody={error_code=0,coordinator={node_id=3,host=kafka03-wc.net,port=9092}})
2018-09-28 16:35:59.797 INFO 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator 
[pool-4-thread-50] [active] [wc] Discovered coordinator kafka03-wc.net:9092 
(id: 2147483644 rack: null) for group test.
2018-09-28 16:35:59.797 INFO 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator 
[pool-4-thread-50] [active] [wc] Marking the coordinator kafka03-wc.net:9092 
(id: 

[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak

2017-12-04 Thread Joseph Aliase (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16276926#comment-16276926
 ] 

Joseph Aliase commented on KAFKA-5007:
--

[~huxi_2b] The only exception I saw in the Selector class was EOFException, which 
should be handled in the IOException block.

We have not seen the issue lately, after we upgraded our NIC, but that doesn't 
solve the issue at the Kafka level.
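
(For context, a generic Java illustration rather than Kafka's Selector code: 
java.io.EOFException is a subclass of java.io.IOException, so a catch 
(IOException) clause already handles it.)

{code:java}
import java.io.EOFException;
import java.io.IOException;

public class EofIsIoException {
    public static void main(String[] args) {
        try {
            // Simulates the peer closing the connection mid-read.
            throw new EOFException("connection closed mid-frame");
        } catch (IOException e) {
            // The EOFException lands here; no dedicated catch clause is needed.
            System.out.println("Handled as IOException: " + e);
        }
    }
}
{code}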


> Kafka Replica Fetcher Thread- Resource Leak
> ---
>
> Key: KAFKA-5007
> URL: https://issues.apache.org/jira/browse/KAFKA-5007
> Project: Kafka
>  Issue Type: Bug
>  Components: core, network
>Affects Versions: 0.10.0.0, 0.10.1.1, 0.10.2.0
> Environment: CentOS 7
> Java 8
>Reporter: Joseph Aliase
>Priority: Critical
>  Labels: reliability
> Attachments: jstack-kafka.out, jstack-zoo.out, lsofkafka.txt, 
> lsofzookeeper.txt
>
>
> Kafka runs out of open file descriptors when the system network interface is 
> down.
> Issue description:
> We have a Kafka cluster of 5 nodes running version 0.10.1.1. The open file 
> descriptor limit for the account running Kafka is set to 10.
> During an upgrade, the network interface went down. The outage continued for 
> 12 hours; eventually all the brokers crashed with a java.io.IOException: Too 
> many open files error.
> We repeated the test in a lower environment and observed that the open socket 
> count keeps increasing while the NIC is down.
> We have around 13 topics with a maximum partition count of 120, and the number 
> of replica fetcher threads is set to 8.
> Using an internal monitoring tool, we observed that the open socket descriptor 
> count for the broker pid continued to increase although the NIC was down, 
> leading to the open file descriptor error.
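
A simple way to watch the descriptor count from inside the JVM (a sketch 
assuming a HotSpot JVM on Unix; the class name and polling interval are made up 
here) is the UnixOperatingSystemMXBean counters, which would show the climb 
described above while the NIC is down:

{code:java}
import java.lang.management.ManagementFactory;

import com.sun.management.UnixOperatingSystemMXBean;

public class FdCountProbe {
    public static void main(String[] args) throws InterruptedException {
        // HotSpot on Unix exposes file descriptor counters via this MXBean.
        UnixOperatingSystemMXBean os =
                (UnixOperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        while (true) {
            System.out.printf("open fds: %d / max: %d%n",
                    os.getOpenFileDescriptorCount(),
                    os.getMaxFileDescriptorCount());
            Thread.sleep(10_000L);
        }
    }
}
{code}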




[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-08-17 Thread Joseph Aliase (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131341#comment-16131341
 ] 

Joseph Aliase commented on KAFKA-2729:
--

This has happened to us twice in prod. A restart seems to be the only solution. 

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> ---
>
> Key: KAFKA-2729
> URL: https://issues.apache.org/jira/browse/KAFKA-2729
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.2.1
>Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other, 
> we started seeing a large number of under-replicated partitions. The zookeeper 
> cluster recovered; however, we continued to see a large number of 
> under-replicated partitions. Two brokers in the kafka cluster were showing 
> this in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for 
> partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 
> (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] 
> not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> This happened for all of the topics on the affected brokers. Both brokers 
> only recovered after a restart. Our own investigation yielded nothing; I was 
> hoping you could shed some light on this issue. Possibly it's related to 
> https://issues.apache.org/jira/browse/KAFKA-1382, however we're using 
> 0.8.2.1.
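
For background on the mechanism (a simplified sketch, not Kafka's actual 
partition code): the broker writes the new ISR to ZooKeeper with a conditional 
setData keyed on the znode version, so a stale cached version makes the write 
fail and the ISR update is skipped, which is exactly the loop the log shows.

{code:java}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConditionalIsrUpdate {
    // Hypothetical helper: setData is a conditional write keyed on the znode
    // version, so a stale cached version (66 in the log above) fails with
    // BadVersionException and the ISR update is skipped.
    static void updateIsr(ZooKeeper zk, String path, byte[] newIsr, int cachedVersion)
            throws KeeperException, InterruptedException {
        try {
            Stat stat = zk.setData(path, newIsr, cachedVersion);
            System.out.println("ISR updated; znode version now " + stat.getVersion());
        } catch (KeeperException.BadVersionException e) {
            System.out.println("Cached zkVersion " + cachedVersion
                    + " not equal to that in ZooKeeper; skipping ISR update");
        }
    }
}
{code}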




[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak

2017-07-28 Thread Joseph Aliase (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16105048#comment-16105048
 ] 

Joseph Aliase commented on KAFKA-5007:
--

[~huxi_2b][~junrao] Sorry for the delay. I'm working on this today and will 
update before EOD.





[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak

2017-06-26 Thread Joseph Aliase (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063456#comment-16063456
 ] 

Joseph Aliase commented on KAFKA-5007:
--

I believe that's the error we are seeing in the log. Let me reproduce the issue 
today; I will confirm.

Thanks, [~huxi_2b]



