[ 
https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008959#comment-16008959
 ] 

Joseph Aliase commented on KAFKA-5007:
--------------------------------------

[~cmccabe] PID 24739 is the Kafka process. Let me give you some background. 
Initially, when this issue occurred in Prod, the Kafka process died, so I was 
not able to collect the open file descriptors.

We knew the issue was triggered by the NIC. So, to replicate it in DEV, I 
brought the NIC down on the server and saw the descriptors growing. That is the 
Unix socket you are referring to.

The issue reoccurred in Prod, but this time I was able to collect stats such as 
lsof output; those are the attached files.

I don't see this as an issue between ZooKeeper and Kafka.

It's easy to reproduce the issue on a MacBook. Just start a Kafka cluster on a 
remote system with one broker on your MacBook and ingest some data into a test 
topic. After some time, bring down the internet connection; you will start 
seeing replica fetcher thread error messages and the open file descriptor count 
rising.
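
For anyone who wants to watch the descriptor count while reproducing this, 
here is a minimal sketch, assuming a Linux host (such as the CentOS 7 brokers) 
where /proc/<pid>/fd is readable by the user running the script. The function 
names count_open_fds and watch are my own, not part of Kafka or any tool 
mentioned in this issue. On a MacBook there is no /proc, so lsof -p <pid> 
would have to be used instead.

```python
import os
import time
from collections import Counter


def count_open_fds(pid, proc_root="/proc"):
    """Count open descriptors of a process, grouped by kind (Linux only)."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    kinds = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        # Linux shows non-file descriptors as "socket:[inode]", "pipe:[inode]".
        if target.startswith("socket:"):
            kinds["socket"] += 1
        elif target.startswith("pipe:"):
            kinds["pipe"] += 1
        else:
            kinds["other"] += 1
    return kinds


def watch(pid, samples=3, interval=10.0, proc_root="/proc"):
    """Print a timestamped descriptor count `samples` times; return the totals."""
    totals = []
    for i in range(samples):
        kinds = count_open_fds(pid, proc_root)
        total = sum(kinds.values())
        totals.append(total)
        print(time.strftime("%H:%M:%S"), "total=%d" % total, dict(kinds))
        if i + 1 < samples:
            time.sleep(interval)
    return totals
```

Calling watch(24739, samples=60) as the Kafka user (or root) before and after 
bringing the NIC down should show whether the socket count keeps climbing. 
Reading another user's /proc/<pid>/fd without sufficient privileges raises 
PermissionError.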



> Kafka Replica Fetcher Thread- Resource Leak
> -------------------------------------------
>
>                 Key: KAFKA-5007
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5007
>             Project: Kafka
>          Issue Type: Bug
>          Components: core, network
>    Affects Versions: 0.10.0.0, 0.10.1.1, 0.10.2.0
>         Environment: Centos 7
> Java 8
>            Reporter: Joseph Aliase
>            Priority: Critical
>              Labels: reliability
>         Attachments: jstack-kafka.out, jstack-zoo.out, lsofkafka.txt, 
> lsofzookeeper.txt
>
>
> Kafka runs out of open file descriptors when the system network interface is 
> down.
> Issue description:
> We have a Kafka cluster of 5 nodes running version 0.10.1.1. The open file 
> descriptor limit for the account running Kafka is set to 100000.
> During an upgrade, the network interface went down. The outage continued for 
> 12 hours, and eventually all the brokers crashed with a java.io.IOException: 
> Too many open files error.
> We repeated the test in a lower environment and observed that the open socket 
> count keeps increasing while the NIC is down.
> We have around 13 topics with max partition size of 120, and the number of 
> replica fetcher threads is set to 8.
> Using an internal monitoring tool, we observed that the open socket descriptor 
> count for the broker PID continued to increase although the NIC was down, 
> leading to the open file descriptor error.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
