[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak

2017-06-13 Thread huxihx (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048605#comment-16048605
 ] 

huxihx commented on KAFKA-5007:
---

[~joseph.alias...@gmail.com] org.apache.kafka.common.network.Selector

> Kafka Replica Fetcher Thread- Resource Leak
> ---
>
> Key: KAFKA-5007
> URL: https://issues.apache.org/jira/browse/KAFKA-5007
> Project: Kafka
>  Issue Type: Bug
>  Components: core, network
>Affects Versions: 0.10.0.0, 0.10.1.1, 0.10.2.0
> Environment: Centos 7
> Jave 8
>Reporter: Joseph Aliase
>Priority: Critical
>  Labels: reliability
> Attachments: jstack-kafka.out, jstack-zoo.out, lsofkafka.txt, 
> lsofzookeeper.txt
>
>
> Kafka is running out of open file descriptor when system network interface is 
> done.
> Issue description:
> We have a Kafka Cluster of 5 node running on version 0.10.1.1. The open file 
> descriptor for the account running Kafka is set to 10.
> During an upgrade, network interface went down. Outage continued for 12 hours 
> eventually all the broker crashed with java.io.IOException: Too many open 
> files error.
> We repeated the test in a lower environment and observed that Open Socket 
> count keeps on increasing while the NIC is down.
> We have around 13 topics with max partition size of 120 and number of replica 
> fetcher thread is set to 8.
> Using an internal monitoring tool we observed that Open Socket descriptor   
> for the broker pid continued to increase although NIC was down leading to  
> Open File descriptor error. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak

2017-06-13 Thread Joseph Aliase (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048586#comment-16048586
 ] 

Joseph Aliase commented on KAFKA-5007:
--

[~huxi_2b] Please let me know the class this block belongs too

> Kafka Replica Fetcher Thread- Resource Leak
> ---
>
> Key: KAFKA-5007
> URL: https://issues.apache.org/jira/browse/KAFKA-5007
> Project: Kafka
>  Issue Type: Bug
>  Components: core, network
>Affects Versions: 0.10.0.0, 0.10.1.1, 0.10.2.0
> Environment: Centos 7
> Jave 8
>Reporter: Joseph Aliase
>Priority: Critical
>  Labels: reliability
> Attachments: jstack-kafka.out, jstack-zoo.out, lsofkafka.txt, 
> lsofzookeeper.txt
>
>
> Kafka is running out of open file descriptor when system network interface is 
> done.
> Issue description:
> We have a Kafka Cluster of 5 node running on version 0.10.1.1. The open file 
> descriptor for the account running Kafka is set to 10.
> During an upgrade, network interface went down. Outage continued for 12 hours 
> eventually all the broker crashed with java.io.IOException: Too many open 
> files error.
> We repeated the test in a lower environment and observed that Open Socket 
> count keeps on increasing while the NIC is down.
> We have around 13 topics with max partition size of 120 and number of replica 
> fetcher thread is set to 8.
> Using an internal monitoring tool we observed that Open Socket descriptor   
> for the broker pid continued to increase although NIC was down leading to  
> Open File descriptor error. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak

2017-06-13 Thread huxihx (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047561#comment-16047561
 ] 

huxihx commented on KAFKA-5007:
---

[~junrao] is it possible that it 's caused by the code snippet below:


{code:java}
...
SocketChannel socketChannel = SocketChannel.open();
socketChannel.configureBlocking(false);
Socket socket = socketChannel.socket();
socket.setKeepAlive(true);
if (sendBufferSize != Selectable.USE_DEFAULT_BUFFER_SIZE)
socket.setSendBufferSize(sendBufferSize);
if (receiveBufferSize != Selectable.USE_DEFAULT_BUFFER_SIZE)
socket.setReceiveBufferSize(receiveBufferSize);
socket.setTcpNoDelay(true);
boolean connected;
try {
connected = socketChannel.connect(address);
} catch (UnresolvedAddressException e) {
socketChannel.close();
throw new IOException("Can't resolve address: " + address, e);
} catch (IOException e) {
socketChannel.close();
throw e;
}
{code}

The code did not capture all possible exceptions so `socketChannel` got failed 
to be closed.


> Kafka Replica Fetcher Thread- Resource Leak
> ---
>
> Key: KAFKA-5007
> URL: https://issues.apache.org/jira/browse/KAFKA-5007
> Project: Kafka
>  Issue Type: Bug
>  Components: core, network
>Affects Versions: 0.10.0.0, 0.10.1.1, 0.10.2.0
> Environment: Centos 7
> Jave 8
>Reporter: Joseph Aliase
>Priority: Critical
>  Labels: reliability
> Attachments: jstack-kafka.out, jstack-zoo.out, lsofkafka.txt, 
> lsofzookeeper.txt
>
>
> Kafka is running out of open file descriptor when system network interface is 
> done.
> Issue description:
> We have a Kafka Cluster of 5 node running on version 0.10.1.1. The open file 
> descriptor for the account running Kafka is set to 10.
> During an upgrade, network interface went down. Outage continued for 12 hours 
> eventually all the broker crashed with java.io.IOException: Too many open 
> files error.
> We repeated the test in a lower environment and observed that Open Socket 
> count keeps on increasing while the NIC is down.
> We have around 13 topics with max partition size of 120 and number of replica 
> fetcher thread is set to 8.
> Using an internal monitoring tool we observed that Open Socket descriptor   
> for the broker pid continued to increase although NIC was down leading to  
> Open File descriptor error. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak

2017-05-12 Thread Joseph Aliase (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16009056#comment-16009056
 ] 

Joseph Aliase commented on KAFKA-5007:
--

[~cuiyang] I have replicated this issue in all version except 0.8.

> Kafka Replica Fetcher Thread- Resource Leak
> ---
>
> Key: KAFKA-5007
> URL: https://issues.apache.org/jira/browse/KAFKA-5007
> Project: Kafka
>  Issue Type: Bug
>  Components: core, network
>Affects Versions: 0.10.0.0, 0.10.1.1, 0.10.2.0
> Environment: Centos 7
> Jave 8
>Reporter: Joseph Aliase
>Priority: Critical
>  Labels: reliability
> Attachments: jstack-kafka.out, jstack-zoo.out, lsofkafka.txt, 
> lsofzookeeper.txt
>
>
> Kafka is running out of open file descriptor when system network interface is 
> done.
> Issue description:
> We have a Kafka Cluster of 5 node running on version 0.10.1.1. The open file 
> descriptor for the account running Kafka is set to 10.
> During an upgrade, network interface went down. Outage continued for 12 hours 
> eventually all the broker crashed with java.io.IOException: Too many open 
> files error.
> We repeated the test in a lower environment and observed that Open Socket 
> count keeps on increasing while the NIC is down.
> We have around 13 topics with max partition size of 120 and number of replica 
> fetcher thread is set to 8.
> Using an internal monitoring tool we observed that Open Socket descriptor   
> for the broker pid continued to increase although NIC was down leading to  
> Open File descriptor error. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak

2017-05-12 Thread Joseph Aliase (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16009052#comment-16009052
 ] 

Joseph Aliase commented on KAFKA-5007:
--

[~cuiyang] you are referring to open file descriptor? its:
lsof -p 

> Kafka Replica Fetcher Thread- Resource Leak
> ---
>
> Key: KAFKA-5007
> URL: https://issues.apache.org/jira/browse/KAFKA-5007
> Project: Kafka
>  Issue Type: Bug
>  Components: core, network
>Affects Versions: 0.10.0.0, 0.10.1.1, 0.10.2.0
> Environment: Centos 7
> Jave 8
>Reporter: Joseph Aliase
>Priority: Critical
>  Labels: reliability
> Attachments: jstack-kafka.out, jstack-zoo.out, lsofkafka.txt, 
> lsofzookeeper.txt
>
>
> Kafka is running out of open file descriptor when system network interface is 
> done.
> Issue description:
> We have a Kafka Cluster of 5 node running on version 0.10.1.1. The open file 
> descriptor for the account running Kafka is set to 10.
> During an upgrade, network interface went down. Outage continued for 12 hours 
> eventually all the broker crashed with java.io.IOException: Too many open 
> files error.
> We repeated the test in a lower environment and observed that Open Socket 
> count keeps on increasing while the NIC is down.
> We have around 13 topics with max partition size of 120 and number of replica 
> fetcher thread is set to 8.
> Using an internal monitoring tool we observed that Open Socket descriptor   
> for the broker pid continued to increase although NIC was down leading to  
> Open File descriptor error. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak

2017-05-12 Thread cuiyang (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16009040#comment-16009040
 ] 

cuiyang commented on KAFKA-5007:


@Joseph Aliase Could you give the command which you used to catch log?

> Kafka Replica Fetcher Thread- Resource Leak
> ---
>
> Key: KAFKA-5007
> URL: https://issues.apache.org/jira/browse/KAFKA-5007
> Project: Kafka
>  Issue Type: Bug
>  Components: core, network
>Affects Versions: 0.10.0.0, 0.10.1.1, 0.10.2.0
> Environment: Centos 7
> Jave 8
>Reporter: Joseph Aliase
>Priority: Critical
>  Labels: reliability
> Attachments: jstack-kafka.out, jstack-zoo.out, lsofkafka.txt, 
> lsofzookeeper.txt
>
>
> Kafka is running out of open file descriptor when system network interface is 
> done.
> Issue description:
> We have a Kafka Cluster of 5 node running on version 0.10.1.1. The open file 
> descriptor for the account running Kafka is set to 10.
> During an upgrade, network interface went down. Outage continued for 12 hours 
> eventually all the broker crashed with java.io.IOException: Too many open 
> files error.
> We repeated the test in a lower environment and observed that Open Socket 
> count keeps on increasing while the NIC is down.
> We have around 13 topics with max partition size of 120 and number of replica 
> fetcher thread is set to 8.
> Using an internal monitoring tool we observed that Open Socket descriptor   
> for the broker pid continued to increase although NIC was down leading to  
> Open File descriptor error. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak

2017-05-12 Thread cuiyang (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16009039#comment-16009039
 ] 

cuiyang commented on KAFKA-5007:


[~joseph.alias...@gmail.com] [~junrao] 
we also encounter this issu 3 times in last 3 weeks on our prod environment.
Our Kafka cluster version is 0.9.0.1 and 0.10.1.2. I am sure both version 
0.9.0.1 and 0.10.1.2 has this issue.
We set the fd to 10 by "ulimit -c 10", but it can not work.
When the issue happened, we monitor the fd on broker, but it is not much:
2017-05-12-22:39:56 FD_total_num:8153 FD_pair_num:6205 FD_ads_num:1459
2017-05-12-22:40:07 FD_total_num:8157 FD_pair_num:6206 FD_ads_num:1459
2017-05-12-22:40:18 FD_total_num:8155 FD_pair_num:6207 FD_ads_num:1459
2017-05-12-22:40:29 FD_total_num:8158 FD_pair_num:6208 FD_ads_num:1460
2017-05-12-22:40:40 FD_total_num:8160 FD_pair_num:6211 FD_ads_num:1461
2017-05-12-22:40:51 FD_total_num:8162 FD_pair_num:6213 FD_ads_num:1461
2017-05-12-22:41:02 FD_total_num:8172 FD_pair_num:6214 FD_ads_num:1462
2017-05-12-22:41:13 FD_total_num:8167 FD_pair_num:6214 FD_ads_num:1462
2017-05-12-22:41:24 FD_total_num:8172 FD_pair_num:6215 FD_ads_num:1462
2017-05-12-22:41:36 FD_total_num:8172 FD_pair_num:6216 FD_ads_num:1462
2017-05-12-22:41:47 FD_total_num:8169 FD_pair_num:6216 FD_ads_num:1462
2017-05-12-22:41:58 FD_total_num:8193 FD_pair_num:6216 FD_ads_num:1462
2017-05-12-22:42:08 FD_total_num:0 FD_pair_num:0 FD_ads_num:0
2017-05-12-22:42:19 FD_total_num:0 FD_pair_num:0 FD_ads_num:0
2017-05-12-22:42:29 FD_total_num:0 FD_pair_num:0 FD_ads_num:0
On 2017-05-12-22:42:08, FD is 0, because the broker is down.


> Kafka Replica Fetcher Thread- Resource Leak
> ---
>
> Key: KAFKA-5007
> URL: https://issues.apache.org/jira/browse/KAFKA-5007
> Project: Kafka
>  Issue Type: Bug
>  Components: core, network
>Affects Versions: 0.10.0.0, 0.10.1.1, 0.10.2.0
> Environment: Centos 7
> Jave 8
>Reporter: Joseph Aliase
>Priority: Critical
>  Labels: reliability
> Attachments: jstack-kafka.out, jstack-zoo.out, lsofkafka.txt, 
> lsofzookeeper.txt
>
>
> Kafka is running out of open file descriptor when system network interface is 
> done.
> Issue description:
> We have a Kafka Cluster of 5 node running on version 0.10.1.1. The open file 
> descriptor for the account running Kafka is set to 10.
> During an upgrade, network interface went down. Outage continued for 12 hours 
> eventually all the broker crashed with java.io.IOException: Too many open 
> files error.
> We repeated the test in a lower environment and observed that Open Socket 
> count keeps on increasing while the NIC is down.
> We have around 13 topics with max partition size of 120 and number of replica 
> fetcher thread is set to 8.
> Using an internal monitoring tool we observed that Open Socket descriptor   
> for the broker pid continued to increase although NIC was down leading to  
> Open File descriptor error. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak

2017-05-12 Thread Joseph Aliase (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008959#comment-16008959
 ] 

Joseph Aliase commented on KAFKA-5007:
--

[~cmccabe] PID: 24739  is Kafka process. Let me give you a background. Initial 
when this issue occurred in Prod, Kafka process died so I was not able to 
collect open file descriptors.

We knew the issue was triggered by NIC. So to replicate it in DEV I brought the 
NIC down on the server and saw the descriptors growing. That's is the unix 
socket you are referring too.

Issues reoccurred in prod but this time I was able to collect stats like lsof 
those are attached files.

I don't see this has an issue between zookeeper and Kafka.

It's easy to reproduce the issue in MacBook. Just start a Kafka Cluster in 
remote system with one broker in your macbook and ingest some data in test 
topic. After some time bring down the internet, you would start seeing replica 
fetcher thread error message and the open file descriptor rising.



> Kafka Replica Fetcher Thread- Resource Leak
> ---
>
> Key: KAFKA-5007
> URL: https://issues.apache.org/jira/browse/KAFKA-5007
> Project: Kafka
>  Issue Type: Bug
>  Components: core, network
>Affects Versions: 0.10.0.0, 0.10.1.1, 0.10.2.0
> Environment: Centos 7
> Jave 8
>Reporter: Joseph Aliase
>Priority: Critical
>  Labels: reliability
> Attachments: jstack-kafka.out, jstack-zoo.out, lsofkafka.txt, 
> lsofzookeeper.txt
>
>
> Kafka is running out of open file descriptor when system network interface is 
> done.
> Issue description:
> We have a Kafka Cluster of 5 node running on version 0.10.1.1. The open file 
> descriptor for the account running Kafka is set to 10.
> During an upgrade, network interface went down. Outage continued for 12 hours 
> eventually all the broker crashed with java.io.IOException: Too many open 
> files error.
> We repeated the test in a lower environment and observed that Open Socket 
> count keeps on increasing while the NIC is down.
> We have around 13 topics with max partition size of 120 and number of replica 
> fetcher thread is set to 8.
> Using an internal monitoring tool we observed that Open Socket descriptor   
> for the broker pid continued to increase although NIC was down leading to  
> Open File descriptor error. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak

2017-05-12 Thread Colin P. McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008930#comment-16008930
 ] 

Colin P. McCabe commented on KAFKA-5007:


I tried running a kafka instance that could not communicate with ZooKeeper on 
my laptop, and did not observe any growth in UNIX domain socket (or other 
sockets) which were open.

I also noticed that the inital lsof file attached to the ticket does not show 
UNIX domain sockets.  Did the issue change?

> Kafka Replica Fetcher Thread- Resource Leak
> ---
>
> Key: KAFKA-5007
> URL: https://issues.apache.org/jira/browse/KAFKA-5007
> Project: Kafka
>  Issue Type: Bug
>  Components: core, network
>Affects Versions: 0.10.0.0, 0.10.1.1, 0.10.2.0
> Environment: Centos 7
> Jave 8
>Reporter: Joseph Aliase
>Priority: Critical
>  Labels: reliability
> Attachments: jstack-kafka.out, jstack-zoo.out, lsofkafka.txt, 
> lsofzookeeper.txt
>
>
> Kafka is running out of open file descriptor when system network interface is 
> done.
> Issue description:
> We have a Kafka Cluster of 5 node running on version 0.10.1.1. The open file 
> descriptor for the account running Kafka is set to 10.
> During an upgrade, network interface went down. Outage continued for 12 hours 
> eventually all the broker crashed with java.io.IOException: Too many open 
> files error.
> We repeated the test in a lower environment and observed that Open Socket 
> count keeps on increasing while the NIC is down.
> We have around 13 topics with max partition size of 120 and number of replica 
> fetcher thread is set to 8.
> Using an internal monitoring tool we observed that Open Socket descriptor   
> for the broker pid continued to increase although NIC was down leading to  
> Open File descriptor error. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak

2017-05-12 Thread Colin P. McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008699#comment-16008699
 ] 

Colin P. McCabe commented on KAFKA-5007:


{code}
 java 24739 cyclone 1216u unix 0x880035b35000 0t0 42713601 socket
{code}

If I'm reading this right, this is a UNIX domain socket that is being leaked.  
Can you run {{netstat --unix}} and see if the sockets show up in that output?  
Also, I assume that 24739 was the Kafka process ID?

A UNIX domain socket leak in Java code could indicate a problem with how the 
code uses NIO

> Kafka Replica Fetcher Thread- Resource Leak
> ---
>
> Key: KAFKA-5007
> URL: https://issues.apache.org/jira/browse/KAFKA-5007
> Project: Kafka
>  Issue Type: Bug
>  Components: core, network
>Affects Versions: 0.10.0.0, 0.10.1.1, 0.10.2.0
> Environment: Centos 7
> Jave 8
>Reporter: Joseph Aliase
>Priority: Critical
>  Labels: reliability
> Attachments: jstack-kafka.out, jstack-zoo.out, lsofkafka.txt, 
> lsofzookeeper.txt
>
>
> Kafka is running out of open file descriptor when system network interface is 
> done.
> Issue description:
> We have a Kafka Cluster of 5 node running on version 0.10.1.1. The open file 
> descriptor for the account running Kafka is set to 10.
> During an upgrade, network interface went down. Outage continued for 12 hours 
> eventually all the broker crashed with java.io.IOException: Too many open 
> files error.
> We repeated the test in a lower environment and observed that Open Socket 
> count keeps on increasing while the NIC is down.
> We have around 13 topics with max partition size of 120 and number of replica 
> fetcher thread is set to 8.
> Using an internal monitoring tool we observed that Open Socket descriptor   
> for the broker pid continued to increase although NIC was down leading to  
> Open File descriptor error. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak

2017-05-12 Thread Joseph Aliase (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008147#comment-16008147
 ] 

Joseph Aliase commented on KAFKA-5007:
--

[~junrao] Any thoughts on the files related to the issues I uploaded?
[~ijuma]

> Kafka Replica Fetcher Thread- Resource Leak
> ---
>
> Key: KAFKA-5007
> URL: https://issues.apache.org/jira/browse/KAFKA-5007
> Project: Kafka
>  Issue Type: Bug
>  Components: core, network
>Affects Versions: 0.10.0.0, 0.10.1.1, 0.10.2.0
> Environment: Centos 7
> Jave 8
>Reporter: Joseph Aliase
>Priority: Critical
>  Labels: reliability
> Attachments: jstack-kafka.out, jstack-zoo.out, lsofkafka.txt, 
> lsofzookeeper.txt
>
>
> Kafka is running out of open file descriptor when system network interface is 
> done.
> Issue description:
> We have a Kafka Cluster of 5 node running on version 0.10.1.1. The open file 
> descriptor for the account running Kafka is set to 10.
> During an upgrade, network interface went down. Outage continued for 12 hours 
> eventually all the broker crashed with java.io.IOException: Too many open 
> files error.
> We repeated the test in a lower environment and observed that Open Socket 
> count keeps on increasing while the NIC is down.
> We have around 13 topics with max partition size of 120 and number of replica 
> fetcher thread is set to 8.
> Using an internal monitoring tool we observed that Open Socket descriptor   
> for the broker pid continued to increase although NIC was down leading to  
> Open File descriptor error. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak

2017-04-18 Thread huxi (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15972314#comment-15972314
 ] 

huxi commented on KAFKA-5007:
-

Maybe a complete stack trace is helpful to diagnose what action throws this 
exception.

> Kafka Replica Fetcher Thread- Resource Leak
> ---
>
> Key: KAFKA-5007
> URL: https://issues.apache.org/jira/browse/KAFKA-5007
> Project: Kafka
>  Issue Type: Bug
>  Components: core, network
>Affects Versions: 0.10.0.0, 0.10.1.1, 0.10.2.0
> Environment: Centos 7
> Jave 8
>Reporter: Joseph Aliase
>Priority: Critical
>  Labels: reliability
>
> Kafka is running out of open file descriptor when system network interface is 
> done.
> Issue description:
> We have a Kafka Cluster of 5 node running on version 0.10.1.1. The open file 
> descriptor for the account running Kafka is set to 10.
> During an upgrade, network interface went down. Outage continued for 12 hours 
> eventually all the broker crashed with java.io.IOException: Too many open 
> files error.
> We repeated the test in a lower environment and observed that Open Socket 
> count keeps on increasing while the NIC is down.
> We have around 13 topics with max partition size of 120 and number of replica 
> fetcher thread is set to 8.
> Using an internal monitoring tool we observed that Open Socket descriptor   
> for the broker pid continued to increase although NIC was down leading to  
> Open File descriptor error. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak

2017-04-17 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971929#comment-15971929
 ] 

Jun Rao commented on KAFKA-5007:


[~joseph.alias...@gmail.com], the socket leak could be at the ZK client level. 
Do you see any warn/error logging related to ZK? If not, perhaps you can turn 
on debug/trace level logging in org.I0Itec.zkclient.ZkClient and 
org.apache.zookeeper.ZooKeeper, run the test again and see if there is anything 
suspicious?

> Kafka Replica Fetcher Thread- Resource Leak
> ---
>
> Key: KAFKA-5007
> URL: https://issues.apache.org/jira/browse/KAFKA-5007
> Project: Kafka
>  Issue Type: Bug
>  Components: core, network
>Affects Versions: 0.10.0.0, 0.10.1.1, 0.10.2.0
> Environment: Centos 7
> Jave 8
>Reporter: Joseph Aliase
>Priority: Critical
>  Labels: reliability
>
> Kafka is running out of open file descriptor when system network interface is 
> done.
> Issue description:
> We have a Kafka Cluster of 5 node running on version 0.10.1.1. The open file 
> descriptor for the account running Kafka is set to 10.
> During an upgrade, network interface went down. Outage continued for 12 hours 
> eventually all the broker crashed with java.io.IOException: Too many open 
> files error.
> We repeated the test in a lower environment and observed that Open Socket 
> count keeps on increasing while the NIC is down.
> We have around 13 topics with max partition size of 120 and number of replica 
> fetcher thread is set to 8.
> Using an internal monitoring tool we observed that Open Socket descriptor   
> for the broker pid continued to increase although NIC was down leading to  
> Open File descriptor error. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak

2017-04-16 Thread Joseph Aliase (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15970549#comment-15970549
 ] 

Joseph Aliase commented on KAFKA-5007:
--

It seems ticket https://issues.apache.org/jira/browse/KAFKA-4477 is similar to 
the current ticket.

The ticket was closed since nobody reported in 0.10.1.1. But issue exists in 
0.10.1.1.

It's a major issue and I would I like to work on it. I need guidance on where 
to start.

[~junrao]



> Kafka Replica Fetcher Thread- Resource Leak
> ---
>
> Key: KAFKA-5007
> URL: https://issues.apache.org/jira/browse/KAFKA-5007
> Project: Kafka
>  Issue Type: Bug
>  Components: core, network
>Affects Versions: 0.10.0.0, 0.10.1.1, 0.10.2.0
> Environment: Centos 7
> Jave 8
>Reporter: Joseph Aliase
>Priority: Critical
>  Labels: reliability
>
> Kafka is running out of open file descriptor when system network interface is 
> done.
> Issue description:
> We have a Kafka Cluster of 5 node running on version 0.10.1.1. The open file 
> descriptor for the account running Kafka is set to 10.
> During an upgrade, network interface went down. Outage continued for 12 hours 
> eventually all the broker crashed with java.io.IOException: Too many open 
> files error.
> We repeated the test in a lower environment and observed that Open Socket 
> count keeps on increasing while the NIC is down.
> We have around 13 topics with max partition size of 120 and number of replica 
> fetcher thread is set to 8.
> Using an internal monitoring tool we observed that Open Socket descriptor   
> for the broker pid continued to increase although NIC was down leading to  
> Open File descriptor error. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak

2017-04-07 Thread Joseph Aliase (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961314#comment-15961314
 ] 

Joseph Aliase commented on KAFKA-5007:
--

Those leaked sockets don't show up in netstat. For conducting test I bring down 
the Network interface, as a result, I'm kicked out of the remote host. But I 
collect lsof and netstat using a cron job. And this leaked sockets do not show 
up in netstat.

> Kafka Replica Fetcher Thread- Resource Leak
> ---
>
> Key: KAFKA-5007
> URL: https://issues.apache.org/jira/browse/KAFKA-5007
> Project: Kafka
>  Issue Type: Bug
>  Components: core, network
>Affects Versions: 0.10.0.0, 0.10.1.1, 0.10.2.0
> Environment: Centos 7
> Jave 8
>Reporter: Joseph Aliase
>Priority: Critical
>  Labels: reliability
>
> Kafka is running out of open file descriptor when system network interface is 
> done.
> Issue description:
> We have a Kafka Cluster of 5 node running on version 0.10.1.1. The open file 
> descriptor for the account running Kafka is set to 10.
> During an upgrade, network interface went down. Outage continued for 12 hours 
> eventually all the broker crashed with java.io.IOException: Too many open 
> files error.
> We repeated the test in a lower environment and observed that Open Socket 
> count keeps on increasing while the NIC is down.
> We have around 13 topics with max partition size of 120 and number of replica 
> fetcher thread is set to 8.
> Using an internal monitoring tool we observed that Open Socket descriptor   
> for the broker pid continued to increase although NIC was down leading to  
> Open File descriptor error. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak

2017-04-07 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961194#comment-15961194
 ] 

Jun Rao commented on KAFKA-5007:


Hmm, ReplicaFetcherThread is changed to use the client selector implementation 
since 0.9.0, but I am not sure what's causing the issue with a single broker. 
In the single broker test, do you know which port those leaked sockets are 
associated with? Is that the ZK port?

> Kafka Replica Fetcher Thread- Resource Leak
> ---
>
> Key: KAFKA-5007
> URL: https://issues.apache.org/jira/browse/KAFKA-5007
> Project: Kafka
>  Issue Type: Bug
>  Components: core, network
>Affects Versions: 0.10.0.0, 0.10.1.1, 0.10.2.0
> Environment: Centos 7
> Jave 8
>Reporter: Joseph Aliase
>Priority: Critical
>  Labels: reliability
>
> Kafka is running out of open file descriptor when system network interface is 
> done.
> Issue description:
> We have a Kafka Cluster of 5 node running on version 0.10.1.1. The open file 
> descriptor for the account running Kafka is set to 10.
> During an upgrade, network interface went down. Outage continued for 12 hours 
> eventually all the broker crashed with java.io.IOException: Too many open 
> files error.
> We repeated the test in a lower environment and observed that Open Socket 
> count keeps on increasing while the NIC is down.
> We have around 13 topics with max partition size of 120 and number of replica 
> fetcher thread is set to 8.
> Using an internal monitoring tool we observed that Open Socket descriptor   
> for the broker pid continued to increase although NIC was down leading to  
> Open File descriptor error. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak

2017-04-07 Thread Joseph Aliase (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960777#comment-15960777
 ] 

Joseph Aliase commented on KAFKA-5007:
--

[~junrao][~ijuma] Any help will be appreciated. This is turned out to be a 
blocker for us.

> Kafka Replica Fetcher Thread- Resource Leak
> ---
>
> Key: KAFKA-5007
> URL: https://issues.apache.org/jira/browse/KAFKA-5007
> Project: Kafka
>  Issue Type: Bug
>  Components: core, network
>Affects Versions: 0.10.1.1
> Environment: Centos 7
> Jave 8
>Reporter: Joseph Aliase
>Priority: Critical
>  Labels: reliability
>
> Kafka is running out of open file descriptor when system network interface is 
> done.
> Issue description:
> We have a Kafka Cluster of 5 node running on version 0.10.1.1. The open file 
> descriptor for the account running Kafka is set to 10.
> During an upgrade, network interface went down. Outage continued for 12 hours 
> eventually all the broker crashed with java.io.IOException: Too many open 
> files error.
> We repeated the test in a lower environment and observed that Open Socket 
> count keeps on increasing while the NIC is down.
> We have around 13 topics with max partition size of 120 and number of replica 
> fetcher thread is set to 8.
> Using an internal monitoring tool we observed that Open Socket descriptor   
> for the broker pid continued to increase although NIC was down leading to  
> Open File descriptor error. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak

2017-04-06 Thread Joseph Aliase (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15959426#comment-15959426
 ] 

Joseph Aliase commented on KAFKA-5007:
--

[~junrao] [~ijuma] I repeated the test in 0.8 Kafka Cluster and I don't see the 
open socket issue in 0.8 version.
Adding to that I have replicated the issue in 0.10.0.0 and 0.10.2.0.

Downgrading to 0.8 is not a option for us.



> Kafka Replica Fetcher Thread- Resource Leak
> ---
>
> Key: KAFKA-5007
> URL: https://issues.apache.org/jira/browse/KAFKA-5007
> Project: Kafka
>  Issue Type: Bug
>  Components: core, network
>Affects Versions: 0.10.1.1
> Environment: Centos 7
> Jave 8
>Reporter: Joseph Aliase
>Priority: Critical
>  Labels: reliability
>
> Kafka is running out of open file descriptor when system network interface is 
> done.
> Issue description:
> We have a Kafka Cluster of 5 node running on version 0.10.1.1. The open file 
> descriptor for the account running Kafka is set to 10.
> During an upgrade, network interface went down. Outage continued for 12 hours 
> eventually all the broker crashed with java.io.IOException: Too many open 
> files error.
> We repeated the test in a lower environment and observed that Open Socket 
> count keeps on increasing while the NIC is down.
> We have around 13 topics with max partition size of 120 and number of replica 
> fetcher thread is set to 8.
> Using an internal monitoring tool we observed that Open Socket descriptor   
> for the broker pid continued to increase although NIC was down leading to  
> Open File descriptor error. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak

2017-04-06 Thread Joseph Aliase (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15958945#comment-15958945
 ] 

Joseph Aliase commented on KAFKA-5007:
--

[~junrao]I conducted one more test to confirm that issue is not confined to 
Replica Fetcher thread.

I installed 1 broker and 3 Zookeeper node cluster. And brought down the network 
between broker and zookeeper and I see the same result open files continue to 
elevate. Let me know if you need more info.

> Kafka Replica Fetcher Thread- Resource Leak
> ---
>
> Key: KAFKA-5007
> URL: https://issues.apache.org/jira/browse/KAFKA-5007
> Project: Kafka
>  Issue Type: Bug
>  Components: core, network
>Affects Versions: 0.10.1.1
> Environment: Centos 7
> Jave 8
>Reporter: Joseph Aliase
>  Labels: reliability
>
> Kafka is running out of open file descriptor when system network interface is 
> done.
> Issue description:
> We have a Kafka Cluster of 5 node running on version 0.10.1.1. The open file 
> descriptor for the account running Kafka is set to 10.
> During an upgrade, network interface went down. Outage continued for 12 hours 
> eventually all the broker crashed with java.io.IOException: Too many open 
> files error.
> We repeated the test in a lower environment and observed that Open Socket 
> count keeps on increasing while the NIC is down.
> We have around 13 topics with max partition size of 120 and number of replica 
> fetcher thread is set to 8.
> Using an internal monitoring tool we observed that Open Socket descriptor   
> for the broker pid continued to increase although NIC was down leading to  
> Open File descriptor error. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak

2017-04-06 Thread Joseph Aliase (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15958944#comment-15958944
 ] 

Joseph Aliase commented on KAFKA-5007:
--

[~junrao] Thanks for responding back. I do see sockets in close wait but they 
are not causing Open files to fill up. Instead, I see lot of below lines in 
lsof command. This is what is making application hit its open file limit.

java24739 cyclone  696u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  697u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  698u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  699u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  700u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  701u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  702u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  703u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  704u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  705u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  706u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  707u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  708u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  709u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  710u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  711u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  712u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  713u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  714u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  715u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  716u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  717u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  718u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  719u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  720u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  721u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  722u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  723u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  724u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  725u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  726u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  727u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  728u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  729u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  730u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  731u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  732u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  733u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  734u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  735u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  736u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  737u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  738u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  739u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  740u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  741u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  742u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  743u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  744u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  745u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  746u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  747u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  748u unix 0x880035b35000   0t0  42713601 
socket
java24739 cyclone  749u unix 0x880035b35000   0t0  42713601 

[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak

2017-04-05 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15958124#comment-15958124
 ] 

Jun Rao commented on KAFKA-5007:


[~joseph.alias...@gmail.com], thanks for reporting this. If the replica fetch 
thread hits an IOException, we always close the connection before creating a 
new one. If you do netstat, do those leaked socket show as in close_wait state?

> Kafka Replica Fetcher Thread- Resource Leak
> ---
>
> Key: KAFKA-5007
> URL: https://issues.apache.org/jira/browse/KAFKA-5007
> Project: Kafka
>  Issue Type: Bug
>  Components: core, network
>Affects Versions: 0.10.1.1
> Environment: Centos 7
> Jave 8
>Reporter: Joseph Aliase
>
> Kafka is running out of open file descriptor when system network interface is 
> done.
> Issue description:
> We have a Kafka Cluster of 5 node running on version 0.10.1.1. The open file 
> descriptor for the account running Kafka is set to 10.
> During an upgrade, network interface went down. Outage continued for 12 hours 
> eventually all the broker crashed with java.io.IOException: Too many open 
> files error.
> We repeated the test in a lower environment and observed that Open Socket 
> count keeps on increasing while the NIC is down.
> We have around 13 topics with max partition size of 120 and number of replica 
> fetcher thread is set to 8.
> Using an internal monitoring tool we observed that Open Socket descriptor   
> for the broker pid continued to increase although NIC was down leading to  
> Open File descriptor error. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak

2017-04-05 Thread Joseph Aliase (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15958049#comment-15958049
 ] 

Joseph Aliase commented on KAFKA-5007:
--

[~junrao]Any guidance will be helpfull. 

> Kafka Replica Fetcher Thread- Resource Leak
> ---
>
> Key: KAFKA-5007
> URL: https://issues.apache.org/jira/browse/KAFKA-5007
> Project: Kafka
>  Issue Type: Bug
>  Components: core, network
>Affects Versions: 0.10.1.1
> Environment: Centos 7
> Jave 8
>Reporter: Joseph Aliase
>
> Kafka is running out of open file descriptor when system network interface is 
> done.
> Issue description:
> We have a Kafka Cluster of 5 node running on version 0.10.1.1. The open file 
> descriptor for the account running Kafka is set to 10.
> During an upgrade, network interface went down. Outage continued for 12 hours 
> eventually all the broker crashed with java.io.IOException: Too many open 
> files error.
> We repeated the test in a lower environment and observed that Open Socket 
> count keeps on increasing while the NIC is down.
> We have around 13 topics with max partition size of 120 and number of replica 
> fetcher thread is set to 8.
> Using an internal monitoring tool we observed that Open Socket descriptor   
> for the broker pid continued to increase although NIC was down leading to  
> Open File descriptor error. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak

2017-04-05 Thread Joseph Aliase (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15957413#comment-15957413
 ] 

Joseph Aliase commented on KAFKA-5007:
--

[~gwenshap] Please take a look at this its a major bug

> Kafka Replica Fetcher Thread- Resource Leak
> ---
>
> Key: KAFKA-5007
> URL: https://issues.apache.org/jira/browse/KAFKA-5007
> Project: Kafka
>  Issue Type: Bug
>  Components: core, network
>Affects Versions: 0.10.1.1
> Environment: Centos 7
> Jave 8
>Reporter: Joseph Aliase
>
> Kafka is running out of open file descriptor when system network interface is 
> done.
> Issue description:
> We have a Kafka Cluster of 5 node running on version 0.10.1.1. The open file 
> descriptor for the account running Kafka is set to 10.
> During an upgrade, network interface went down. Outage continued for 12 hours 
> eventually all the broker crashed with java.io.IOException: Too many open 
> files error.
> We repeated the test in a lower environment and observed that Open Socket 
> count keeps on increasing while the NIC is down.
> We have around 13 topics with max partition size of 120 and number of replica 
> fetcher thread is set to 8.
> Using an internal monitoring tool we observed that Open Socket descriptor   
> for the broker pid continued to increase although NIC was down leading to  
> Open File descriptor error. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak

2017-04-04 Thread Joseph Aliase (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15956036#comment-15956036
 ] 

Joseph Aliase commented on KAFKA-5007:
--

[~ijuma] Can you take look into this issue

> Kafka Replica Fetcher Thread- Resource Leak
> ---
>
> Key: KAFKA-5007
> URL: https://issues.apache.org/jira/browse/KAFKA-5007
> Project: Kafka
>  Issue Type: Bug
>  Components: core, network
>Affects Versions: 0.10.1.1
> Environment: Centos 7
> Jave 8
>Reporter: Joseph Aliase
>
> Kafka is running out of open file descriptor when system network interface is 
> done.
> Issue description:
> We have a Kafka Cluster of 5 node running on version 0.10.1.1. The open file 
> descriptor for the account running Kafka is set to 10.
> During an upgrade, network interface went down. Outage continued for 12 hours 
> eventually all the broker crashed with java.io.IOException: Too many open 
> files error.
> We repeated the test in a lower environment and observed that Open Socket 
> count keeps on increasing while the NIC is down.
> We have around 13 topics with max partition size of 120 and number of replica 
> fetcher thread is set to 8.
> Using an internal monitoring tool we observed that Open Socket descriptor   
> for the broker pid continued to increase although NIC was down leading to  
> Open File descriptor error. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)