[jira] [Commented] (KAFKA-5007) Kafka Replica Fetcher Thread- Resource Leak

cuiyang (JIRA) Fri, 12 May 2017 18:57:27 -0700

    [ https://issues.apache.org/jira/browse/KAFKA-5007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009039#comment-16009039 ]


cuiyang commented on KAFKA-5007:
--------------------------------

[~joseph.alias...@gmail.com] [~junrao] 
We have also hit this issue 3 times in the last 3 weeks in our production environment.
Our Kafka clusters run versions 0.9.0.1 and 0.10.1.2, and I am sure both versions 
have this issue.
We set the fd limit to 100000 with "ulimit -c 100000", but it did not help.
When the issue happened, we were monitoring the fd count on the broker, and it was not high:
2017-05-12-22:39:56 FD_total_num:8153 FD_pair_num:6205 FD_ads_num:1459
2017-05-12-22:40:07 FD_total_num:8157 FD_pair_num:6206 FD_ads_num:1459
2017-05-12-22:40:18 FD_total_num:8155 FD_pair_num:6207 FD_ads_num:1459
2017-05-12-22:40:29 FD_total_num:8158 FD_pair_num:6208 FD_ads_num:1460
2017-05-12-22:40:40 FD_total_num:8160 FD_pair_num:6211 FD_ads_num:1461
2017-05-12-22:40:51 FD_total_num:8162 FD_pair_num:6213 FD_ads_num:1461
2017-05-12-22:41:02 FD_total_num:8172 FD_pair_num:6214 FD_ads_num:1462
2017-05-12-22:41:13 FD_total_num:8167 FD_pair_num:6214 FD_ads_num:1462
2017-05-12-22:41:24 FD_total_num:8172 FD_pair_num:6215 FD_ads_num:1462
2017-05-12-22:41:36 FD_total_num:8172 FD_pair_num:6216 FD_ads_num:1462
2017-05-12-22:41:47 FD_total_num:8169 FD_pair_num:6216 FD_ads_num:1462
2017-05-12-22:41:58 FD_total_num:8193 FD_pair_num:6216 FD_ads_num:1462
2017-05-12-22:42:08 FD_total_num:0 FD_pair_num:0 FD_ads_num:0
2017-05-12-22:42:19 FD_total_num:0 FD_pair_num:0 FD_ads_num:0
2017-05-12-22:42:29 FD_total_num:0 FD_pair_num:0 FD_ads_num:0
From 2017-05-12-22:42:08 onward the FD count is 0 because the broker was down.
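
To compare the observed fd count with the limit the running broker actually has, one option on Linux is to read /proc/<pid>/limits and /proc/<pid>/fd directly. The following is only a minimal sketch, not what our monitoring script does: the function names and the proc_root parameter are hypothetical, and it assumes a Linux /proc filesystem and permission to read the broker's entries. Note that in bash the option for the open-file limit is "ulimit -n" ("-c" sets the core file size), and a limit set in one shell does not necessarily apply to a broker started elsewhere, so checking the live process is worthwhile.

```python
import os


def _to_int(value):
    # The kernel prints "unlimited" when no limit applies.
    return None if value == "unlimited" else int(value)


def parse_max_open_files(limits_text):
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            # Row layout: Max open files <soft> <hard> files
            return _to_int(fields[3]), _to_int(fields[4])
    raise ValueError("no 'Max open files' row found")


def count_open_fds(pid, proc_root="/proc"):
    """Count the entries in /proc/<pid>/fd, one per open descriptor."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def fd_usage(pid, proc_root="/proc"):
    """Return (open_fds, soft_limit) for the given process."""
    with open(os.path.join(proc_root, str(pid), "limits")) as f:
        soft, _hard = parse_max_open_files(f.read())
    return count_open_fds(pid, proc_root), soft
```

Calling fd_usage(<broker pid>) periodically would show how close the broker is to its soft limit at each sample.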


> Kafka Replica Fetcher Thread- Resource Leak
> -------------------------------------------
>
>                 Key: KAFKA-5007
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5007
>             Project: Kafka
>          Issue Type: Bug
>          Components: core, network
>    Affects Versions: 0.10.0.0, 0.10.1.1, 0.10.2.0
>         Environment: CentOS 7
> Java 8
>            Reporter: Joseph Aliase
>            Priority: Critical
>              Labels: reliability
>         Attachments: jstack-kafka.out, jstack-zoo.out, lsofkafka.txt, 
> lsofzookeeper.txt
>
>
> Kafka runs out of open file descriptors when the system network interface is 
> down.
> Issue description:
> We have a 5-node Kafka cluster running version 0.10.1.1. The open file 
> descriptor limit for the account running Kafka is set to 100000.
> During an upgrade, the network interface went down. The outage continued for 
> 12 hours, and eventually all the brokers crashed with a java.io.IOException: 
> Too many open files error.
> We repeated the test in a lower environment and observed that the open socket 
> count keeps increasing while the NIC is down.
> We have around 13 topics with a max partition size of 120, and the number of 
> replica fetcher threads is set to 8.
> Using an internal monitoring tool, we observed that the open socket descriptor 
> count for the broker pid continued to increase although the NIC was down, 
> leading to the open file descriptor error.
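
The socket growth described in the issue could be tracked without an internal tool by classifying the symlink targets under /proc/<pid>/fd. The following is a rough sketch, assuming Linux, where socket descriptors resolve to targets of the form "socket:[inode]". The function names and the proc_root parameter are hypothetical, and this is not the reporter's monitoring tool.

```python
import os


def read_fd_targets(pid, proc_root="/proc"):
    """Return the symlink target of every open descriptor of a process."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    targets = []
    for name in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return targets


def classify_fd_targets(targets):
    """Count how many descriptor targets are sockets versus anything else."""
    counts = {"socket": 0, "other": 0}
    for target in targets:
        key = "socket" if target.startswith("socket:") else "other"
        counts[key] += 1
    return counts
```

Sampling classify_fd_targets(read_fd_targets(<broker pid>)) at a fixed interval while the NIC is down would show whether the "socket" count is the part that keeps growing.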



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
