[jira] [Commented] (KAFKA-5153) KAFKA Cluster : 0.10.2.0 : Servers Getting disconnected : Service Impacting
[ https://issues.apache.org/jira/browse/KAFKA-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16046534#comment-16046534 ]

Bart Vercammen commented on KAFKA-5153:
---

[~arpan.khagram0...@gmail.com] Can you confirm that changing the default config fixed the issue for you?
{noformat}
replica.fetch.wait.max.ms
replica.lag.time.max.ms
{noformat}
We're also hitting this issue continuously on our clusters running Kafka 0.10.1.1, and we have also encountered it (or at least something with the same symptoms) on our Kafka 0.10.2.1 clusters.
We still need to investigate in more detail what is actually triggering this, so any tips or insights would be welcome.

> KAFKA Cluster : 0.10.2.0 : Servers Getting disconnected : Service Impacting
> ---
>
> Key: KAFKA-5153
> URL: https://issues.apache.org/jira/browse/KAFKA-5153
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.10.2.0
> Environment: RHEL 6
> Java Version 1.8.0_91-b14
> Reporter: Arpan
> Priority: Critical
> Attachments: server_1_72server.log, server_2_73_server.log,
> server_3_74Server.log, server.properties, ThreadDump_1493564142.dump,
> ThreadDump_1493564177.dump, ThreadDump_1493564249.dump
>
>
> Hi Team,
> I was earlier referring to issue KAFKA-4477 because the problem I am facing
> is similar. I tried to find the same reference in the release docs as well but
> did not find anything in 0.10.1.1 or 0.10.2.0. I am currently using
> 2.11_0.10.2.0.
> I have a 3-node cluster for KAFKA, and a ZK cluster on the same set
> of servers in cluster mode. Around 240GB of data is transferred
> through KAFKA every day. What we are observing is that a server disconnects
> from the cluster, the ISR shrinks, and it starts impacting service.
> I have also observed the file descriptor count increasing a bit: in normal
> circumstances we have not observed an FD count of more than 500, but when the
> issue started we were observing it in the range of 650-700 on all 3 servers.
> Attaching thread dumps of all 3 servers from when we started facing the issue
> recently.
> The issue vanishes once you bounce the nodes, and the setup does not
> run for more than 5 days without this issue. Attaching server logs as well.
> Kindly let me know if you need any additional information. Attaching
> server.properties as well for one of the servers (it's similar on all 3
> servers).

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
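The report above tracks the broker's file descriptor count (normally below 500, rising to 650-700 once the problem starts). A minimal Python sketch of how that count could be sampled follows, assuming a Linux host where each open descriptor shows up as one entry under `/proc/<pid>/fd`. The function names and the `proc_root` parameter are hypothetical and not part of Kafka; the baseline of 500 is simply the figure quoted in the report.

```python
import os


def count_open_fds(pid, proc_root="/proc"):
    """Return the number of open file descriptors for a process.

    Assumes a Linux-style procfs in which every open descriptor appears as
    one entry under <proc_root>/<pid>/fd. proc_root is a hypothetical
    parameter added so the function can be pointed at a test directory.
    """
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def fd_count_is_elevated(fd_count, baseline=500):
    """Return True when the count exceeds the baseline.

    The default of 500 is the ceiling the reporter saw in normal operation.
    """
    return fd_count > baseline
```

Calling `fd_count_is_elevated(count_open_fds(broker_pid))` periodically, where `broker_pid` stands for the broker's process ID, would flag the 650-700 range described in the report. Reading another process's `fd` directory usually requires running as the same user or as root.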
[jira] [Commented] (KAFKA-5153) KAFKA Cluster : 0.10.2.0 : Servers Getting disconnected : Service Impacting
[ https://issues.apache.org/jira/browse/KAFKA-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16042350#comment-16042350 ]

Qingsong Xie commented on KAFKA-5153:
---

[~dhirajpraj] Hi, this bug has been fixed in 0.10.1.1. I've upgraded Kafka to 0.10.1.1 and it has been running normally for 2 weeks now.
[jira] [Commented] (KAFKA-5153) KAFKA Cluster : 0.10.2.0 : Servers Getting disconnected : Service Impacting
[ https://issues.apache.org/jira/browse/KAFKA-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16019542#comment-16019542 ]

dhiraj prajapati commented on KAFKA-5153:
---

Hi all,
We have a 3-node cluster in our production environment. We recently upgraded Kafka from 0.9.0.1 to 0.10.1.0 and we are seeing a similar issue of intermittent disconnection. We never had this issue in 0.9.0.1.
Below is the exception stack trace:

[2017-05-15 09:33:55,398] WARN [ReplicaFetcherThread-0-2], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@7213d6d (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 2 was disconnected before the response was read
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
        at scala.Option.foreach(Option.scala:257)
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
        at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
        at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
        at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
        at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)

Is there a fix for this issue in any of the Kafka 0.10.x versions?
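To see how often the replica fetcher hits this exception and which broker it was talking to, the broker log can be scanned for the `IOException` message quoted in the trace above. The following is a minimal Python sketch under that assumption: the regular expression is built only from the message text in this comment, and the name `count_disconnects` is hypothetical.

```python
import re
from collections import Counter

# Matches the IOException message quoted in the stack trace and captures
# the id of the broker the fetcher was connected to.
DISCONNECT_RE = re.compile(
    r"Connection to (\d+) was disconnected before the response was read"
)


def count_disconnects(log_lines):
    """Count fetcher disconnects per remote broker id.

    log_lines is any iterable of strings, for example an open log file.
    """
    counts = Counter()
    for line in log_lines:
        match = DISCONNECT_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

Passing an open file handle for a broker's server log to `count_disconnects` returns a `Counter` keyed by broker id, which makes it easier to tell whether the disconnects cluster around a single broker or are spread across the whole cluster.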
[jira] [Commented] (KAFKA-5153) KAFKA Cluster : 0.10.2.0 : Servers Getting disconnected : Service Impacting
[ https://issues.apache.org/jira/browse/KAFKA-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16019115#comment-16019115 ]

Qingsong Xie commented on KAFKA-5153:
---

[~arpan.khagram0...@gmail.com] Hi, did this issue occur again?
[jira] [Commented] (KAFKA-5153) KAFKA Cluster : 0.10.2.0 : Servers Getting disconnected : Service Impacting
[ https://issues.apache.org/jira/browse/KAFKA-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15996597#comment-15996597 ]

Arpan commented on KAFKA-5153:
---

Hi @huxi - I have not set any of the properties you mentioned in server.properties, so they must be taking the default values from the Kafka documentation:
replica.fetch.wait.max.ms --- 500
replica.lag.time.max.ms --- 10000 (which is 20 times replica.fetch.wait.max.ms)
I am changing my cluster today to the updated values below:
replica.fetch.wait.max.ms --- 1
replica.lag.time.max.ms --- 20 (which is 20 times replica.fetch.wait.max.ms)
Kindly let me know whether this should be OK. I shall monitor it for the next few days and see if the issue occurs again.
Thanks.
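When neither property appears in server.properties, the broker falls back to the defaults documented for Kafka 0.10.x: 500 ms for `replica.fetch.wait.max.ms` and 10000 ms for `replica.lag.time.max.ms`. A minimal Python sketch of how the effective values, and the ratio between them, could be checked from the file's contents is shown below. The function names are hypothetical, and only simple `key=value` lines are handled.

```python
# Broker defaults from the Kafka 0.10.x documentation, in milliseconds.
DEFAULTS = {
    "replica.fetch.wait.max.ms": 500,
    "replica.lag.time.max.ms": 10000,
}


def effective_replica_settings(properties_text):
    """Return the effective values of the two replica settings.

    properties_text is the content of a server.properties file. Blank
    lines, '#' comments, and lines without '=' are skipped; any other
    properties syntax is outside the scope of this sketch.
    """
    settings = dict(DEFAULTS)
    for raw in properties_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key in settings:
            settings[key] = int(value.strip())
    return settings


def lag_to_fetch_wait_ratio(settings):
    """Ratio of replica.lag.time.max.ms to replica.fetch.wait.max.ms."""
    return settings["replica.lag.time.max.ms"] / settings["replica.fetch.wait.max.ms"]
```

With an empty or unrelated properties file the sketch returns the defaults, for which the ratio is 20, matching the relationship described in the comment above.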
[jira] [Commented] (KAFKA-5153) KAFKA Cluster : 0.10.2.0 : Servers Getting disconnected : Service Impacting
[ https://issues.apache.org/jira/browse/KAFKA-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15994197#comment-15994197 ]

huxi commented on KAFKA-5153:
---

Could you set `replica.lag.time.max.ms`, `replica.fetch.wait.max.ms` to larger values and retry?
[jira] [Commented] (KAFKA-5153) KAFKA Cluster : 0.10.2.0 : Servers Getting disconnected : Service Impacting
[ https://issues.apache.org/jira/browse/KAFKA-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15993116#comment-15993116 ]

Arpan commented on KAFKA-5153:
---

Yes, KAFKA-4477 was fixed by KAFKA-3994 in 0.10.1.1, but if this is the same issue it should not have appeared in 0.10.2.0. That is why I opened a separate issue. Am I missing something in the configuration?