[ https://issues.apache.org/jira/browse/KAFKA-4674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16317809#comment-16317809 ]
Andrey Falko commented on KAFKA-4674:
-------------------------------------

I just reproduced this on a 5-node test cluster by gradually creating 35k topics, each replicated 3x, and holding a consumer for each topic. I'm running the latest release, 1.0.0. How many partitions do you have in your setups, and what is their replication factor?

> Frequent ISR shrinking and expanding and disconnects among brokers
> ------------------------------------------------------------------
>
> Key: KAFKA-4674
> URL: https://issues.apache.org/jira/browse/KAFKA-4674
> Project: Kafka
> Issue Type: Bug
> Components: controller, core
> Affects Versions: 0.10.0.1
> Environment: OS: Redhat Linux 2.6.32-431.el6.x86_64
> JDK: 1.8.0_45
> Reporter: Kaiming Wan
> Attachments: controller.log.rar, kafkabroker.20170221.log.zip,
> server.log.2017-01-11-14, zookeeper.out.2017-01-11.log
>
>
> We use a Kafka cluster with 3 brokers in our production environment. It
> worked well for several months. Recently, we started getting the
> UnderReplicatedPartitions>0 warning mail. When we checked the log, we found
> that partitions were constantly going through ISR shrinking and expanding.
> The disconnection exception can also be found in the controller's log.
> We also found some abnormal output in ZooKeeper's log, pointing to a
> consumer (using the old API, which depends on ZooKeeper) that had stopped
> working and had built up a large lag.
> Actually, this is not the first time we have encountered this problem.
> The first time, we saw the same symptoms and the same log output. We solved
> the problem by deleting the consumer's node info in ZooKeeper. After that,
> everything went well.
> However, this time, after we deleted the consumer that already had a
> large lag, the frequent ISR shrinking and expanding did not stop for a very
> long time (several hours). Although the issue did not affect our consumers
> and producers, we think it will make our cluster unstable. So in the end, we
> solved this problem by restarting the controller broker.
> And now I wonder what causes this problem. I checked the source code and
> only learned that a poll timeout will cause disconnection and ISR shrinking.
> Is the issue related to ZooKeeper, in that it cannot handle too many
> metadata modifications and so makes the replication fetch thread take more
> time?
> I have uploaded the log files as attachments.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)