[ https://issues.apache.org/jira/browse/KAFKA-4674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829525#comment-15829525 ]
huxi commented on KAFKA-4674:
-----------------------------
Is it a duplicate of [KAFKA-3916|https://issues.apache.org/jira/browse/KAFKA-3916]?
> Frequent ISR shrinking and expanding and disconnects among brokers
> ------------------------------------------------------------------
>
> Key: KAFKA-4674
> URL: https://issues.apache.org/jira/browse/KAFKA-4674
> Project: Kafka
> Issue Type: Bug
> Components: controller, core
> Affects Versions: 0.10.0.1
> Environment: OS: Redhat Linux 2.6.32-431.el6.x86_64
> JDK: 1.8.0_45
> Reporter: Kaiming Wan
> Attachments: controller.log.rar, server.log.2017-01-11-14,
> zookeeper.out.2017-01-11.log
>
>
> We run a Kafka cluster with 3 brokers in our production environment. It worked
> well for several months. Recently, we received an UnderReplicatedPartitions>0
> warning email. When we checked the logs, we found that partitions were
> repeatedly going through ISR shrinking and expanding, and disconnection
> exceptions appeared in the controller's log.
> We also found some abnormal output in ZooKeeper's log that points to a
> consumer (using the old API, which depends on ZooKeeper) that had stopped
> working and had built up a large lag.
> In fact, this is not the first time we have encountered this problem. The
> first time, we observed the same symptoms and the same log output.
> We solved it by deleting the consumer's node info in ZooKeeper, after which
> everything went back to normal.
> This time, however, after we deleted the consumer that already had a
> large lag, the frequent ISR shrinking and expanding did not stop for a very
> long time (several hours). Although the issue did not affect our consumers
> and producers, we think it makes our cluster unstable. In the end, we solved
> the problem by restarting the controller broker.
> Now I wonder what caused this problem. I checked the source code and
> only learned that a poll timeout will cause disconnection and ISR shrinking.
> Is the issue related to ZooKeeper, in that it cannot handle too many metadata
> modifications, which makes the replica fetcher threads take more time?
> I have uploaded the log files as attachments.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)