Kaiming Wan created KAFKA-4674:
----------------------------------

             Summary: Frequent ISR shrinking and expanding and disconnects 
among brokers
                 Key: KAFKA-4674
                 URL: https://issues.apache.org/jira/browse/KAFKA-4674
             Project: Kafka
          Issue Type: Bug
          Components: controller, core
    Affects Versions: 0.10.0.1
         Environment: OS: Redhat Linux 2.6.32-431.el6.x86_64
JDK: 1.8.0_45
            Reporter: Kaiming Wan
         Attachments: controller.log.rar, server.log.2017-01-11-14, 
zookeeper.out.2017-01-11.log

    We run a Kafka cluster with 3 brokers in our production environment. It 
worked well for several months. Recently, we started receiving 
UnderReplicatedPartitions>0 warning mails. When we checked the logs, we found 
that partitions were constantly going through ISR shrinking and expanding, and 
disconnection exceptions appeared in the controller's log.
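    To quantify the flapping, a small script along the following lines could 
count shrink/expand events per partition in server.log. This is only a sketch: 
the regular expression assumes the broker writes lines such as "Shrinking ISR 
for partition [topic,0] from 1,2,3 to 1,2", and the function name 
count_isr_events and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumed shape of the broker's ISR messages; adjust if your log differs.
ISR_EVENT = re.compile(
    r"(Shrinking|Expanding) ISR for partition \[([^,\]]+),(\d+)\]"
)


def count_isr_events(lines):
    """Return a Counter keyed by (topic, partition, event)."""
    counts = Counter()
    for line in lines:
        match = ISR_EVENT.search(line)
        if match:
            event, topic, partition = match.groups()
            counts[(topic, int(partition), event)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines; in practice iterate over the real log file.
    sample = [
        "INFO Partition [orders,0] on broker 1: "
        "Shrinking ISR for partition [orders,0] from 1,2,3 to 1,2",
        "INFO Partition [orders,0] on broker 1: "
        "Expanding ISR for partition [orders,0] from 1,2 to 1,2,3",
        "INFO some unrelated line",
    ]
    for key, n in sorted(count_isr_events(sample).items()):
        print(key, n)
```

    Passing an open file handle instead of the sample list works the same way, 
since the function only iterates over lines.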
    We also found some abnormal output in ZooKeeper's log, which points to a 
consumer (using the old ZooKeeper-based API) that has stopped working and has 
accumulated a large lag.
    Actually, this is not the first time we have encountered this problem. The 
first time, we saw the same symptoms and the same log output. We solved it by 
deleting the consumer's node info in ZooKeeper, after which everything went 
well.
    However, this time, after we deleted the consumer that already had a large 
lag, the frequent ISR shrinking and expanding did not stop for a very long 
time (several hours). Although the issue did not affect our consumers and 
producers, we think it makes our cluster unstable. In the end, we solved the 
problem by restarting the controller broker.

    Now I wonder what causes this problem. I checked the source code and only 
learned that a poll timeout will cause disconnection and ISR shrinking. Is the 
issue related to ZooKeeper, in that it cannot handle too many metadata 
modifications and so makes the replica fetcher threads take more time?
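
    For reference, my understanding of the shrink rule on the leader is that a 
follower is dropped from the ISR once it has not caught up to the leader's log 
end for longer than replica.lag.time.max.ms (10000 ms by default). The snippet 
below is only a simplified model of that rule, not the broker's code; the 
Follower class and the out_of_sync function are hypothetical names used for 
illustration.

```python
from dataclasses import dataclass

# Broker default for replica.lag.time.max.ms.
REPLICA_LAG_TIME_MAX_MS = 10_000


@dataclass
class Follower:
    """Hypothetical stand-in for a follower replica as seen by the leader."""
    broker_id: int
    last_caught_up_ms: int  # last time it had fetched up to the leader's log end


def out_of_sync(followers, now_ms, max_lag_ms=REPLICA_LAG_TIME_MAX_MS):
    """Return the broker ids the leader would remove from the ISR."""
    return [
        f.broker_id
        for f in followers
        if now_ms - f.last_caught_up_ms > max_lag_ms
    ]


if __name__ == "__main__":
    followers = [Follower(2, 95_000), Follower(3, 80_000)]
    # Broker 3 last caught up 20 s ago, so only it exceeds the 10 s limit.
    print(out_of_sync(followers, now_ms=100_000))
```

    Under this model, anything that stalls the follower's fetches for more 
than that window (a timed-out request, a disconnect) would produce a shrink, 
followed by an expand once the follower catches up again.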

I have uploaded the log files as attachments.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
