[ https://issues.apache.org/jira/browse/KAFKA-4674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16317809#comment-16317809 ]

Andrey Falko commented on KAFKA-4674:
-------------------------------------

I just reproduced this on a 5-node test cluster by gradually creating 35k 
3x-replicated topics and holding a consumer open for each topic. I'm running 
the latest 1.0.0.
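
For reference, something like the loop below can drive that kind of gradual 
topic creation with the Java AdminClient. This is only a minimal sketch: the 
bootstrap address, topic prefix, pacing, and the omitted per-topic consumers 
are placeholders, not the exact script I used.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateManyTopics {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Placeholder bootstrap address; point this at your own brokers.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                for (int i = 0; i < 35_000; i++) {
                    // 1 partition, replication factor 3 (the "3x replicated" case above).
                    NewTopic topic = new NewTopic("load-test-" + i, 1, (short) 3);
                    admin.createTopics(Collections.singleton(topic)).all().get();
                    // Pace the creations so the controller is loaded gradually.
                    // (Starting a consumer per topic is omitted here.)
                    Thread.sleep(100);
                }
            }
        }
    }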

How many partitions do you have and what is their replication factor in your 
setups?
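
If it helps, partition counts and replication factors can be pulled with the 
AdminClient as well; a small sketch (the bootstrap address is again a 
placeholder):

    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;

    public class DescribeTopics {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // List every topic, then describe them to print the partition
                // count and the replication factor of the first partition.
                Map<String, TopicDescription> descriptions =
                    admin.describeTopics(admin.listTopics().names().get()).all().get();
                descriptions.forEach((name, desc) ->
                    System.out.printf("%s: %d partitions, RF=%d%n",
                        name, desc.partitions().size(),
                        desc.partitions().get(0).replicas().size()));
            }
        }
    }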

> Frequent ISR shrinking and expanding and disconnects among brokers
> ------------------------------------------------------------------
>
>                 Key: KAFKA-4674
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4674
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller, core
>    Affects Versions: 0.10.0.1
>         Environment: OS: Redhat Linux 2.6.32-431.el6.x86_64
> JDK: 1.8.0_45
>            Reporter: Kaiming Wan
>         Attachments: controller.log.rar, kafkabroker.20170221.log.zip, 
> server.log.2017-01-11-14, zookeeper.out.2017-01-11.log
>
>
>     We run a Kafka cluster with 3 brokers in our production environment. It 
> worked well for several months. Recently, we received the 
> UnderReplicatedPartitions>0 warning mail. When we checked the logs, we found 
> that the partitions were constantly going through ISR shrinking and 
> expanding, and disconnection exceptions appeared in the controller's log.
>     We also found some unusual output in ZooKeeper's log pointing to a 
> consumer (using the old ZooKeeper-based API) that had stopped consuming and 
> had built up a large lag.
>     This is not the first time we have encountered this problem. The first 
> time, we saw the same symptoms and the same log output, and we solved the 
> problem by deleting the consumer's node info in ZooKeeper. Then everything 
> went well.
>     However, this time, after we deleted the consumer that already had a 
> large lag, the frequent ISR shrinking and expanding did not stop for a very 
> long time (several hours). Although the issue did not affect our consumers 
> and producers, we think it makes the cluster unstable, so in the end we 
> solved the problem by restarting the controller broker.
>     Now I wonder what causes this problem. From the source code I only know 
> that a poll timeout will cause disconnection and ISR shrinking. Is the issue 
> related to ZooKeeper, because it cannot keep up with so many metadata 
> modifications and makes the replica fetcher threads take more time?
> I have uploaded the log files as attachments.
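
For context on the workaround described above: old-API consumers keep their 
registration, owners, and offsets under /consumers/<group> in ZooKeeper, so 
"deleting the consumer node info" amounts to removing that subtree. A minimal 
sketch with the ZooKeeper Java client (the connect string and group name are 
placeholders, not taken from the report):

    import org.apache.zookeeper.ZKUtil;
    import org.apache.zookeeper.ZooKeeper;

    public class DeleteOldConsumerState {
        public static void main(String[] args) throws Exception {
            // Placeholders: ZooKeeper connect string and the lagging group's name.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> { });
            try {
                // Old (ZooKeeper-based) consumers store their state under
                // /consumers/<group>; removing the subtree drops that state.
                ZKUtil.deleteRecursive(zk, "/consumers/my-lagging-group");
            } finally {
                zk.close();
            }
        }
    }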



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
