[ https://issues.apache.org/jira/browse/KAFKA-10127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17310982#comment-17310982 ]

Ajay Patel commented on KAFKA-10127:
------------------------------------

You might have better luck with [KAFKA-2729|https://issues.apache.org/jira/browse/KAFKA-2729], as that issue sounds identical and has more traffic.

> kafka cluster not recovering - Shrinking ISR continously
> ---------------------------------------------------------
>
>                 Key: KAFKA-10127
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10127
>             Project: Kafka
>          Issue Type: Bug
>          Components: replication, zkclient
>    Affects Versions: 2.4.1
>         Environment: using kafka version 2.4.1 and zookeeper version 3.5.7
>            Reporter: Youssef BOUZAIENNE
>            Priority: Major
>
> From time to time we face an issue where our Kafka cluster goes into a
> weird state. We see the following log messages repeating:
>
> [2020-06-06 08:35:48,117] INFO [Partition test broker=1002] Cached zkVersion 620 not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> [2020-06-06 08:35:48,117] INFO [Partition test broker=1002] Shrinking ISR from 1006,1002 to 1002. Leader: (highWatermark: 3222733572, endOffset: 3222741893). Out of sync replicas: (brokerId: 1006, endOffset: 3222733572). (kafka.cluster.Partition)
>
> Just before that, our ZooKeeper session expired, which led us to that state.
>
> After we increased the two values below, we encounter the issue less
> frequently, but it still appears from time to time, and the only way to
> recover is to restart the Kafka service on all brokers.
>
> zookeeper.session.timeout.ms=18000
> replica.lag.time.max.ms=30000
>
> Any thoughts on that, please?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)