Team,

We are observing the ISR shrinking and expanding very frequently. In the follower's logs, the errors below are observed:
[2018-12-06 20:00:42,709] WARN [ReplicaFetcherThread-2-15], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@a0f9ba9 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 15 was disconnected before the response was read
    at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$3(NetworkClientBlockingOps.scala:114)
    at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$3$adapted(NetworkClientBlockingOps.scala:112)
    at scala.Option.foreach(Option.scala:257)
    at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$1(NetworkClientBlockingOps.scala:112)
    at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:136)
    at kafka.utils.NetworkClientBlockingOps$.pollContinuously$extension(NetworkClientBlockingOps.scala:142)
    at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
    at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:249)
    at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:234)
    at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
    at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
    at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)

Can someone explain this error and help us understand how we can resolve these under-replicated partitions?

server.properties file:

broker.id=15
port=9092
zookeeper.connect=zk1,zk2,zk3,zk4,zk5,zk6
default.replication.factor=2
log.dirs=/data/kafka
delete.topic.enable=true
zookeeper.session.timeout.ms=10000
inter.broker.protocol.version=0.10.2
num.partitions=3
min.insync.replicas=1
log.retention.ms=259200000
message.max.bytes=20971520
replica.fetch.max.bytes=20971520
replica.fetch.response.max.bytes=20971520
max.partition.fetch.bytes=20971520
fetch.max.bytes=20971520
log.flush.interval.ms=5000
log.roll.hours=24
num.replica.fetchers=3
num.io.threads=8
num.network.threads=6
log.message.format.version=0.9.0.1

Also, in what cases can we end up in this state? We have 1,200-1,400 topics and 5,000-6,000 partitions spread across a 20-node cluster, but only 30-40 partitions are under-replicated while the rest stay in sync. About 95% of these partitions have a replication factor of 2.

--
*Suman*
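P.S. In case it is useful for anyone looking at this, here is one way to enumerate the under-replicated partitions programmatically. This is only a minimal sketch, assuming the Java AdminClient (clients 0.11.0+); the bootstrap address "broker1:9092" and the class name are placeholders, not our actual hosts or code.

import java.util.Properties;
import java.util.Set;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class UnderReplicatedPartitions {
    public static void main(String[] args) throws InterruptedException, ExecutionException {
        Properties props = new Properties();
        // Placeholder bootstrap address -- replace with a real broker host:port.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Fetch all topic names, then describe them to get per-partition replica/ISR details.
            Set<String> topics = admin.listTopics().names().get();
            for (TopicDescription desc : admin.describeTopics(topics).all().get().values()) {
                desc.partitions().forEach(p -> {
                    // A partition is under-replicated when its ISR is smaller than its replica set.
                    if (p.isr().size() < p.replicas().size()) {
                        System.out.printf("%s-%d replicas=%d isr=%d%n",
                                desc.name(), p.partition(), p.replicas().size(), p.isr().size());
                    }
                });
            }
        }
    }
}

The same list can also be pulled with the stock tooling: kafka-topics.sh --describe --under-replicated-partitions.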