Hello, I'm dealing with a strange issue in production and I'm running out of ideas for what to do about it.
It's a 3-node cluster running Kafka 0.11.0.1, with most topics having a replication factor of 2. At some point, the broker that is about to die shrinks the ISR for a few partitions down to just itself:

[2017-09-15 11:25:29,104] INFO Partition [...,12] on broker 3: Shrinking ISR from 3,2 to 3 (kafka.cluster.Partition)
[2017-09-15 11:25:29,107] INFO Partition [...,8] on broker 3: Shrinking ISR from 3,1 to 3 (kafka.cluster.Partition)
[2017-09-15 11:25:29,108] INFO Partition [...,38] on broker 3: Shrinking ISR from 3,2 to 3 (kafka.cluster.Partition)

Shortly after that, another broker starts writing errors like this to its log file:

[2017-09-15 11:25:45,536] WARN [ReplicaFetcherThread-0-3]: Error in fetch to broker 3, request (type=FetchRequest, replicaId=2, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={...}) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 3 was disconnected before the response was read
    at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:93)
    at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:93)
    at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:207)
    at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
    at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:151)
    at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:112)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64)

There are many such messages. At that point, I see the number of open file descriptors on the other broker growing, and eventually it crashes with thousands of messages like this:

[2017-09-15 11:31:23,273] ERROR Error while accepting connection (kafka.network.Acceptor)
java.io.IOException: Too many open files
    at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
    at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
    at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
    at kafka.network.Acceptor.accept(SocketServer.scala:337)
    at kafka.network.Acceptor.run(SocketServer.scala:280)
    at java.lang.Thread.run(Thread.java:745)

The file descriptor limit is set to 128k, and the number of open file descriptors during normal operation is about 8k, so there is a lot of headroom. I'm not sure whether it's the other brokers trying to replicate that kill it, or clients trying to publish messages.

Has anyone seen behavior like this? I'd appreciate any pointers.

Thanks,
Lukas
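
P.S. In case anyone wants to compare numbers: a minimal sketch of how the open vs. maximum descriptor counts can be read from inside a JVM on a Unix-like system, using com.sun.management.UnixOperatingSystemMXBean (plain Java, nothing Kafka-specific; the class name FdCheck is just for illustration):

    import java.lang.management.ManagementFactory;
    import java.lang.management.OperatingSystemMXBean;
    import com.sun.management.UnixOperatingSystemMXBean;

    public class FdCheck {
        public static void main(String[] args) {
            OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
            // The Unix-specific bean exposes descriptor counts on HotSpot/OpenJDK on Unix-like systems.
            if (os instanceof UnixOperatingSystemMXBean) {
                UnixOperatingSystemMXBean unixOs = (UnixOperatingSystemMXBean) os;
                System.out.println("open file descriptors: " + unixOs.getOpenFileDescriptorCount());
                System.out.println("max file descriptors:  " + unixOs.getMaxFileDescriptorCount());
            } else {
                System.out.println("Descriptor counts not available on this platform/JVM.");
            }
        }
    }

The same two values are also exposed as attributes of the java.lang:type=OperatingSystem MBean, so they can be watched over JMX on a running broker without extra code.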