Hey! I've run into something concerning in our production cluster. I believe I posted this question to the mailing list previously (http://mail-archives.apache.org/mod_mbox/kafka-users/201609.mbox/browser), but the problem has become considerably more serious.
We've been fighting issues where Kafka 0.10.0.1 hits its max file descriptor limit. Our limit is set to ~16k, and under normal operation the broker holds steady at around 4k open files. Occasionally Kafka rolls a new log segment, which typically takes a few milliseconds. Sometimes, however, a roll takes a considerable amount of time, anywhere from 40 seconds to over a minute. When this happens, Kafka does not seem to release connections, and we end up with thousands of client connections stuck in CLOSE_WAIT, which pile up and blow past our max file descriptor limit, all within the span of about a minute. Our logs look like this:

[2017-01-08 01:10:17,117] INFO Rolled new log segment for 'MyTopic-8' in 41122 ms. (kafka.log.Log)
[2017-01-08 01:10:32,550] INFO Rolled new log segment for 'MyTopic-4' in 1 ms. (kafka.log.Log)
[2017-01-08 01:11:10,039] INFO [Group Metadata Manager on Broker 4]: Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager)
[2017-01-08 01:19:02,877] ERROR Error while accepting connection (kafka.network.Acceptor)
java.io.IOException: Too many open files
        at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
        at kafka.network.Acceptor.accept(SocketServer.scala:323)
        at kafka.network.Acceptor.run(SocketServer.scala:268)
        at java.lang.Thread.run(Thread.java:745)
[2017-01-08 01:19:02,877] ERROR Error while accepting connection (kafka.network.Acceptor)
java.io.IOException: Too many open files
        at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
        ... (the same error and stack trace repeat from here)

And then Kafka crashes. Has anyone seen slow log segment rolls like this? Any ideas on how to track down what could be causing it? (A rough sketch of how we watch the descriptor counts is at the end of this mail.)

Thanks!
Stephen
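P.S. For anyone who wants to watch for this on their own brokers: below is a minimal sketch of how the open-FD count and CLOSE_WAIT socket count can be polled so the leak is visible before the "Too many open files" errors start. This is illustrative, not our production tooling; the broker-detection heuristic (a java process whose command line mentions kafka.Kafka) and the 10-second polling interval are assumptions you may need to adjust. It uses the psutil library and needs enough privileges to inspect the broker process.

#!/usr/bin/env python
import time
import psutil

def find_broker_pid():
    # Heuristic (assumption): the broker is the java process whose
    # command line mentions kafka.Kafka. Adjust for your deployment.
    for proc in psutil.process_iter(['name', 'cmdline']):
        cmdline = ' '.join(proc.info['cmdline'] or [])
        if proc.info['name'] == 'java' and 'kafka.Kafka' in cmdline:
            return proc.pid
    raise RuntimeError('no Kafka broker process found')

def main():
    broker = psutil.Process(find_broker_pid())
    while True:
        # Count TCP sockets owned by the broker that are stuck in CLOSE_WAIT.
        close_wait = sum(
            1 for conn in psutil.net_connections(kind='tcp')
            if conn.pid == broker.pid and conn.status == psutil.CONN_CLOSE_WAIT
        )
        # num_fds() is the broker's total open file descriptors (Unix only).
        print('open fds: %5d   CLOSE_WAIT: %5d' % (broker.num_fds(), close_wait))
        time.sleep(10)

if __name__ == '__main__':
    main()

Under normal operation the first number should sit near our steady-state ~4k; when a slow segment roll happens, you should see the CLOSE_WAIT count climb sharply in the minute before the accept errors appear.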