Hey!

I've run into something concerning in our production cluster. I believe I
posted this question to the mailing list previously (
http://mail-archives.apache.org/mod_mbox/kafka-users/201609.mbox/browser),
but the problem has become considerably more serious.

We've been fighting issues where Kafka 0.10.0.1 hits its max file
descriptor limit.  Our limit is set to ~16k, and under normal operation it
holds steady around 4k open files.
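
In case it's useful for anyone trying to reproduce this, here is a minimal
sketch of how one could watch the broker's descriptor count over JMX. It
assumes the broker was started with JMX_PORT set (the host and port 9999
below are placeholders) and a HotSpot/OpenJDK runtime where the platform OS
MXBean is a com.sun.management.UnixOperatingSystemMXBean:

import java.lang.management.ManagementFactory;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;
import com.sun.management.UnixOperatingSystemMXBean;

public class BrokerFdWatch {
    public static void main(String[] args) throws Exception {
        // Placeholder: the broker's JMX endpoint (requires JMX_PORT to be
        // set in the broker's environment; 9999 is just an example).
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        MBeanServerConnection conn = jmxc.getMBeanServerConnection();
        // The platform OS MXBean exposes open/max file descriptor counts
        // on Unix-like systems.
        UnixOperatingSystemMXBean os = ManagementFactory.newPlatformMXBeanProxy(
                conn, ManagementFactory.OPERATING_SYSTEM_MXBEAN_NAME,
                UnixOperatingSystemMXBean.class);
        while (true) {
            long open = os.getOpenFileDescriptorCount();
            long max = os.getMaxFileDescriptorCount();   // our ulimit, ~16k
            System.out.printf("open fds: %d / %d%n", open, max);
            Thread.sleep(10_000);   // sample every 10 seconds
        }
    }
}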

Kafka rolls new log segments as part of normal operation, and a roll
typically takes on the order of a few milliseconds.  Occasionally, though,
a roll takes a considerable amount of time, anywhere from 40 seconds to
over a minute.  When this happens, it seems like connections are not
released by Kafka, and we end up with thousands of client connections stuck
in CLOSE_WAIT, which pile up until they exceed our max file descriptor
limit.  This all happens in the span of about a minute.
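
As a rough way to quantify the pile-up, here is a sketch of counting
CLOSE_WAIT sockets by parsing /proc on the broker host. This is
Linux-specific and counts CLOSE_WAIT system-wide rather than just the
broker's sockets, but it's enough to catch the spike:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class CloseWaitCount {
    public static void main(String[] args) throws IOException {
        long closeWait = 0;
        // /proc/net/tcp and /proc/net/tcp6 list one socket per line; the
        // 4th column ("st") is the TCP state in hex, and 08 is CLOSE_WAIT.
        for (String file : new String[] {"/proc/net/tcp", "/proc/net/tcp6"}) {
            List<String> lines = Files.readAllLines(Paths.get(file));
            for (String line : lines.subList(1, lines.size())) {   // skip header row
                String[] cols = line.trim().split("\\s+");
                if (cols.length > 3 && cols[3].equals("08")) {
                    closeWait++;
                }
            }
        }
        System.out.println("sockets in CLOSE_WAIT: " + closeWait);
    }
}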

Our logs look like this:

[2017-01-08 01:10:17,117] INFO Rolled new log segment for 'MyTopic-8' in 41122 ms. (kafka.log.Log)
[2017-01-08 01:10:32,550] INFO Rolled new log segment for 'MyTopic-4' in 1 ms. (kafka.log.Log)
[2017-01-08 01:11:10,039] INFO [Group Metadata Manager on Broker 4]: Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager)
[2017-01-08 01:19:02,877] ERROR Error while accepting connection (kafka.network.Acceptor)
java.io.IOException: Too many open files
        at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
        at kafka.network.Acceptor.accept(SocketServer.scala:323)
        at kafka.network.Acceptor.run(SocketServer.scala:268)
        at java.lang.Thread.run(Thread.java:745)
[2017-01-08 01:19:02,877] ERROR Error while accepting connection (kafka.network.Acceptor)
java.io.IOException: Too many open files
        at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
        at kafka.network.Acceptor.accept(SocketServer.scala:323)
        at kafka.network.Acceptor.run(SocketServer.scala:268)
        at java.lang.Thread.run(Thread.java:745)
...


And then Kafka crashes.

Has anyone seen this kind of slow log segment roll before?  Any ideas on
how to track down what could be causing it?

Thanks!
Stephen
