Hello,

I'm dealing with a strange issue in production and I'm running out of
ideas about what to do next.

It's a 3-node cluster running Kafka 0.11.0.1, with most topics having a
replication factor of 2. At some point, the broker that is about to die
shrinks the ISR for a few partitions to just itself:

[2017-09-15 11:25:29,104] INFO Partition [...,12] on broker 3:
Shrinking ISR from 3,2 to 3 (kafka.cluster.Partition)
[2017-09-15 11:25:29,107] INFO Partition [...,8] on broker 3:
Shrinking ISR from 3,1 to 3 (kafka.cluster.Partition)
[2017-09-15 11:25:29,108] INFO Partition [...,38] on broker 3:
Shrinking ISR from 3,2 to 3 (kafka.cluster.Partition)
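
For what it's worth, this is roughly how I'm checking the ISR state from
the 0.11 AdminClient while this is going on (the topic name and the
bootstrap address below are just placeholders):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

// Prints leader and ISR for every partition of one topic.
public class IsrCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1.example.com:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singleton("my-topic"))
                    .all().get().get("my-topic");
            desc.partitions().forEach(p -> System.out.println(
                    "partition " + p.partition() + " leader=" + p.leader() + " isr=" + p.isr()));
        }
    }
}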

Slightly after that, another broker starts writing errors like this to
its log file:

[2017-09-15 11:25:45,536] WARN [ReplicaFetcherThread-0-3]: Error in
fetch to broker 3, request (type=FetchRequest, replicaId=2,
maxWait=500, minBytes=1, maxBytes=10485760, fetchData={...})
(kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 3 was disconnected before the response was read
        at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:93)
        at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:93)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:207)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:151)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:112)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64)

There are many such messages. At that point, I see the number of open
file descriptors on the first broker (the one whose ISR shrank) growing,
and eventually it crashes with thousands of messages like this:

[2017-09-15 11:31:23,273] ERROR Error while accepting connection
(kafka.network.Acceptor)
java.io.IOException: Too many open files
        at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
        at kafka.network.Acceptor.accept(SocketServer.scala:337)
        at kafka.network.Acceptor.run(SocketServer.scala:280)
        at java.lang.Thread.run(Thread.java:745)

The file descriptor limit is set to 128k, and the number of open file
descriptors during normal operation is about 8k, so there is a lot of
headroom.
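
To keep an eye on it, I'm polling the broker's descriptor count over JMX
with something like this (the hostname and JMX port are placeholders; it
assumes the broker was started with JMX enabled, e.g. via JMX_PORT, and
runs on a Unix JVM where these OS attributes are exposed):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Polls OpenFileDescriptorCount / MaxFileDescriptorCount from the broker's
// java.lang:type=OperatingSystem MBean every 10 seconds.
public class FdPoller {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker3.example.com:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            ObjectName os = new ObjectName("java.lang:type=OperatingSystem");
            while (true) {
                Object open = mbsc.getAttribute(os, "OpenFileDescriptorCount");
                Object max = mbsc.getAttribute(os, "MaxFileDescriptorCount");
                System.out.println(System.currentTimeMillis() + " open=" + open + " max=" + max);
                Thread.sleep(10_000);
            }
        }
    }
}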

I'm not sure whether it's the other brokers trying to replicate from it
that kill it, or clients trying to publish messages.
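
To try to narrow that down, I'm thinking of grouping the broker's
established inbound connections by remote address with a quick-and-dirty
tool like the sketch below (run on the broker host; it assumes Linux,
IPv4 traffic, a little-endian machine, and the default listener port
9092):

import java.io.IOException;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Counts ESTABLISHED connections to the broker port per remote IP, so broker
// peers and clients can be told apart by their addresses.
public class ConnectionsByPeer {
    private static final int BROKER_PORT = 9092;

    public static void main(String[] args) throws IOException {
        Map<String, Integer> byPeer = new HashMap<>();
        List<String> lines = Files.readAllLines(Paths.get("/proc/net/tcp"), StandardCharsets.US_ASCII);
        for (String line : lines.subList(1, lines.size())) {      // skip the header row
            String[] f = line.trim().split("\\s+");
            String[] local = f[1].split(":");                     // "HEXADDR:HEXPORT"
            String[] remote = f[2].split(":");
            boolean established = f[3].equals("01");
            if (!established || Integer.parseInt(local[1], 16) != BROKER_PORT) {
                continue;
            }
            byPeer.merge(hexToIp(remote[0]), 1, Integer::sum);
        }
        byPeer.forEach((ip, n) -> System.out.println(ip + " -> " + n + " connections"));
    }

    // /proc/net/tcp dumps the IPv4 address as the raw in-memory word, so the
    // bytes come out reversed on little-endian (x86) machines.
    private static String hexToIp(String hex) throws IOException {
        byte[] b = new byte[4];
        for (int i = 0; i < 4; i++) {
            b[i] = (byte) Integer.parseInt(hex.substring(6 - 2 * i, 8 - 2 * i), 16);
        }
        return InetAddress.getByAddress(b).getHostAddress();
    }
}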

Has anyone seen behavior like this? I'd appreciate any pointers.

Thanks,

Lukas
