Hi,

I've run into a pretty serious production issue with Kafka 0.8.2, and I'm
wondering what my options are.


ReplicaFetcherThread Error

I have a broker on a 9-node cluster that went down for a couple of hours.
When it came back up, it started spewing constant errors of the following
form:

INFO Reconnect due to socket error:
java.nio.channels.ClosedChannelException (kafka.consumer.SimpleConsumer)
[2015-04-09 22:38:54,580] WARN [ReplicaFetcherThread-0-7], Error in fetch
Name: FetchRequest; Version: 0; CorrelationId: 767; ClientId:
ReplicaFetcherThread-0-7; ReplicaId: 1; MaxWait: 500 ms; MinBytes: 1 bytes;
RequestInfo: [REDACTED] Possible cause: java.io.EOFException: Received -1
when reading from channel, socket has likely been closed.
(kafka.server.ReplicaFetcherThread)


Massive Logging

This produced around 300GB of new logs in a 24-hour period and rendered the
broker completely unresponsive.

This broker hosts about 500 partitions spanning 40 or so topics (all topics
have a replication factor of 3). One topic contains messages up to 100MB in
size. The remaining topics have messages no larger than 10MB.
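
For context, a 100MB limit like that is the sort of per-topic override
that in 0.8.2 is set roughly like this (topic name and ZooKeeper host are
placeholders, not my actual setup):

  # Hypothetical per-topic override allowing ~100MB messages
  bin/kafka-topics.sh --zookeeper zk1:2181 --alter --topic big-messages \
    --config max.message.bytes=104857600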

It appears that I've hit this bug:
https://issues.apache.org/jira/browse/KAFKA-1196


"Leader Flapping"

I can get the broker to come online without the flood of log output by
reducing both max.message.bytes and replica.fetch.max.bytes to roughly
10MB. It then starts resyncing all but the largest topic.
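
For reference, the overrides I applied look roughly like the following
(values are approximate; note that at the broker level the size limit is
spelled message.max.bytes, while max.message.bytes is the per-topic name):

  # server.properties, broker-level overrides (approximate values)
  message.max.bytes=10485760        # ~10MB per message
  replica.fetch.max.bytes=10485760  # ~10MB per partition per fetch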

Unfortunately, it also starts "leader flapping": it continuously acquires
and relinquishes partition leadership. Nothing of note appears in the logs
while this is happening, but the consumer offset checker shows it clearly.
This behavior significantly reduces cluster write throughput, since
producers are constantly failing.
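
For anyone wanting to reproduce the observation, I'm watching leadership
move around roughly like this (ZooKeeper host, topic, and consumer group
are placeholders):

  # Shows the current leader for each partition of the topic
  bin/kafka-topics.sh --zookeeper zk1:2181 --describe --topic my-topic

  # The consumer offset checker invocation I've been watching
  bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker \
    --zkconnect zk1:2181 --group my-consumer-group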

The only workaround I have so far is to leave the broker offline. Is this
a known "catch-22" situation? Is there anything that can be done to fix it?

Many thanks in advance.
