Hi, I've run into a pretty serious production issue with Kafka 0.8.2, and I'm wondering what my options are.
ReplicaFetcherThread Error

I have a broker in a 9-node cluster that went down for a couple of hours. When it came back up, it started spewing a constant stream of errors of the following form:

    INFO Reconnect due to socket error: java.nio.channels.ClosedChannelException (kafka.consumer.SimpleConsumer)
    [2015-04-09 22:38:54,580] WARN [ReplicaFetcherThread-0-7], Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 767; ClientId: ReplicaFetcherThread-0-7; ReplicaId: 1; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [REDACTED]
    Possible cause: java.io.EOFException: Received -1 when reading from channel, socket has likely been closed. (kafka.server.ReplicaFetcherThread)

Massive Logging

This produced around 300GB of new logs in a 24-hour period and rendered the broker completely unresponsive. The broker hosts about 500 partitions across 40 or so topics, all with a replication factor of 3. One topic contains messages up to 100MB in size; the remaining topics have messages no larger than 10MB.

It appears that I've hit this bug: https://issues.apache.org/jira/browse/KAFKA-1196

"Leader Flapping"

I can get the broker to come online without the massive logging by reducing both max.message.bytes and replica.fetch.max.bytes to ~10MB. It then starts resyncing all but the largest topic. Unfortunately, it also starts "leader flapping": it continuously acquires and relinquishes partition leadership. Nothing of note appears in the logs while this is happening, but the consumer offset checker shows it clearly. This behavior significantly reduces cluster write throughput, since producers are constantly failing. The only workaround I have so far is to leave the broker off entirely.

Is this a known "catch-22" situation? Is there anything that can be done to fix it? Many thanks in advance.
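For reference, the temporary workaround described above corresponds to broker overrides along these lines (values are illustrative; note that at the broker level the message-size limit is named message.max.bytes, while max.message.bytes is the per-topic override):

```properties
# server.properties -- sketch of the ~10MB workaround, not a recommended permanent setting.
# Largest topic needs ~100MB messages, so this intentionally excludes it from resync.
message.max.bytes=10485760        # ~10MB broker-wide message size limit
replica.fetch.max.bytes=10485760  # must be >= the largest message a replica may fetch,
                                  # otherwise replication of larger messages stalls (KAFKA-1196 territory)
```

The leadership churn itself can be observed with the offset-checker tool shipped with 0.8.2, e.g. `bin/kafka-consumer-offset-checker.sh --zookeeper <zk> --group <group>` (exact flags depend on your deployment).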