Hi Jiangjie,

There is nothing of note in the controller log. I've attached that log along with the state change log in the following gist:
https://gist.github.com/banker/78b56a3a5246b25ace4c
This represents a 2-hour period on April 15th. Since I've disabled the broker in question (on April 15th), there's been no change to the state-change logs across the entire cluster. While the broker was on, as you can see, state-change.log was growing massively, and the broker was exhibiting the "flapping" I've described.

Note that I have auto.leader.rebalance.enable set to true for the entire cluster. Are there any known bugs associated with this feature?

Many thanks.

On Thu, Apr 16, 2015 at 2:19 PM, Jiangjie Qin <j...@linkedin.com.invalid> wrote:

> It seems there are many different symptoms you see...
> Maybe we can start from leader flapping issue. Any findings in controller
> log?
>
> Jiangjie (Becket) Qin
>
>
>
> On 4/16/15, 12:09 PM, "Kyle Banker" <kyleban...@gmail.com> wrote:
>
> >Hi,
> >
> >I've run into a pretty serious production issue with Kafka 0.8.2, and I'm
> >wondering what my options are.
> >
> >
> >ReplicaFetcherThread Error
> >
> >I have a broker on a 9-node cluster that went down for a couple of hours.
> >When it came back up, it started spewing constant errors of the following
> >form:
> >
> >INFO Reconnect due to socket error:
> >java.nio.channels.ClosedChannelException (kafka.consumer.SimpleConsumer)
> >[2015-04-09 22:38:54,580] WARN [ReplicaFetcherThread-0-7], Error in fetch
> >Name: FetchRequest; Version: 0; CorrelationId: 767; ClientId:
> >ReplicaFetcherThread-0-7; ReplicaId: 1; MaxWait: 500 ms; MinBytes: 1 bytes;
> >RequestInfo: [REDACTED] Possible cause: java.io.EOFException: Received -1
> >when reading from channel, socket has likely been closed.
> >(kafka.server.ReplicaFetcherThread)
> >
> >
> >Massive Logging
> >
> >This produced around 300GB of new logs in a 24-hour period and rendered the
> >broker completely unresponsive.
> >
> >This broker hosts about 500 partitions spanning 40 or so topics (all topics
> >have a replication factor of 3). One topic contains messages up to 100MB in
> >size. The remaining topics have messages no larger than 10MB.
> >
> >It appears that I've hit this bug:
> >https://issues.apache.org/jira/browse/KAFKA-1196
> >
> >
> >"Leader Flapping"
> >
> >I can get the broker to come online without logging massively by reducing
> >both max.message.bytes and replica.fetch.max.bytes to ~10MB. It then starts
> >resyncing all but the largest topic.
> >
> >Unfortunately, it also starts "leader flapping." That is, it continuously
> >acquires and relinquishes partition leadership. There is nothing of note in
> >the logs while this is happening, but the consumer offset checker clearly
> >shows this. The behavior significantly reduces cluster write throughput
> >(since producers are constantly failing).
> >
> >The only solution I have is to leave the broker off. Is this a known
> >"catch-22" situation? Is there anything that can be done to fix it?
> >
> >Many thanks in advance.
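
For reference, here is a rough sketch of the arithmetic that might explain why lowering replica.fetch.max.bytes to ~10MB lets the broker come back. It assumes (not confirmed anywhere in this thread) that KAFKA-1196 involves the total size of a single fetch response overflowing a signed 32-bit integer. The helper names are hypothetical, the sizes are the 100MB and 10MB figures quoted above, and per-partition response overhead is ignored.

```python
# Illustrative arithmetic only. The overflow explanation is an assumption,
# and these helpers are hypothetical (they are not part of Kafka).
INT32_MAX = 2**31 - 1  # largest value of a signed 32-bit integer
MB = 1024 * 1024


def worst_case_response_bytes(partitions: int, fetch_max_bytes: int) -> int:
    """Upper bound on one fetch response if every partition returns a full chunk."""
    return partitions * fetch_max_bytes


def max_partitions_per_fetch(fetch_max_bytes: int) -> int:
    """Most partitions one fetch can cover while staying within INT32_MAX."""
    return INT32_MAX // fetch_max_bytes


if __name__ == "__main__":
    for size_mb in (100, 10):
        limit = max_partitions_per_fetch(size_mb * MB)
        print(f"replica.fetch.max.bytes={size_mb}MB: up to {limit} partitions per fetch")
```

Under these assumptions, a 100MB fetch size leaves room for only 20 partitions per fetch response, while a 10MB fetch size leaves room for 204. How many partitions a single ReplicaFetcherThread request actually covers depends on how leadership is spread across the other brokers, which is not stated in this thread, so this is a plausibility check rather than a diagnosis.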