Hi Jiangjie,

There is nothing of note in the controller log. I've attached that log
along with the state-change log in the following gist:
https://gist.github.com/banker/78b56a3a5246b25ace4c

This represents a 2-hour period on April 15th.

Since I disabled the broker in question (on April 15th), there has been no
change to the state-change logs across the entire cluster. While the broker
was running, as you can see, state-change.log was growing massively, and the
broker was exhibiting the "flapping" I've described.

Note that I have auto.leader.rebalance.enable set to true for the entire
cluster. Are there any known bugs associated with this feature?
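For reference, these are the knobs that control that feature in server.properties. This is a sketch of the relevant settings, not a dump of my actual config; the values shown are, as far as I know, the 0.8.2 defaults:

```properties
# server.properties (sketch; values are believed to be the 0.8.2 defaults)

# Controller periodically moves leadership back to the preferred replica
auto.leader.rebalance.enable=true

# Rebalance fires when a broker's leadership imbalance exceeds this percentage
leader.imbalance.per.broker.percentage=10

# How often (in seconds) the controller checks for imbalance
leader.imbalance.check.interval.seconds=300
```

If the imbalance check were interacting badly with a broker that keeps dropping out of the ISR, I could imagine it repeatedly handing leadership to that broker and taking it back, which would look exactly like the flapping I'm seeing.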

Many thanks.

On Thu, Apr 16, 2015 at 2:19 PM, Jiangjie Qin <j...@linkedin.com.invalid>
wrote:

> It seems there are many different symptoms you see...
> Maybe we can start from leader flapping issue. Any findings in controller
> log?
>
> Jiangjie (Becket) Qin
>
>
>
> On 4/16/15, 12:09 PM, "Kyle Banker" <kyleban...@gmail.com> wrote:
>
> >Hi,
> >
> >I've run into a pretty serious production issue with Kafka 0.8.2, and I'm
> >wondering what my options are.
> >
> >
> >ReplicaFetcherThread Error
> >
> >I have a broker on a 9-node cluster that went down for a couple of hours.
> >When it came back up, it started spewing constant errors of the following
> >form:
> >
> >INFO Reconnect due to socket error:
> >java.nio.channels.ClosedChannelException (kafka.consumer.SimpleConsumer)
> >[2015-04-09 22:38:54,580] WARN [ReplicaFetcherThread-0-7], Error in fetch
> >Name: FetchRequest; Version: 0; CorrelationId: 767; ClientId:
> >ReplicaFetcherThread-0-7; ReplicaId: 1; MaxWait: 500 ms; MinBytes: 1
> >bytes;
> >RequestInfo: [REDACTED] Possible cause: java.io.EOFException: Received -1
> >when reading from channel, socket has likely been closed.
> >(kafka.server.ReplicaFetcherThread)
> >
> >
> >Massive Logging
> >
> >This produced around 300GB of new logs in a 24-hour period and rendered
> >the
> >broker completely unresponsive.
> >
> >This broker hosts about 500 partitions spanning 40 or so topics (all
> >topics
> >have a replication factor of 3). One topic contains messages up to 100MB
> >in
> >size. The remaining topics have messages no larger than 10MB.
> >
> >It appears that I've hit this bug:
> >https://issues.apache.org/jira/browse/KAFKA-1196
> >
> >
> >"Leader Flapping"
> >
> >I can get the broker to come online without logging massively by reducing
> >both max.message.bytes and replica.fetch.max.bytes to ~10MB. It then
> >starts
> >resyncing all but the largest topic.
> >
> >Unfortunately, it also starts "leader flapping." That is, it continuously
> >acquires and relinquishes partition leadership. There is nothing of note
> >in
> >the logs while this is happening, but the consumer offset checker clearly
> >shows this. The behavior significantly reduces cluster write throughput
> >(since producers are constantly failing).
> >
> >The only solution I have is to leave the broker off. Is this a known
> >"catch-22" situation? Is there anything that can be done to fix it?
> >
> >Many thanks in advance.
>
>
