Carl,

It will help if you can list the steps to reproduce this issue starting
from a fresh installation. Your setup, the way it stands, seems to have
gone through some config and state changes.

Thanks,
Neha


On Wed, Feb 5, 2014 at 5:17 PM, Joel Koshy <jjkosh...@gmail.com> wrote:

> On Wed, Feb 05, 2014 at 04:51:16PM -0800, Carl Lerche wrote:
> > So, I tried enabling debug logging; I also made some tweaks to the
> > config (which I probably shouldn't have) and craziness happened.
> >
> > First, some more context. Besides the very high network traffic, we
> > were seeing some other issues that we were not focusing on yet.
> >
> > * Even though the log retention was set to 50GB & 24 hours, data logs
> > were getting cleaned up far quicker. I'm not entirely sure how much
> > quicker, but there was definitely less than 12 hours and 1GB of data
> > retained (the settings involved are sketched after this list).
> >
> > * Kafka was not properly balanced. We had 3 servers, and only 2 of
> > them were partition leaders. One server was a replica for all
> > partitions. We tried to run a rebalance command, but it did not work.
> > We were going to investigate later.
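> >
> > For reference, the retention knobs involved look roughly like this in
> > server.properties (the byte value just illustrates our 50GB target, and
> > log.retention.bytes applies per partition rather than per topic, so the
> > size limit can kick in sooner than a topic-wide total would suggest):
> >
> >   log.retention.hours=24
> >   log.retention.bytes=53687091200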
>
> Were any of the brokers down for an extended period? If the preferred
> replica election command failed it could be because the preferred
> replica was catching up (which could explain the higher than expected
> network traffic). Do you monitor the under-replicated partitions count
> on your cluster? If you have that data it could help confirm this.
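>
> In case it helps, on 0.8 the election can be kicked off manually with
> something like the following (the ZooKeeper connect string is a
> placeholder for your own):
>
>   bin/kafka-preferred-replica-election.sh --zookeeper zkhost:2181
>
> and each broker reports the under-replicated partition count over JMX
> through the ReplicaManager UnderReplicatedPartitions gauge.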
>
> Joel
>
> >
> > So, after restarting all the Kafka brokers, something happened with the
> > offsets. The offsets that our consumers had no longer existed. It
> > looks like somehow all the contents were lost? The logs show many
> > exceptions like:
> >
> > `Request for offset 770354 but we only have log segments in the range
> > 759234 to 759838.`
> >
> > So, I reset all the consumer offsets to the head of the queue as I did
> > not know of anything better to do. Once the dust settled, all the
> > issues we were seeing vanished. Communication between Kafka nodes
> > appears to be normal, Kafka was able to rebalance, and hopefully log
> > retention will be normal.
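> >
> > (For what it's worth, the 0.8 high-level consumer can also be told to
> > recover from an out-of-range offset on its own via the consumer config:
> >
> >   auto.offset.reset=smallest
> >
> > which makes it fall back to the oldest available offset instead of
> > failing; "largest" skips ahead to the newest.)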
> >
> > I am unsure what happened or how to get more debug information.
> >
> > On Wed, Feb 5, 2014 at 12:31 PM, Jay Kreps <jay.kr...@gmail.com> wrote:
> > > Can you enable DEBUG logging in log4j and see what requests are coming in?
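> > > Something along these lines in config/log4j.properties (assuming a stock
> > > install layout) should turn it on, after a broker restart:
> > >
> > >   log4j.logger.kafka=DEBUG
> > >   # or, to log every individual request:
> > >   log4j.logger.kafka.request.logger=TRACE, requestAppender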
> > >
> > > -Jay
> > >
> > >
> > > On Tue, Feb 4, 2014 at 9:51 PM, Carl Lerche <m...@carllerche.com> wrote:
> > >
> > >> Hi Jay,
> > >>
> > >> I do not believe that I have changed the replica.fetch.wait.max.ms
> > >> setting. Here I have included the kafka config as well as a snapshot
> > >> of jnettop from one of the servers.
> > >>
> > >> https://gist.github.com/carllerche/4f2cf0f0f6d1e891f482
> > >>
> > >> The bottom row (89.9K/s) is the producer (it lives on a Kafka server).
> > >> The top two rows are Kafka brokers on other servers; you can see the
> > >> combined throughput is ~80MB/s.
> > >>
> > >> On Tue, Feb 4, 2014 at 9:36 PM, Jay Kreps <jay.kr...@gmail.com> wrote:
> > >> > No this is not normal.
> > >> >
> > >> > Checking twice a second (using 500ms default) for new data shouldn't
> > >> > cause high network traffic (that should be like < 1KB of overhead). I
> > >> > don't think that explains things. Is it possible that setting has been
> > >> > overridden?
> > >> >
> > >> > -Jay
> > >> >
> > >> >
> > >> > On Tue, Feb 4, 2014 at 9:25 PM, Guozhang Wang <wangg...@gmail.com> wrote:
> > >> >
> > >> >> Hi Carl,
> > >> >>
> > >> >> For each partition the follower will also fetch data from the leader
> > >> >> replica, even if there is no new data on the leader.
> > >> >>
> > >> >> One thing you can try is to increase replica.fetch.wait.max.ms (default
> > >> >> value 500ms) so that the followers' fetch request frequency to the
> > >> >> leader is reduced, and see if that has some effect on the traffic.
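> > >> >> For example, something like this in server.properties on each broker
> > >> >> (the exact value is just an illustration; keep it below
> > >> >> replica.lag.time.max.ms so the followers are not considered lagging):
> > >> >>
> > >> >>   replica.fetch.wait.max.ms=2000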
> > >> >>
> > >> >> Guozhang
> > >> >>
> > >> >>
> > >> >> On Tue, Feb 4, 2014 at 8:46 PM, Carl Lerche <m...@carllerche.com> wrote:
> > >> >>
> > >> >> > Hello,
> > >> >> >
> > >> >> > I'm running a 0.8.0 Kafka cluster of 3 servers. The service it is for
> > >> >> > is not in full production yet, so the data written to the cluster is
> > >> >> > minimal (it seems to average between 100kb/s -> 300kb/s per server). I
> > >> >> > have configured Kafka to have 3 replicas. I am noticing that each
> > >> >> > Kafka server is talking to all the others at a data rate of 40MB/s for
> > >> >> > each of the other servers (so, a total of 80MB/s per server). This
> > >> >> > communication is constant.
> > >> >> >
> > >> >> > Is this normal? This seems like very strange behavior and I'm not
> > >> >> > exactly sure how to debug it.
> > >> >> >
> > >> >> > Thanks,
> > >> >> > Carl
> > >> >> >
> > >> >>
> > >> >>
> > >> >>
> > >> >> --
> > >> >> -- Guozhang
> > >> >>
> > >>
>
>
