Re: Consumer lag lies - orphaned offsets?

Otis Gospodnetic Thu, 04 Jun 2015 12:55:39 -0700

Hi Jason,

(Note: this is Kafka 0.8.2 GA.)
Got some new info below!  Could be a Kafka metrics bug...


On Thu, Jun 4, 2015 at 2:11 PM, Jason Rosenberg <j...@squareup.com> wrote:

> I assume you are looking at a 'MaxLag' metric, which reports the worst case
> lag over a set of partitions.


No, we're looking at MBeans that look like this one:

kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=10
      Value java.lang.Object = 0


> Are you consuming multiple partitions, and maybe one of them is stuck?
>

Don't think so...  Maybe what we are seeing is a Kafka bug.

Here is what we just discovered:

We dumped JMX on the consumer and we see this:

kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=24
      Value java.lang.Object = 81560

This 81560 is also what we see in SPM - see the flat line here:
https://apps.sematext.com/spm-reports/s/eQ9WhLegW9
(you can hover over a datapoint on that 81K line to see server name, topic,
and partition)

This 81560 is just not going down.  If I look at JMX in 5 minutes, it will
show the same value - the ConsumerLag of 81560!

BUT, this gives different numbers:

/usr/lib/kafka_2.8.0-0.8.1.1/bin/kafka-run-class.sh \
  kafka.tools.ConsumerOffsetChecker --zkconnect localhost:2181 \
  --group af_servers | grep spm_cluster_topic-new-cdh

af_servers  spm_cluster_topic-new-cdh  24  *355209634*  *355249858*  40224  af_servers_spm-afs-6.prod.sematext-1433430424202-e366dfdf-0

The delta between the bolded numbers (consumer offset and log size) is
40224, NOT 81560.  And if I run this command N times the delta keeps going
down, because the consumer is catching up.  Just like you'd expect.
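
A minimal sketch of that delta check in Python, in case someone wants to
script it. It assumes the checker row has the columns group, topic,
partition, offset, log size, lag, owner, in that order, and that the
asterisks are only bold markers added by the mail client. The function name
parse_checker_line is made up for this example.

```python
def parse_checker_line(line):
    """Parse one ConsumerOffsetChecker row.

    Assumed column order: group, topic, partition, offset, logSize, lag, owner.
    """
    # Drop the '*' bold markers, then split on any run of whitespace.
    fields = line.replace("*", "").split()
    group, topic, partition, offset, log_size, lag = fields[:6]
    owner = fields[6] if len(fields) > 6 else None
    return {
        "group": group,
        "topic": topic,
        "partition": int(partition),
        "offset": int(offset),
        "log_size": int(log_size),
        "lag": int(lag),
        "owner": owner,
    }


row = parse_checker_line(
    "af_servers      spm_cluster_topic-new-cdh      24  *355209634* "
    "*355249858*       40224 "
    "af_servers_spm-afs-6.prod.sematext-1433430424202-e366dfdf-0"
)

# Lag recomputed from the two offsets should match the reported lag column.
computed_lag = row["log_size"] - row["offset"]
print(computed_lag, computed_lag == row["lag"])
```

For the row above this recomputes 355249858 - 355209634 = 40224, which
matches the lag column and not the 81560 that JMX reports.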

But the JMX number remains constant <== could this be a Kafka metrics/JMX
bug?
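
One way to tell a stuck gauge from real lag would be to sample both sources a
few times and compare how they move. The sketch below is only an
illustration: the helper name gauge_looks_stale is hypothetical, and of the
sample values only 81560 and 40224 come from the numbers above; the later
checker samples are invented to stand in for a consumer that is catching up.

```python
def gauge_looks_stale(jmx_samples, checker_samples):
    """Return True if the JMX gauge never moved while the checker lag did."""
    if len(jmx_samples) < 2 or len(checker_samples) < 2:
        return False  # not enough samples to say anything
    jmx_flat = len(set(jmx_samples)) == 1
    checker_moved = len(set(checker_samples)) > 1
    return jmx_flat and checker_moved


# JMX ConsumerLag read three times, a few minutes apart.
jmx_samples = [81560, 81560, 81560]
# Lag from ConsumerOffsetChecker; only 40224 is a real reading,
# the other two values are illustrative.
checker_samples = [40224, 30000, 20000]

print(gauge_looks_stale(jmx_samples, checker_samples))
```

A True result only says the two sources disagree about movement; it does not
say which one is wrong or why.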

Thanks,
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



>
> On Tue, Jun 2, 2015 at 4:00 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:
>
> > Hi,
> >
> > I've noticed that when we restart our Kafka consumers our consumer lag
> > metric sometimes looks "weird".
> >
> > Here's an example: https://apps.sematext.com/spm-reports/s/0Hq5zNb4hH
> >
> > You can see lag go up around 15:00, when some consumers were restarted.
> > The "weird" thing is that the lag remains flat!
> > How could it remain flat if consumers are running? (they have enough
> > juice to catch up!)
> >
> > What I think is happening is this:
> > 1) consumers are initially not really lagging
> > 2) consumers get stopped
> > 3) lag grows
> > 4) consumers get started again
> > 5) something shifts around...not sure what...
> > 6) consumers start consuming, and there is actually no lag, but the
> > offsets written to ZK sometime during 3) don't get updated because after
> > restart consumers are reading from somewhere else, not from partition(s)
> > whose lag and offset delta jumped during 3)
> >
> > Oh, and:
> > 7) Kafka JMX still exposes all offsets, even those for partitions that
> > are no longer being read, so the consumer lag metric remains
> > constant/flat, even though consumers are actually not lagging on
> > partitions from which they are now consuming.
> >
> > What bugs me is 7), because reading lag info from JMX looks like it's
> > "lying".
> >
> > Does this sound crazy or reasonable?
> >
> > If anyone has any comments/advice/suggestions for what one can do about
> > this, I'm all ears!
> >
> > Thanks,
> > Otis
> > --
> > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > Solr & Elasticsearch Support * http://sematext.com/
> >
>
