Re: Consumer lag lies - orphaned offsets?

Todd Palino Thu, 04 Jun 2015 18:57:07 -0700

I just sent out a separate email about the project that I've been working
on, Burrow, to change the way we're monitoring consumer status.


Like Joel said, the mbeans can be a little tricky to work with. Internally
at LinkedIn, we've always depended more on using a tool like the Consumer
Offset Checker, where we check the consumer offsets in Zookeeper or Kafka,
check the broker's offsets, and calculate the difference as lag. This has
worked, but with wildcard consumers like mirror maker we had to resort to
spot checking one or two topics, since we couldn't feasibly monitor 20k or
30k partitions per consumer. We also had to set thresholds for lag, which
could generate false alerts when there is a spike in traffic, or could take
a long time to alert.

Burrow attacks the problem a little differently. It still monitors the
consumer and broker offsets, but it monitors the consumer offsets by
consuming the __consumer_offsets topic, so it checks every consumer and
every partition (instead of spot checking). It then looks at the consumer
over a sliding window (we're working with 10 offset commits, or 10 minutes,
right now) and uses a set of rules to determine whether the consumer is in
a good or bad state (described at
https://github.com/linkedin/Burrow/wiki/Consumer-Lag-Evaluation-Rules). You
can then check the consumer via an HTTP call, or monitor it continuously
and send emails and/or calls to an external HTTP endpoint (like another
alert/notification system).

It's up on GitHub at https://github.com/linkedin/Burrow if you would like
to take a look and give it a try.

-Todd


On Thu, Jun 4, 2015 at 5:52 PM, Joel Koshy <[email protected]> wrote:

> Hi Otis,
>
> Yes this is a limitation in the old consumer. i.e., a number of
> per-topic/partition mbeans remain even on a rebalance. Those need to
> be de-registered. So if you stop consuming from some partition after a
> rebalance that lag mbean currently remain which is why it remains
> flat.  This is a known issue.
>
> On the restart, the lag goes down to zero because - well the mbeans
> get recreated and the consumer starts fetching. If the fetch request
> reads up to the end of the log then the mbean will report zero. Your
> actual committed offset may be behind though which is why your true
> lag is > 0.
>
> The lag mbeans are useful, but have a number of limitations - it
> depends on active fetches in progress; it also does not exactly
> correspond with your actual processed (and committed) offset. The most
> reliable way to monitor application lag is to use the committed
> offsets and the current log end offsets. Todd has been doing a lot of
> interesting work in making lag monitoring less painful and can comment
> more.
>
> Joel
>
> On Thu, Jun 04, 2015 at 04:55:44PM -0400, Otis Gospodnetić wrote:
> > Hi,
> >
> > On Thu, Jun 4, 2015 at 4:26 PM, Scott Reynolds <[email protected]>
> wrote:
> >
> > > I believe the JMX metrics reflect the consumer PRIOR to committing
> offsets
> > > to Kafka / Zookeeper. But when you query from the command line using
> the
> > > kafka tools, you are just getting the committed offsets.
> > >
> >
> > Even if that were the case, and maybe it is, it doesn't explain why the
> > ConsumerLag in JMX often remains *completely constant*.....forever...
> until
> > the consumer is restarted.  You see what I mean?
> >
> > Otis
> > --
> > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > Solr & Elasticsearch Support * http://sematext.com/
> >
> >
> >
> > > On Thu, Jun 4, 2015 at 1:23 PM, Otis Gospodnetic <
> > > [email protected]
> > > > wrote:
> > >
> > > > Hi,
> > > >
> > > > Here's something potentially useful.
> > > >
> > > > 1) Before: https://apps.sematext.com/spm-reports/s/eQ9WhLegW9 - the
> > > "flat
> > > > Lag situation"
> > > >
> > > > 2) I restarted the consumer whose lag is shown in the above graph
> > > >
> > > > 3) After restart: https://apps.sematext.com/spm-reports/s/4YGkcUP9ms
> -
> > > NO
> > > > lag at all!?
> > > >
> > > > So that 81560 Lag value that was stuck in JMX is gone.  Went down to
> 0.
> > > > Kind of makes sense - the whole consumer was restarted, consumer/java
> > > > process was restarted, everything that was in JMX got reset, and if
> there
> > > > is truly no consumer lag it makes sense that the values in JMX are 0.
> > > >
> > > > HOWEVER, is the Lag *really* always *exactly* 0?  No way.  Look what
> > > Offset
> > > > Checker shows for this one consumer:
> > > >
> > > > af_servers      spm_cluster_topic-new-cdh      18  220551962
> > > > 220586078       34116
> > > > af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> > > > af_servers      spm_cluster_topic-new-cdh      19  161936440
> > > > 161960377       23937
> > > > af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> > > > af_servers      spm_cluster_topic-new-cdh      20  248308642
> > > > 248340350       31708
> > > > af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> > > > af_servers      spm_cluster_topic-new-cdh      21  259901355
> > > > 259934911       33556
> > > > af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> > > > af_servers      spm_cluster_topic-new-cdh      22  205274547
> > > > 205296950       22403
> > > > af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> > > > af_servers      spm_cluster_topic-new-cdh      23  167756654
> > > > 167780028       23374
> > > > af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> > > > af_servers      spm_cluster_topic-new-cdh      24  357517989
> > > > 357574627       56638
> > > > af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> > > > af_servers      spm_cluster_topic-new-cdh      25  194313232
> > > > 194338154       24922
> > > > af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> > > >
> > > > You can see Lag is NOT == 0 for any of the partitions!
> > > >
> > > > And yet, look what JMX is showing now -- all Lag for all partitions
> is
> > > > supposedly 0. Always:
> > > >
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=17
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=18
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=18
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=18
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=18
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=19
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=19
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=19
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=19
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=20
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=20
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=20
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=20
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=21
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=21
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=21
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=21
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=22
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=22
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=22
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=22
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=23
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=23
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=23
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=23
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=24
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=24
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=24
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=24
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=25
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=25
> > > >       Value java.lang.Object = 0
> > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=25
> > > >       Value java.lang.Object = 0
> > > >
> > > > Somebody's lying here. :)
> > > >
> > > > Any ideas?
> > > >
> > > > Thanks,
> > > > Otis
> > > > --
> > > > Monitoring * Alerting * Anomaly Detection * Centralized Log
> Management
> > > > Solr & Elasticsearch Support * http://sematext.com/
> > > >
> > > >
> > > >
> > > > On Thu, Jun 4, 2015 at 3:54 PM, Otis Gospodnetic <
> > > > [email protected]
> > > > > wrote:
> > > >
> > > > > Hi Jason,
> > > > >
> > > > > (note: Kafka 0.8.2. GA)
> > > > > Got some new info below!  Could be a Kafka metrics bug....
> > > > >
> > > > > On Thu, Jun 4, 2015 at 2:11 PM, Jason Rosenberg <[email protected]>
> > > > wrote:
> > > > >
> > > > >> I assume you are looking at a 'MaxLag' metric, which reports the
> worst
> > > > >> case
> > > > >> lag over a set of partitions.
> > > > >
> > > > >
> > > > > No, we're looking at MBeans that look like this one:
> > > > >
> > > > >
> > > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=10
> > > > >       Value java.lang.Object = 0
> > > > >
> > > > >
> > > > >> Are you consuming multiple partitions, and maybe one of them is
> stuck?
> > > > >>
> > > > >
> > > > > Don't think so...  Maybe what we are seeing is a Kafka bug.
> > > > >
> > > > > Here is what we just discovered:
> > > > >
> > > > > Dumped JMX on the consumer and we see this:
> > > > >
> > > > >
> > > > >
> > > >
> > >
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=24
> > > > >       Value java.lang.Object = 81560
> > > > >
> > > > > This 81560 is also what we see in SPM - see the flat line here:
> > > > > https://apps.sematext.com/spm-reports/s/eQ9WhLegW9
> > > > > (you can hover over a datapoint on that 81K line to see server
> name,
> > > > > topic, and partition)
> > > > >
> > > > > This 81560 is just not going down.  If I look at JMX in 5 minutes,
> it
> > > > will
> > > > > show the same value - the ConsumerLag of 81560!
> > > > >
> > > > > BUT, this gives different numbers:
> > > > >
> > > > >  /usr/lib/kafka_2.8.0-0.8.1.1/bin/kafka-run-class.sh
> > > > > kafka.tools.ConsumerOffsetChecker --zkconnect localhost:2181
> --group
> > > > > af_servers | grep spm_cluster_topic-new-cdh
> > > > >
> > > > > af_servers      spm_cluster_topic-new-cdh      24  *355209634*
> > > > > *355249858*       40224
> > > > > af_servers_spm-afs-6.prod.sematext-1433430424202-e366dfdf-0
> > > > >
> > > > > The delta between the bolded numbers is NOT 81560.  And if I run
> this
> > > > > command N times the delta keeps going down, because the consumer is
> > > > > catching up.  Just like you'd expect.
> > > > >
> > > > > But the JMX number remains constant <== could this be a Kafka
> > > metrics/JMX
> > > > > bug?
> > > > >
> > > > > Thanks,
> > > > > Otis
> > > > > --
> > > > > Monitoring * Alerting * Anomaly Detection * Centralized Log
> Management
> > > > > Solr & Elasticsearch Support * http://sematext.com/
> > > > >
> > > > >
> > > > >
> > > > >>
> > > > >> On Tue, Jun 2, 2015 at 4:00 PM, Otis Gospodnetic <
> > > > >> [email protected]
> > > > >> > wrote:
> > > > >>
> > > > >> > Hi,
> > > > >> >
> > > > >> > I've noticed that when we restart our Kafka consumers our
> consumer
> > > lag
> > > > >> > metric sometimes looks "weird".
> > > > >> >
> > > > >> > Here's an example:
> > > https://apps.sematext.com/spm-reports/s/0Hq5zNb4hH
> > > > >> >
> > > > >> > You can see lag go up around 15:00, when some consumers were
> > > > restarted.
> > > > >> > The "weird" thing is that the lag remains flat!
> > > > >> > How could it remain flat if consumers are running? (they have
> enough
> > > > >> juice
> > > > >> > to catch up!)
> > > > >> >
> > > > >> > What I think is happening is this:
> > > > >> > 1) consumers are initially not really lagging
> > > > >> > 2) consumers get stopped
> > > > >> > 3) lag grows
> > > > >> > 4) consumers get started again
> > > > >> > 5) something shifts around...not sure what...
> > > > >> > 6) consumers start consuming, and there is actually no lag, but
> the
> > > > >> offsets
> > > > >> > written to ZK sometime during 3) don't get updated because after
> > > > restart
> > > > >> > consumers are reading from somewhere else, not from partition(s)
> > > whose
> > > > >> lag
> > > > >> > and offset delta jumped during 3)
> > > > >> >
> > > > >> > Oh, and:
> > > > >> > 7) Kafka JMX still exposes all offsets, event those for
> partitions
> > > > that
> > > > >> are
> > > > >> > no longer being read, so the consumer lag metric remains
> > > > constant/flat,
> > > > >> > even though consumers are actually not lagging on partitions
> from
> > > > which
> > > > >> > they are now consuming.
> > > > >> >
> > > > >> > What bugs me is 7), because reading lag info from JMX looks like
> > > it's
> > > > >> > "lying".
> > > > >> >
> > > > >> > Does this sound crazy or reasonable?
> > > > >> >
> > > > >> > If anyone has any comments/advice/suggestions for what one can
> do
> > > > about
> > > > >> > this, I'm all ears!
> > > > >> >
> > > > >> > Thanks,
> > > > >> > Otis
> > > > >> > --
> > > > >> > Monitoring * Alerting * Anomaly Detection * Centralized Log
> > > Management
> > > > >> > Solr & Elasticsearch Support * http://sematext.com/
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > >
> > >
>
>

Re: Consumer lag lies - orphaned offsets?

Reply via email to