I believe the JMX metrics reflect the consumer's position PRIOR to committing offsets to Kafka / ZooKeeper. But when you query from the command line using the Kafka tools, you are just getting the committed offsets.
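That distinction - the fetcher's in-flight position versus the last committed offset - can be sketched in a few lines of Python. The numbers reuse partition 24 from further down the thread; the fetch offset is a hypothetical value, since the consumer's in-memory fetch position is never shown anywhere in the thread:

```python
# Two different "lag" definitions for the same partition.
# The fetcher-side JMX metric is derived from the consumer's current fetch
# position, while ConsumerOffsetChecker uses the last *committed* offset.

log_end_offset = 355_249_858    # latest offset in the partition (logSize column)
committed_offset = 355_209_634  # last offset committed to ZooKeeper (Offset column)
fetch_offset = 355_230_000      # hypothetical in-memory fetch position (assumption)

committed_lag = log_end_offset - committed_offset  # what the CLI tool reports
fetch_lag = log_end_offset - fetch_offset          # what the fetcher-side metric tracks

print(committed_lag)  # 40224 - matches the ConsumerOffsetChecker row in the thread
print(fetch_lag)      # 19858 - smaller, because fetching runs ahead of commits
```

Because fetching runs ahead of offset commits, the two numbers legitimately differ at any instant; a *stuck* JMX value, though, points at a metric that has stopped updating, not at this difference.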
On Thu, Jun 4, 2015 at 1:23 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:
> Hi,
>
> Here's something potentially useful.
>
> 1) Before: https://apps.sematext.com/spm-reports/s/eQ9WhLegW9 - the "flat
> Lag situation"
>
> 2) I restarted the consumer whose lag is shown in the above graph
>
> 3) After restart: https://apps.sematext.com/spm-reports/s/4YGkcUP9ms - NO
> lag at all!?
>
> So that 81560 Lag value that was stuck in JMX is gone. Went down to 0.
> Kind of makes sense - the whole consumer was restarted, the consumer/Java
> process was restarted, everything that was in JMX got reset, and if there
> is truly no consumer lag it makes sense that the values in JMX are 0.
>
> HOWEVER, is the Lag *really* always *exactly* 0? No way. Look what Offset
> Checker shows for this one consumer (columns: Group, Topic, Pid, Offset,
> logSize, Lag, Owner):
>
> af_servers spm_cluster_topic-new-cdh 18 220551962 220586078 34116 af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> af_servers spm_cluster_topic-new-cdh 19 161936440 161960377 23937 af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> af_servers spm_cluster_topic-new-cdh 20 248308642 248340350 31708 af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> af_servers spm_cluster_topic-new-cdh 21 259901355 259934911 33556 af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> af_servers spm_cluster_topic-new-cdh 22 205274547 205296950 22403 af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> af_servers spm_cluster_topic-new-cdh 23 167756654 167780028 23374 af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> af_servers spm_cluster_topic-new-cdh 24 357517989 357574627 56638 af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> af_servers spm_cluster_topic-new-cdh 25 194313232 194338154 24922 af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
>
> You can see Lag is NOT == 0 for any of the partitions!
> And yet, look what JMX is showing now -- all Lag for all partitions is
> supposedly 0. Always:
>
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=17
> Value java.lang.Object = 0
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=18
> Value java.lang.Object = 0
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=18
> Value java.lang.Object = 0
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=18
> Value java.lang.Object = 0
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=18
> Value java.lang.Object = 0
> <snip - the remaining topic/partition combinations, partitions 19 through
> 25 of the same topics, all likewise report Value java.lang.Object = 0>
>
> Somebody's lying here. :)
>
> Any ideas?
>
> Thanks,
> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
> On Thu, Jun 4, 2015 at 3:54 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:
> > Hi Jason,
> >
> > (note: Kafka 0.8.2 GA)
> > Got some new info below! Could be a Kafka metrics bug....
> >
> > On Thu, Jun 4, 2015 at 2:11 PM, Jason Rosenberg <j...@squareup.com> wrote:
> >> I assume you are looking at a 'MaxLag' metric, which reports the worst
> >> case lag over a set of partitions.
> >
> > No, we're looking at MBeans that look like this one:
> >
> > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=10
> > Value java.lang.Object = 0
> >
> >> Are you consuming multiple partitions, and maybe one of them is stuck?
> >
> > Don't think so... Maybe what we are seeing is a Kafka bug.
> >
> > Here is what we just discovered:
> >
> > Dumped JMX on the consumer and we see this:
> >
> > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=24
> > Value java.lang.Object = 81560
> >
> > This 81560 is also what we see in SPM - see the flat line here:
> > https://apps.sematext.com/spm-reports/s/eQ9WhLegW9
> > (you can hover over a datapoint on that 81K line to see server name,
> > topic, and partition)
> >
> > This 81560 is just not going down. If I look at JMX in 5 minutes, it will
> > show the same value - the ConsumerLag of 81560!
> >
> > BUT, this gives different numbers:
> >
> > /usr/lib/kafka_2.8.0-0.8.1.1/bin/kafka-run-class.sh \
> >   kafka.tools.ConsumerOffsetChecker --zkconnect localhost:2181 \
> >   --group af_servers | grep spm_cluster_topic-new-cdh
> >
> > af_servers spm_cluster_topic-new-cdh 24 *355209634* *355249858* 40224 af_servers_spm-afs-6.prod.sematext-1433430424202-e366dfdf-0
> >
> > The delta between the bolded numbers is NOT 81560. And if I run this
> > command N times the delta keeps going down, because the consumer is
> > catching up. Just like you'd expect.
> >
> > But the JMX number remains constant <== could this be a Kafka metrics/JMX
> > bug?
> >
> > Thanks,
> > Otis
> > --
> > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > Solr & Elasticsearch Support * http://sematext.com/
> >
> >> On Tue, Jun 2, 2015 at 4:00 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > I've noticed that when we restart our Kafka consumers our consumer lag
> >> > metric sometimes looks "weird".
> >> >
> >> > Here's an example: https://apps.sematext.com/spm-reports/s/0Hq5zNb4hH
> >> >
> >> > You can see lag go up around 15:00, when some consumers were restarted.
> >> > The "weird" thing is that the lag remains flat!
> >> > How could it remain flat if consumers are running? (they have enough
> >> > juice to catch up!)
> >> >
> >> > What I think is happening is this:
> >> > 1) consumers are initially not really lagging
> >> > 2) consumers get stopped
> >> > 3) lag grows
> >> > 4) consumers get started again
> >> > 5) something shifts around... not sure what...
> >> > 6) consumers start consuming, and there is actually no lag, but the
> >> > offsets written to ZK some time during 3) don't get updated, because
> >> > after restart consumers are reading from somewhere else, not from the
> >> > partition(s) whose lag and offset delta jumped during 3)
> >> >
> >> > Oh, and:
> >> > 7) Kafka JMX still exposes all offsets, even those for partitions that
> >> > are no longer being read, so the consumer lag metric remains
> >> > constant/flat, even though consumers are actually not lagging on the
> >> > partitions from which they are now consuming.
> >> >
> >> > What bugs me is 7), because reading lag info from JMX looks like it's
> >> > "lying".
> >> >
> >> > Does this sound crazy or reasonable?
> >> >
> >> > If anyone has any comments/advice/suggestions for what one can do
> >> > about this, I'm all ears!
> >> >
> >> > Thanks,
> >> > Otis
> >> > --
> >> > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> >> > Solr & Elasticsearch Support * http://sematext.com/
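The by-hand offset deltas in this thread can be checked programmatically. Here is a minimal Python sketch, assuming the 0.8-era ConsumerOffsetChecker column order (Group, Topic, Pid, Offset, logSize, Lag, Owner) and using two of the rows quoted above:

```python
# Recompute per-partition consumer lag from ConsumerOffsetChecker output.
# Assumed column order (Kafka 0.8 era): Group Topic Pid Offset logSize Lag Owner
sample = """\
af_servers spm_cluster_topic-new-cdh 24 355209634 355249858 40224 af_servers_spm-afs-6.prod.sematext-1433430424202-e366dfdf-0
af_servers spm_cluster_topic-new-cdh 18 220551962 220586078 34116 af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
"""

def lags(text):
    """Return {partition: lag}, where lag = log-end offset - committed offset."""
    result = {}
    for line in text.splitlines():
        group, topic, pid, offset, log_size, lag, owner = line.split()
        computed = int(log_size) - int(offset)
        assert computed == int(lag), "tool's Lag column should equal the delta"
        result[int(pid)] = computed
    return result

print(lags(sample))  # {24: 40224, 18: 34116}
```

Rerunning this over successive snapshots (as Otis does with the CLI) shows the lag shrinking as the consumer catches up, while the stale JMX gauge stays frozen at 81560.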