I believe the JMX metrics reflect the consumer's position PRIOR to committing offsets to Kafka / ZooKeeper. But when you query from the command line using the Kafka tools, you are just getting the committed offsets.
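That distinction - the fetcher's in-flight position versus the last committed offset - can be sketched in a few lines of Python. The numbers reuse partition 24 from further down the thread; the fetch offset is a hypothetical value, since the consumer's in-memory fetch position is never shown anywhere in the thread:

```python
# Two different "lag" definitions for the same partition.
# The fetcher-side JMX metric is derived from the consumer's current fetch
# position, while ConsumerOffsetChecker uses the last *committed* offset.

log_end_offset = 355_249_858    # latest offset in the partition (logSize column)
committed_offset = 355_209_634  # last offset committed to ZooKeeper (Offset column)
fetch_offset = 355_230_000      # hypothetical in-memory fetch position (assumption)

committed_lag = log_end_offset - committed_offset  # what the CLI tool reports
fetch_lag = log_end_offset - fetch_offset          # what the fetcher-side metric tracks

print(committed_lag)  # 40224 - matches the ConsumerOffsetChecker row in the thread
print(fetch_lag)      # 19858 - smaller, because fetching runs ahead of commits
```

Because fetching runs ahead of offset commits, the two numbers legitimately differ at any instant; a *stuck* JMX value, though, points at a metric that has stopped updating, not at this difference.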
On Thu, Jun 4, 2015 at 1:23 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:
> Hi,
>
> Here's something potentially useful.
>
> 1) Before: https://apps.sematext.com/spm-reports/s/eQ9WhLegW9 - the "flat
> Lag situation"
>
> 2) I restarted the consumer whose lag is shown in the above graph
>
> 3) After restart: https://apps.sematext.com/spm-reports/s/4YGkcUP9ms - NO
> lag at all!?
>
> So that 81560 Lag value that was stuck in JMX is gone. Went down to 0.
> Kind of makes sense - the whole consumer was restarted, the consumer/Java
> process was restarted, everything that was in JMX got reset, and if there
> is truly no consumer lag it makes sense that the values in JMX are 0.
>
> HOWEVER, is the Lag *really* always *exactly* 0? No way. Look what Offset
> Checker shows for this one consumer (columns: Group, Topic, Pid, Offset,
> logSize, Lag, Owner):
>
> af_servers spm_cluster_topic-new-cdh 18 220551962 220586078 34116 af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> af_servers spm_cluster_topic-new-cdh 19 161936440 161960377 23937 af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> af_servers spm_cluster_topic-new-cdh 20 248308642 248340350 31708 af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> af_servers spm_cluster_topic-new-cdh 21 259901355 259934911 33556 af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> af_servers spm_cluster_topic-new-cdh 22 205274547 205296950 22403 af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> af_servers spm_cluster_topic-new-cdh 23 167756654 167780028 23374 af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> af_servers spm_cluster_topic-new-cdh 24 357517989 357574627 56638 af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> af_servers spm_cluster_topic-new-cdh 25 194313232 194338154 24922 af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
>
> You can see Lag is NOT == 0 for any of the partitions!
> And yet, look what JMX is showing now -- all Lag for all partitions is
> supposedly 0. Always:
>
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=17
> Value java.lang.Object = 0
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=18
> Value java.lang.Object = 0
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=18
> Value java.lang.Object = 0
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=18
> Value java.lang.Object = 0
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=18
> Value java.lang.Object = 0
> <snip - the remaining topic/partition combinations, partitions 19 through
> 25 of the same topics, all likewise report Value java.lang.Object = 0>
>
> Somebody's lying here. :)
>
> Any ideas?
>
> Thanks,
> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
> On Thu, Jun 4, 2015 at 3:54 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:
> > Hi Jason,
> >
> > (note: Kafka 0.8.2 GA)
> > Got some new info below! Could be a Kafka metrics bug....
> >
> > On Thu, Jun 4, 2015 at 2:11 PM, Jason Rosenberg <j...@squareup.com> wrote:
> >> I assume you are looking at a 'MaxLag' metric, which reports the worst
> >> case lag over a set of partitions.
> >
> > No, we're looking at MBeans that look like this one:
> >
> > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=10
> > Value java.lang.Object = 0
> >
> >> Are you consuming multiple partitions, and maybe one of them is stuck?
> >
> > Don't think so... Maybe what we are seeing is a Kafka bug.
> >
> > Here is what we just discovered:
> >
> > Dumped JMX on the consumer and we see this:
> >
> > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=24
> > Value java.lang.Object = 81560
> >
> > This 81560 is also what we see in SPM - see the flat line here:
> > https://apps.sematext.com/spm-reports/s/eQ9WhLegW9
> > (you can hover over a datapoint on that 81K line to see server name,
> > topic, and partition)
> >
> > This 81560 is just not going down. If I look at JMX in 5 minutes, it will
> > show the same value - the ConsumerLag of 81560!
> >
> > BUT, this gives different numbers:
> >
> > /usr/lib/kafka_2.8.0-0.8.1.1/bin/kafka-run-class.sh \
> >   kafka.tools.ConsumerOffsetChecker --zkconnect localhost:2181 \
> >   --group af_servers | grep spm_cluster_topic-new-cdh
> >
> > af_servers spm_cluster_topic-new-cdh 24 *355209634* *355249858* 40224 af_servers_spm-afs-6.prod.sematext-1433430424202-e366dfdf-0
> >
> > The delta between the bolded numbers is NOT 81560. And if I run this
> > command N times the delta keeps going down, because the consumer is
> > catching up. Just like you'd expect.
> >
> > But the JMX number remains constant <== could this be a Kafka metrics/JMX
> > bug?
> >
> > Thanks,
> > Otis
> > --
> > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > Solr & Elasticsearch Support * http://sematext.com/
> >
> >> On Tue, Jun 2, 2015 at 4:00 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > I've noticed that when we restart our Kafka consumers our consumer lag
> >> > metric sometimes looks "weird".
> >> >
> >> > Here's an example: https://apps.sematext.com/spm-reports/s/0Hq5zNb4hH
> >> >
> >> > You can see lag go up around 15:00, when some consumers were restarted.
> >> > The "weird" thing is that the lag remains flat!
> >> > How could it remain flat if consumers are running? (they have enough
> >> > juice to catch up!)
> >> >
> >> > What I think is happening is this:
> >> > 1) consumers are initially not really lagging
> >> > 2) consumers get stopped
> >> > 3) lag grows
> >> > 4) consumers get started again
> >> > 5) something shifts around... not sure what...
> >> > 6) consumers start consuming, and there is actually no lag, but the
> >> > offsets written to ZK some time during 3) don't get updated, because
> >> > after restart consumers are reading from somewhere else, not from the
> >> > partition(s) whose lag and offset delta jumped during 3)
> >> >
> >> > Oh, and:
> >> > 7) Kafka JMX still exposes all offsets, even those for partitions that
> >> > are no longer being read, so the consumer lag metric remains
> >> > constant/flat, even though consumers are actually not lagging on the
> >> > partitions from which they are now consuming.
> >> >
> >> > What bugs me is 7), because reading lag info from JMX looks like it's
> >> > "lying".
> >> >
> >> > Does this sound crazy or reasonable?
> >> >
> >> > If anyone has any comments/advice/suggestions for what one can do
> >> > about this, I'm all ears!
> >> >
> >> > Thanks,
> >> > Otis
> >> > --
> >> > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> >> > Solr & Elasticsearch Support * http://sematext.com/
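The by-hand offset deltas in this thread can be checked programmatically. Here is a minimal Python sketch, assuming the 0.8-era ConsumerOffsetChecker column order (Group, Topic, Pid, Offset, logSize, Lag, Owner) and using two of the rows quoted above:

```python
# Recompute per-partition consumer lag from ConsumerOffsetChecker output.
# Assumed column order (Kafka 0.8 era): Group Topic Pid Offset logSize Lag Owner
sample = """\
af_servers spm_cluster_topic-new-cdh 24 355209634 355249858 40224 af_servers_spm-afs-6.prod.sematext-1433430424202-e366dfdf-0
af_servers spm_cluster_topic-new-cdh 18 220551962 220586078 34116 af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
"""

def lags(text):
    """Return {partition: lag}, where lag = log-end offset - committed offset."""
    result = {}
    for line in text.splitlines():
        group, topic, pid, offset, log_size, lag, owner = line.split()
        computed = int(log_size) - int(offset)
        assert computed == int(lag), "tool's Lag column should equal the delta"
        result[int(pid)] = computed
    return result

print(lags(sample))  # {24: 40224, 18: 34116}
```

Rerunning this over successive snapshots (as Otis does with the CLI) shows the lag shrinking as the consumer catches up, while the stale JMX gauge stays frozen at 81560.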