What code do you use to poll the JMX end-point?
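
(For reference, a minimal sketch of one way such a poller could look in plain Java/JMX. This is not the actual collector used here; the localhost:9999 endpoint is an illustrative assumption, and the object name and its "Value" attribute are copied from the MBean dumps further down the thread.)

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class ConsumerLagPoller {
        public static void main(String[] args) throws Exception {
            // Assumes the consumer JVM exposes remote JMX on port 9999
            // (e.g. -Dcom.sun.management.jmxremote.port=9999); host and port are illustrative.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbsc = connector.getMBeanServerConnection();

                // Same MBean naming pattern as in the dumps quoted below in this thread.
                ObjectName lagBean = new ObjectName(
                        "kafka.server:type=FetcherLagMetrics,name=ConsumerLag,"
                        + "clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=24");

                // The dumps show a single "Value" attribute holding the lag.
                Object lag = mbsc.getAttribute(lagBean, "Value");
                System.out.println(lagBean + " = " + lag);
            }
        }
    }
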
On Thu, Jun 4, 2015 at 4:55 PM, Otis Gospodnetić <otis.gospodne...@gmail.com> wrote:

> Hi,
>
> On Thu, Jun 4, 2015 at 4:26 PM, Scott Reynolds <sreyno...@twilio.com> wrote:
>
> > I believe the JMX metrics reflect the consumer PRIOR to committing offsets
> > to Kafka / Zookeeper. But when you query from the command line using the
> > kafka tools, you are just getting the committed offsets.
>
> Even if that were the case, and maybe it is, it doesn't explain why the
> ConsumerLag in JMX often remains *completely constant*... forever... until
> the consumer is restarted. You see what I mean?
>
> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
> > On Thu, Jun 4, 2015 at 1:23 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > Here's something potentially useful.
> > >
> > > 1) Before: https://apps.sematext.com/spm-reports/s/eQ9WhLegW9 - the "flat
> > > Lag situation"
> > >
> > > 2) I restarted the consumer whose lag is shown in the above graph
> > >
> > > 3) After restart: https://apps.sematext.com/spm-reports/s/4YGkcUP9ms - NO
> > > lag at all!?
> > >
> > > So that 81560 Lag value that was stuck in JMX is gone. Went down to 0.
> > > Kind of makes sense - the whole consumer was restarted, the consumer/java
> > > process was restarted, everything that was in JMX got reset, and if there
> > > is truly no consumer lag it makes sense that the values in JMX are 0.
> > >
> > > HOWEVER, is the Lag *really* always *exactly* 0? No way. Look what Offset
> > > Checker shows for this one consumer:
> > >
> > > af_servers  spm_cluster_topic-new-cdh  18  220551962  220586078  34116  af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> > > af_servers  spm_cluster_topic-new-cdh  19  161936440  161960377  23937  af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> > > af_servers  spm_cluster_topic-new-cdh  20  248308642  248340350  31708  af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> > > af_servers  spm_cluster_topic-new-cdh  21  259901355  259934911  33556  af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> > > af_servers  spm_cluster_topic-new-cdh  22  205274547  205296950  22403  af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> > > af_servers  spm_cluster_topic-new-cdh  23  167756654  167780028  23374  af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> > > af_servers  spm_cluster_topic-new-cdh  24  357517989  357574627  56638  af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> > > af_servers  spm_cluster_topic-new-cdh  25  194313232  194338154  24922  af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> > >
> > > You can see Lag is NOT == 0 for any of the partitions!
> > >
> > > And yet, look what JMX is showing now -- all Lag for all partitions is
> > > supposedly 0.
> > > Always:
> > >
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=17  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=18  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=18  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=18  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=18  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=19  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=19  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=19  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=19  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=20  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=20  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=20  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=20  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=21  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=21  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=21  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=21  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=22  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=22  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=22  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=22  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=23  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=23  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=23  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=23  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=24  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=24  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=24  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=24  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=25  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=25  Value java.lang.Object = 0
> > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=25  Value java.lang.Object = 0
> > >
> > > Somebody's lying here. :)
> > >
> > > Any ideas?
> > >
> > > Thanks,
> > > Otis
> > > --
> > > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > > Solr & Elasticsearch Support * http://sematext.com/
> > >
> > > On Thu, Jun 4, 2015 at 3:54 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:
> > >
> > > > Hi Jason,
> > > >
> > > > (note: Kafka 0.8.2 GA)
> > > > Got some new info below! Could be a Kafka metrics bug....
> > > >
> > > > On Thu, Jun 4, 2015 at 2:11 PM, Jason Rosenberg <j...@squareup.com> wrote:
> > > >
> > > >> I assume you are looking at a 'MaxLag' metric, which reports the worst
> > > >> case lag over a set of partitions.
> > > >
> > > > No, we're looking at MBeans that look like this one:
> > > >
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=10  Value java.lang.Object = 0
> > > >
> > > >> Are you consuming multiple partitions, and maybe one of them is stuck?
> > > >
> > > > Don't think so... Maybe what we are seeing is a Kafka bug.
> > > > Here is what we just discovered:
> > > >
> > > > Dumped JMX on the consumer and we see this:
> > > >
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=24  Value java.lang.Object = 81560
> > > >
> > > > This 81560 is also what we see in SPM - see the flat line here:
> > > > https://apps.sematext.com/spm-reports/s/eQ9WhLegW9
> > > > (you can hover over a datapoint on that 81K line to see server name,
> > > > topic, and partition)
> > > >
> > > > This 81560 is just not going down. If I look at JMX in 5 minutes, it
> > > > will show the same value - the ConsumerLag of 81560!
> > > >
> > > > BUT, this gives different numbers:
> > > >
> > > > /usr/lib/kafka_2.8.0-0.8.1.1/bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zkconnect localhost:2181 --group af_servers | grep spm_cluster_topic-new-cdh
> > > >
> > > > af_servers  spm_cluster_topic-new-cdh  24  *355209634*  *355249858*  40224  af_servers_spm-afs-6.prod.sematext-1433430424202-e366dfdf-0
> > > >
> > > > The delta between the bolded numbers is NOT 81560. And if I run this
> > > > command N times the delta keeps going down, because the consumer is
> > > > catching up. Just like you'd expect.
> > > >
> > > > But the JMX number remains constant <== could this be a Kafka
> > > > metrics/JMX bug?
> > > >
> > > > Thanks,
> > > > Otis
> > > > --
> > > > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > > > Solr & Elasticsearch Support * http://sematext.com/
> > > >
> > > >> On Tue, Jun 2, 2015 at 4:00 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:
> > > >>
> > > >> > Hi,
> > > >> >
> > > >> > I've noticed that when we restart our Kafka consumers our consumer lag
> > > >> > metric sometimes looks "weird".
> > > >> >
> > > >> > Here's an example: https://apps.sematext.com/spm-reports/s/0Hq5zNb4hH
> > > >> >
> > > >> > You can see lag go up around 15:00, when some consumers were restarted.
> > > >> > The "weird" thing is that the lag remains flat!
> > > >> > How could it remain flat if consumers are running? (they have enough
> > > >> > juice to catch up!)
> > > >> >
> > > >> > What I think is happening is this:
> > > >> > 1) consumers are initially not really lagging
> > > >> > 2) consumers get stopped
> > > >> > 3) lag grows
> > > >> > 4) consumers get started again
> > > >> > 5) something shifts around... not sure what...
> > > >> > 6) consumers start consuming, and there is actually no lag, but the
> > > >> > offsets written to ZK sometime during 3) don't get updated, because after
> > > >> > the restart consumers are reading from somewhere else, not from the
> > > >> > partition(s) whose lag and offset delta jumped during 3)
> > > >> >
> > > >> > Oh, and:
> > > >> > 7) Kafka JMX still exposes all offsets, even those for partitions that
> > > >> > are no longer being read, so the consumer lag metric remains
> > > >> > constant/flat, even though consumers are actually not lagging on the
> > > >> > partitions from which they are now consuming.
> > > >> >
> > > >> > What bugs me is 7), because reading lag info from JMX looks like it's
> > > >> > "lying".
> > > >> >
> > > >> > Does this sound crazy or reasonable?
> > > >> > If anyone has any comments/advice/suggestions for what one can do
> > > >> > about this, I'm all ears!
> > > >> >
> > > >> > Thanks,
> > > >> > Otis
> > > >> > --
> > > >> > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > > >> > Solr & Elasticsearch Support * http://sematext.com/
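
(One way to cross-check the per-partition ConsumerLag MBeans against the Lag column from ConsumerOffsetChecker is to enumerate all of them with a JMX wildcard query and print each value, then compare partition by partition. A minimal sketch, assuming the MBean naming from the dumps above and an illustrative localhost:9999 JMX endpoint; this is not code from the thread.)

    import java.util.Set;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class AllPartitionLagDump {
        public static void main(String[] args) throws Exception {
            // Illustrative endpoint; point this at the consumer JVM's JMX port.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbsc = connector.getMBeanServerConnection();

                // Wildcards over topic and partition match one bean per (topic, partition) pair,
                // following the kafka.server:type=FetcherLagMetrics,... names in the dumps.
                ObjectName pattern = new ObjectName(
                        "kafka.server:type=FetcherLagMetrics,name=ConsumerLag,"
                        + "clientId=af_servers,topic=*,partition=*");

                Set<ObjectName> beans = mbsc.queryNames(pattern, null);
                for (ObjectName bean : beans) {
                    Object lag = mbsc.getAttribute(bean, "Value");
                    System.out.printf("%s partition=%s lag=%s%n",
                            bean.getKeyProperty("topic"),
                            bean.getKeyProperty("partition"),
                            lag);
                }
            }
        }
    }

A stale bean, like the partition 24 entry stuck at 81560, would show up here as a value that never moves even while the checker's Lag column keeps shrinking.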