Hi Jason,

(note: Kafka 0.8.2 GA)

Got some new info below! Could be a Kafka metrics bug...

On Thu, Jun 4, 2015 at 2:11 PM, Jason Rosenberg <j...@squareup.com> wrote:

> I assume you are looking at a 'MaxLag' metric, which reports the worst case
> lag over a set of partitions.

No, we're looking at MBeans that look like this one:

kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=10
    Value java.lang.Object = 0

> Are you consuming multiple partitions, and maybe one of them is stuck?

Don't think so... Maybe what we are seeing is a Kafka bug.

Here is what we just discovered:

Dumped JMX on the consumer and we see this:

kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=24
    Value java.lang.Object = 81560

This 81560 is also what we see in SPM - see the flat line here:
https://apps.sematext.com/spm-reports/s/eQ9WhLegW9
(you can hover over a datapoint on that 81K line to see server name, topic, and partition)

This 81560 is just not going down. If I look at JMX in 5 minutes, it will show the same value - the ConsumerLag of 81560!

BUT, this gives different numbers:

/usr/lib/kafka_2.8.0-0.8.1.1/bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zkconnect localhost:2181 --group af_servers | grep spm_cluster_topic-new-cdh

af_servers spm_cluster_topic-new-cdh 24 *355209634* *355249858* 40224 af_servers_spm-afs-6.prod.sematext-1433430424202-e366dfdf-0

The delta between the bolded numbers is NOT 81560. And if I run this command N times the delta keeps going down, because the consumer is catching up. Just like you'd expect. But the JMX number remains constant <== could this be a Kafka metrics/JMX bug?
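
To make the comparison above concrete, here is a minimal Python sketch of the arithmetic. It is only an illustration, not something we run in production: the function name and the dict keys are made up, and it assumes the ConsumerOffsetChecker columns are Group, Topic, Pid, Offset, logSize, Lag, Owner. The asterisks are stripped because they only mark the bolded numbers in this email.

```python
def parse_offset_checker_line(line):
    """Parse one data row of ConsumerOffsetChecker output.

    Assumed column order: Group, Topic, Pid, Offset, logSize, Lag, Owner.
    """
    # The '*' characters are email bold markers, not part of the tool output.
    fields = line.replace("*", "").split()
    return {
        "group": fields[0],
        "topic": fields[1],
        "partition": int(fields[2]),
        "offset": int(fields[3]),      # last offset committed by the consumer
        "log_size": int(fields[4]),    # latest offset in the partition log
        "lag": int(fields[5]),         # lag as reported by the tool
        "owner": fields[6] if len(fields) > 6 else None,
    }


# The row pasted above, and the value read from the JMX ConsumerLag MBean.
row = parse_offset_checker_line(
    "af_servers spm_cluster_topic-new-cdh 24 *355209634* *355249858* 40224 "
    "af_servers_spm-afs-6.prod.sematext-1433430424202-e366dfdf-0"
)
jmx_consumer_lag = 81560

computed_lag = row["log_size"] - row["offset"]
print("computed lag:", computed_lag)
print("tool-reported lag:", row["lag"])
print("JMX ConsumerLag:", jmx_consumer_lag)
print("JMX matches offsets:", computed_lag == jmx_consumer_lag)
```

For the row above, logSize minus Offset agrees with the Lag column (40224) and disagrees with the 81560 that JMX keeps reporting for the same partition.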

Thanks,
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/

> On Tue, Jun 2, 2015 at 4:00 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:
>
> > Hi,
> >
> > I've noticed that when we restart our Kafka consumers our consumer lag
> > metric sometimes looks "weird".
> >
> > Here's an example: https://apps.sematext.com/spm-reports/s/0Hq5zNb4hH
> >
> > You can see lag go up around 15:00, when some consumers were restarted.
> > The "weird" thing is that the lag remains flat!
> > How could it remain flat if consumers are running? (they have enough
> > juice to catch up!)
> >
> > What I think is happening is this:
> > 1) consumers are initially not really lagging
> > 2) consumers get stopped
> > 3) lag grows
> > 4) consumers get started again
> > 5) something shifts around...not sure what...
> > 6) consumers start consuming, and there is actually no lag, but the
> > offsets written to ZK sometime during 3) don't get updated because after
> > restart consumers are reading from somewhere else, not from partition(s)
> > whose lag and offset delta jumped during 3)
> >
> > Oh, and:
> > 7) Kafka JMX still exposes all offsets, even those for partitions that
> > are no longer being read, so the consumer lag metric remains
> > constant/flat, even though consumers are actually not lagging on
> > partitions from which they are now consuming.
> >
> > What bugs me is 7), because reading lag info from JMX looks like it's
> > "lying".
> >
> > Does this sound crazy or reasonable?
> >
> > If anyone has any comments/advice/suggestions for what one can do about
> > this, I'm all ears!
> >
> > Thanks,
> > Otis