I assume you are looking at a 'MaxLag' metric, which reports the worst case lag over a set of partitions. Are you consuming multiple partitions, and maybe one of them is stuck?
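
To illustrate what I mean, here is a toy sketch of a "max lag over partitions" calculation. The offsets and the helper names (`partition_lags`, `max_lag`) are made up for the example, and this is not how Kafka computes the metric internally. It only shows why a single stuck partition can keep the reported value flat while the other partitions catch up.

```python
def partition_lags(log_end_offsets, committed_offsets):
    """Per-partition lag: log end offset minus the consumer's committed offset.

    Both arguments are dicts keyed by partition id (hypothetical inputs).
    """
    return {
        partition: max(0, end - committed_offsets[partition])
        for partition, end in log_end_offsets.items()
    }


def max_lag(lags):
    """Worst-case lag across all partitions; 0 if there are none."""
    return max(lags.values(), default=0)


# Made-up snapshot right after a restart: partition 2 is stuck at offset 400.
log_end = {0: 1000, 1: 1000, 2: 1000}
committed_before = {0: 900, 1: 850, 2: 400}
lags_before = partition_lags(log_end, committed_before)

# Later snapshot: partitions 0 and 1 have caught up, partition 2 has not moved.
committed_after = {0: 1000, 1: 1000, 2: 400}
lags_after = partition_lags(log_end, committed_after)

print("before:", lags_before, "max:", max_lag(lags_before))
print("after: ", lags_after, "max:", max_lag(lags_after))
```

In both snapshots the maximum is driven entirely by partition 2, so a graph of the max would look flat even though two of the three partitions are fully caught up. Looking at per-partition lag rather than only the max should tell you whether that is what is going on.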

On Tue, Jun 2, 2015 at 4:00 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:

> Hi,
>
> I've noticed that when we restart our Kafka consumers our consumer lag
> metric sometimes looks "weird".
>
> Here's an example: https://apps.sematext.com/spm-reports/s/0Hq5zNb4hH
>
> You can see lag go up around 15:00, when some consumers were restarted.
> The "weird" thing is that the lag remains flat!
> How could it remain flat if consumers are running? (they have enough juice
> to catch up!)
>
> What I think is happening is this:
> 1) consumers are initially not really lagging
> 2) consumers get stopped
> 3) lag grows
> 4) consumers get started again
> 5) something shifts around...not sure what...
> 6) consumers start consuming, and there is actually no lag, but the offsets
> written to ZK sometime during 3) don't get updated because after restart
> consumers are reading from somewhere else, not from partition(s) whose lag
> and offset delta jumped during 3)
>
> Oh, and:
> 7) Kafka JMX still exposes all offsets, even those for partitions that are
> no longer being read, so the consumer lag metric remains constant/flat,
> even though consumers are actually not lagging on partitions from which
> they are now consuming.
>
> What bugs me is 7), because reading lag info from JMX looks like it's
> "lying".
>
> Does this sound crazy or reasonable?
>
> If anyone has any comments/advice/suggestions for what one can do about
> this, I'm all ears!
>
> Thanks,
> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/