I assume you are looking at a 'MaxLag' metric, which reports the worst-case
lag over a set of partitions.  Are you consuming multiple partitions, and
maybe one of them is stuck?
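
To illustrate why a single stuck partition would keep such a metric flat, here is a rough sketch. It is not Kafka's actual implementation: the max_lag and per_partition_lag helpers and all offset values are made up for the example, and I am assuming lag is simply the log end offset minus the last committed offset.

```python
def per_partition_lag(log_end_offsets, committed_offsets):
    """Lag per partition: log end offset minus last committed offset.

    Hypothetical helper; partitions with no committed offset count from 0.
    """
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


def max_lag(log_end_offsets, committed_offsets):
    """Worst-case lag over all partitions, as a 'MaxLag'-style metric would report."""
    return max(per_partition_lag(log_end_offsets, committed_offsets).values())


# Made-up offsets: partitions 0 and 1 are caught up (or nearly so),
# while partition 2 has not had its offset updated since the restart.
log_end = {0: 1000, 1: 1000, 2: 1000}
committed = {0: 1000, 1: 998, 2: 400}

lags = per_partition_lag(log_end, committed)
worst = max_lag(log_end, committed)

print(lags)
print(worst)
```

With these numbers the reported maximum stays at 600 no matter how quickly partitions 0 and 1 are consumed, so looking at the per-partition values should show which partition, if any, is stuck.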

On Tue, Jun 2, 2015 at 4:00 PM, Otis Gospodnetic <otis.gospodne...@gmail.com
> wrote:

> Hi,
>
> I've noticed that when we restart our Kafka consumers our consumer lag
> metric sometimes looks "weird".
>
> Here's an example: https://apps.sematext.com/spm-reports/s/0Hq5zNb4hH
>
> You can see lag go up around 15:00, when some consumers were restarted.
> The "weird" thing is that the lag remains flat!
> How could it remain flat if consumers are running? (they have enough juice
> to catch up!)
>
> What I think is happening is this:
> 1) consumers are initially not really lagging
> 2) consumers get stopped
> 3) lag grows
> 4) consumers get started again
> 5) something shifts around...not sure what...
> 6) consumers start consuming, and there is actually no lag, but the offsets
> written to ZK sometime during 3) don't get updated because after restart
> consumers are reading from somewhere else, not from partition(s) whose lag
> and offset delta jumped during 3)
>
> Oh, and:
> 7) Kafka JMX still exposes all offsets, even those for partitions that are
> no longer being read, so the consumer lag metric remains constant/flat,
> even though consumers are actually not lagging on partitions from which
> they are now consuming.
>
> What bugs me is 7), because reading lag info from JMX looks like it's
> "lying".
>
> Does this sound crazy or reasonable?
>
> If anyone has any comments/advice/suggestions for what one can do about
> this, I'm all ears!
>
> Thanks,
> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
