Re: Consumer lag lies - orphaned offsets?

2015-06-05 Thread Joel Koshy
On Fri, Jun 05, 2015 at 12:53:00AM -0400, Otis Gospodnetić wrote:
 Hi Joel,
 
 On Thu, Jun 4, 2015 at 8:52 PM, Joel Koshy jjkosh...@gmail.com wrote:
 
  Hi Otis,
 
  Yes, this is a limitation in the old consumer, i.e., a number of
  per-topic/partition mbeans remain registered even after a rebalance.
  Those need to be de-registered. So if you stop consuming from some
  partition after a rebalance, its lag mbean currently remains, which is
  why it stays flat. This is a known issue.
 
 
 I see.  Is / should this be considered a bug?  Something worth fixing for
 0.8.3?

Yes, I would call it a bug, but it hasn't been a high priority so far,
mainly because (I think) most users monitor lag with committed
offsets. This is what we do at LinkedIn, for instance, as Todd mentioned
in his reply.
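
For illustration, here is a minimal sketch of that committed-offset
approach: per-partition lag is the broker's log end offset minus the
group's committed offset. It is written against a newer Java consumer API
than the one discussed in this thread (endOffsets() arrived around
0.10.1) purely to show the idea; the group and topic names are the ones
from this thread, and this is not the LinkedIn tooling Todd described.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

public class CommittedOffsetLag {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "af_servers");          // consumer group to inspect
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = new ArrayList<>();
            for (PartitionInfo p : consumer.partitionsFor("spm_cluster_topic-new-cdh")) {
                partitions.add(new TopicPartition(p.topic(), p.partition()));
            }
            // Broker-side log end offsets (the logSize column in ConsumerOffsetChecker)
            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);
            for (TopicPartition tp : partitions) {
                OffsetAndMetadata committed = consumer.committed(tp);
                long lag = (committed == null)
                        ? -1L                         // nothing committed yet for this partition
                        : endOffsets.get(tp) - committed.offset();
                System.out.printf("%s-%d lag=%d%n", tp.topic(), tp.partition(), lag);
            }
        }
    }
}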

 
 Also, you say this is the limitation of the old consumer.  Does that mean
 that this problem goes away completely if one uses the new consumer?

This is sort of N/A at the moment, as per-partition lag has not yet
been added to the new consumer. It does have the equivalent of max-lag.
If we add per-partition lag sensors, we would need to be able to remove
those sensors, where applicable, after a rebalance.

 
  On the restart, the lag goes down to zero because the mbeans get
  recreated and the consumer starts fetching. If the fetch request
  reads up to the end of the log then the mbean will report zero. Your
  actual committed offset may be behind, though, which is why your true
  lag is > 0.
 
  The lag mbeans are useful, but have a number of limitations - they
  depend on active fetches in progress;
 
 
 What do you mean by this?

If the fetcher threads die for any reason then fetches stop and the
consumer continues to report lag off the last fetched offset and the
last reported log end offset. So it will stay flat when it should be
increasing (since the log end offset on the broker is increasing if
producers are still sending to that partition).

Also, the old consumer pre-fetches chunks and buffers these
internally. If the chunk queue is full, fetches stop; and if the
consumer is extremely slow in actually processing the messages off
each chunk, then lag can stay flat (perhaps even at zero) until the
next chunk is fetched, while the consumer is still iterating over
messages from the previous chunk.
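
As a toy illustration of that buffering effect (a simplified model, not
the actual consumer code; the bounded queue is roughly what
queued.max.message.chunks bounds in the old consumer): the fetcher blocks
as soon as the chunk queue fills up, so nothing driven by fetch activity,
including a fetch-based lag gauge, changes while the application slowly
drains an earlier chunk.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ChunkQueueModel {
    public static void main(String[] args) throws InterruptedException {
        // Bounded chunk queue standing in for the old consumer's internal buffer
        BlockingQueue<String> chunkQueue = new ArrayBlockingQueue<>(2);

        Thread fetcher = new Thread(() -> {
            try {
                for (int chunk = 0; ; chunk++) {
                    chunkQueue.put("chunk-" + chunk);   // blocks once the queue is full
                    System.out.println("fetched chunk-" + chunk);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();     // stop fetching
            }
        });
        fetcher.start();

        // Slow application thread: while it grinds through one chunk, no new
        // fetches happen, so a lag gauge driven by fetch activity stays flat.
        for (int i = 0; i < 5; i++) {
            String chunk = chunkQueue.take();
            System.out.println("slowly processing " + chunk + " ...");
            Thread.sleep(2000);
        }
        fetcher.interrupt();
    }
}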

 
  they also do not exactly
  correspond with your actual processed (and committed) offset.
 
 Right.  Though it should be updated in near real-time, so it will
 approximately match the reality, no?

Yes - I think it is fair to say that in most cases the lag mbeans
should be accurate within a small delta of the true lag. Although we
are trying to avoid further non-critical development on the old
consumer, it is convenient to have these mbeans. So I think it may be
worth fixing this issue (i.e., deregistering mbeans on a rebalance).
Can you file a jira for this?
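
In plain JMX terms (the old consumer actually registers these gauges
through its metrics library, so the real fix would go through that), the
deregistration Joel describes boils down to removing the stale
per-partition ObjectName. A minimal, illustrative sketch, with the object
name taken from earlier in this thread; this is not a patch:

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class DeregisterLagMBean {
    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName(
                "kafka.server:type=FetcherLagMetrics,name=ConsumerLag,"
                + "clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=24");
        if (server.isRegistered(name)) {
            server.unregisterMBean(name);   // drop the stale per-partition gauge
        }
    }
}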

Thanks,

Joel

 
 Thanks,
 Otis
 --
 Monitoring * Alerting * Anomaly Detection * Centralized Log Management
 Solr & Elasticsearch Support * http://sematext.com/
 
 
 
  The most
  reliable way to monitor application lag is to use the committed
  offsets and the current log end offsets. Todd has been doing a lot of
  interesting work in making lag monitoring less painful and can comment
  more.
 
  Joel
 
  On Thu, Jun 04, 2015 at 04:55:44PM -0400, Otis Gospodnetić wrote:
   Hi,
  
   On Thu, Jun 4, 2015 at 4:26 PM, Scott Reynolds sreyno...@twilio.com
  wrote:
  
I believe the JMX metrics reflect the consumer PRIOR to committing
  offsets
to Kafka / Zookeeper. But when you query from the command line using
  the
kafka tools, you are just getting the committed offsets.
   
  
   Even if that were the case, and maybe it is, it doesn't explain why the
   ConsumerLag in JMX often remains *completely constant*... forever...
  until
   the consumer is restarted.  You see what I mean?
  
   Otis
   --
   Monitoring * Alerting * Anomaly Detection * Centralized Log Management
   Solr & Elasticsearch Support * http://sematext.com/
  
  
  
On Thu, Jun 4, 2015 at 1:23 PM, Otis Gospodnetic 
otis.gospodne...@gmail.com
 wrote:
   
 Hi,

 Here's something potentially useful.

 1) Before: https://apps.sematext.com/spm-reports/s/eQ9WhLegW9 - the
flat
 Lag situation

 2) I restarted the consumer whose lag is shown in the above graph

 3) After restart: https://apps.sematext.com/spm-reports/s/4YGkcUP9ms
  -
NO
 lag at all!?

 So that 81560 Lag value that was stuck in JMX is gone.  Went down to
  0.
 Kind of makes sense - the whole consumer was restarted, consumer/java
 process was restarted, everything that was in JMX got reset, and if
  there
 is truly no consumer lag it makes sense that the values in JMX are 0.

 HOWEVER, is the Lag *really* always *exactly* 0?  No way.  Look what
Offset
 Checker shows for this one consumer:

 af_servers  spm_cluster_topic-new-cdh  18  220551962
 220586078   

Re: Consumer lag lies - orphaned offsets?

2015-06-04 Thread Scott Reynolds
On Thu, Jun 4, 2015 at 1:55 PM, Otis Gospodnetić otis.gospodne...@gmail.com
 wrote:

 Hi,

 On Thu, Jun 4, 2015 at 4:26 PM, Scott Reynolds sreyno...@twilio.com
 wrote:

  I believe the JMX metrics reflect the consumer PRIOR to committing
 offsets
  to Kafka / Zookeeper. But when you query from the command line using the
  kafka tools, you are just getting the committed offsets.
 

 Even if that were the case, and maybe it is, it doesn't explain why the
 ConsumerLag in JMX often remains *completely constant*... forever... until
 the consumer is restarted.  You see what I mean?


Agreed, I was merely pointing out both are lying.



 Otis
 --
 Monitoring * Alerting * Anomaly Detection * Centralized Log Management
 Solr & Elasticsearch Support * http://sematext.com/



  On Thu, Jun 4, 2015 at 1:23 PM, Otis Gospodnetic 
  otis.gospodne...@gmail.com
   wrote:
 
   Hi,
  
   Here's something potentially useful.
  
   1) Before: https://apps.sematext.com/spm-reports/s/eQ9WhLegW9 - the
  flat
   Lag situation
  
   2) I restarted the consumer whose lag is shown in the above graph
  
   3) After restart: https://apps.sematext.com/spm-reports/s/4YGkcUP9ms -
  NO
   lag at all!?
  
   So that 81560 Lag value that was stuck in JMX is gone.  Went down to 0.
   Kind of makes sense - the whole consumer was restarted, consumer/java
   process was restarted, everything that was in JMX got reset, and if
 there
   is truly no consumer lag it makes sense that the values in JMX are 0.
  
   HOWEVER, is the Lag *really* always *exactly* 0?  No way.  Look what
  Offset
   Checker shows for this one consumer:
  
   af_servers  spm_cluster_topic-new-cdh  18  220551962
   220586078   34116
   af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
   af_servers  spm_cluster_topic-new-cdh  19  161936440
   161960377   23937
   af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
   af_servers  spm_cluster_topic-new-cdh  20  248308642
   248340350   31708
   af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
   af_servers  spm_cluster_topic-new-cdh  21  259901355
   259934911   33556
   af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
   af_servers  spm_cluster_topic-new-cdh  22  205274547
   205296950   22403
   af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
   af_servers  spm_cluster_topic-new-cdh  23  167756654
   167780028   23374
   af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
   af_servers  spm_cluster_topic-new-cdh  24  357517989
   357574627   56638
   af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
   af_servers  spm_cluster_topic-new-cdh  25  194313232
   194338154   24922
   af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
  
   You can see Lag is NOT == 0 for any of the partitions!
  
   And yet, look what JMX is showing now -- all Lag for all partitions is
   supposedly 0. Always:
  
  
  
 
 kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=17
 Value java.lang.Object = 0
  
  
 
 kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=18
 Value java.lang.Object = 0
  
  
 
 kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=18
 Value java.lang.Object = 0
  
  
 
 kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=18
 Value java.lang.Object = 0
  
  
 
 kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=18
 Value java.lang.Object = 0
  
  
 
 kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=19
 Value java.lang.Object = 0
  
  
 
 kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=19
 Value java.lang.Object = 0
  
  
 
 kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=19
 Value java.lang.Object = 0
  
  
 
 kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=19
 Value java.lang.Object = 0
  
  
 
 kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=20
 Value java.lang.Object = 0
  
  
 
 kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=20
 Value java.lang.Object = 0
  
  
 
 kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=20
 Value 

Re: Consumer lag lies - orphaned offsets?

2015-06-04 Thread Joel Koshy
Hi Otis,

Yes, this is a limitation in the old consumer, i.e., a number of
per-topic/partition mbeans remain registered even after a rebalance.
Those need to be de-registered. So if you stop consuming from some
partition after a rebalance, its lag mbean currently remains, which is
why it stays flat. This is a known issue.

On the restart, the lag goes down to zero because the mbeans get
recreated and the consumer starts fetching. If the fetch request
reads up to the end of the log then the mbean will report zero. Your
actual committed offset may be behind, though, which is why your true
lag is > 0.

The lag mbeans are useful, but have a number of limitations - they
depend on active fetches in progress; they also do not exactly
correspond with your actual processed (and committed) offset. The most
reliable way to monitor application lag is to use the committed
offsets and the current log end offsets. Todd has been doing a lot of
interesting work in making lag monitoring less painful and can comment
more.
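
As a concrete sketch of the committed-offset half of that: with the old
consumer discussed in this thread, committed offsets live in ZooKeeper
under /consumers/<group>/offsets/<topic>/<partition>, which is what
ConsumerOffsetChecker reads. Below is a minimal, illustrative read of one
such offset (group, topic, and partition taken from this thread); lag
would then be the broker's log end offset for that partition minus this
number, and obtaining the log end offset (e.g. via an offset request to
the broker) is left out here.

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.ZooKeeper;

public class ZkCommittedOffset {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
        // Old-consumer layout: /consumers/<group>/offsets/<topic>/<partition>
        String path = "/consumers/af_servers/offsets/spm_cluster_topic-new-cdh/24";
        byte[] data = zk.getData(path, false, null);
        long committed = Long.parseLong(new String(data, StandardCharsets.UTF_8));
        System.out.println("committed offset for partition 24 = " + committed);
        zk.close();
    }
}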

Joel

On Thu, Jun 04, 2015 at 04:55:44PM -0400, Otis Gospodnetić wrote:
 Hi,
 
 On Thu, Jun 4, 2015 at 4:26 PM, Scott Reynolds sreyno...@twilio.com wrote:
 
  I believe the JMX metrics reflect the consumer PRIOR to committing offsets
  to Kafka / Zookeeper. But when you query from the command line using the
  kafka tools, you are just getting the committed offsets.
 
 
 Even if that were the case, and maybe it is, it doesn't explain why the
 ConsumerLag in JMX often remains *completely constant*... forever... until
 the consumer is restarted.  You see what I mean?
 
 Otis
 --
 Monitoring * Alerting * Anomaly Detection * Centralized Log Management
 Solr & Elasticsearch Support * http://sematext.com/
 
 
 
  On Thu, Jun 4, 2015 at 1:23 PM, Otis Gospodnetic 
  otis.gospodne...@gmail.com
   wrote:
 
   Hi,
  
   Here's something potentially useful.
  
   1) Before: https://apps.sematext.com/spm-reports/s/eQ9WhLegW9 - the
  flat
   Lag situation
  
   2) I restarted the consumer whose lag is shown in the above graph
  
   3) After restart: https://apps.sematext.com/spm-reports/s/4YGkcUP9ms -
  NO
   lag at all!?
  
   So that 81560 Lag value that was stuck in JMX is gone.  Went down to 0.
   Kind of makes sense - the whole consumer was restarted, consumer/java
   process was restarted, everything that was in JMX got reset, and if there
   is truly no consumer lag it makes sense that the values in JMX are 0.
  
   HOWEVER, is the Lag *really* always *exactly* 0?  No way.  Look what
  Offset
   Checker shows for this one consumer:
  
   af_servers  spm_cluster_topic-new-cdh  18  220551962
   220586078   34116
   af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
   af_servers  spm_cluster_topic-new-cdh  19  161936440
   161960377   23937
   af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
   af_servers  spm_cluster_topic-new-cdh  20  248308642
   248340350   31708
   af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
   af_servers  spm_cluster_topic-new-cdh  21  259901355
   259934911   33556
   af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
   af_servers  spm_cluster_topic-new-cdh  22  205274547
   205296950   22403
   af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
   af_servers  spm_cluster_topic-new-cdh  23  167756654
   167780028   23374
   af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
   af_servers  spm_cluster_topic-new-cdh  24  357517989
   357574627   56638
   af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
   af_servers  spm_cluster_topic-new-cdh  25  194313232
   194338154   24922
   af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
  
   You can see Lag is NOT == 0 for any of the partitions!
  
   And yet, look what JMX is showing now -- all Lag for all partitions is
   supposedly 0. Always:
  
  
  
  kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=17
 Value java.lang.Object = 0
  
  
  kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=18
 Value java.lang.Object = 0
  
  
  kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=18
 Value java.lang.Object = 0
  
  
  kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=18
 Value java.lang.Object = 0
  
  
  kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=18
 Value java.lang.Object = 0
  
  
  kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=19
 Value java.lang.Object = 0
  
  
  

Re: Consumer lag lies - orphaned offsets?

2015-06-04 Thread Otis Gospodnetić
Hi,

On Thu, Jun 4, 2015 at 4:26 PM, Scott Reynolds sreyno...@twilio.com wrote:

 I believe the JMX metrics reflect the consumer PRIOR to committing offsets
 to Kafka / Zookeeper. But when you query from the command line using the
 kafka tools, you are just getting the committed offsets.


Even if that were the case, and maybe it is, it doesn't explain why the
ConsumerLag in JMX often remains *completely constant*... forever... until
the consumer is restarted.  You see what I mean?

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



 On Thu, Jun 4, 2015 at 1:23 PM, Otis Gospodnetic 
 otis.gospodne...@gmail.com
  wrote:

  Hi,
 
  Here's something potentially useful.
 
  1) Before: https://apps.sematext.com/spm-reports/s/eQ9WhLegW9 - the
 flat
  Lag situation
 
  2) I restarted the consumer whose lag is shown in the above graph
 
  3) After restart: https://apps.sematext.com/spm-reports/s/4YGkcUP9ms -
 NO
  lag at all!?
 
  So that 81560 Lag value that was stuck in JMX is gone.  Went down to 0.
  Kind of makes sense - the whole consumer was restarted, consumer/java
  process was restarted, everything that was in JMX got reset, and if there
  is truly no consumer lag it makes sense that the values in JMX are 0.
 
  HOWEVER, is the Lag *really* always *exactly* 0?  No way.  Look what
 Offset
  Checker shows for this one consumer:
 
  af_servers  spm_cluster_topic-new-cdh  18  220551962
  220586078   34116
  af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
  af_servers  spm_cluster_topic-new-cdh  19  161936440
  161960377   23937
  af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
  af_servers  spm_cluster_topic-new-cdh  20  248308642
  248340350   31708
  af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
  af_servers  spm_cluster_topic-new-cdh  21  259901355
  259934911   33556
  af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
  af_servers  spm_cluster_topic-new-cdh  22  205274547
  205296950   22403
  af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
  af_servers  spm_cluster_topic-new-cdh  23  167756654
  167780028   23374
  af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
  af_servers  spm_cluster_topic-new-cdh  24  357517989
  357574627   56638
  af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
  af_servers  spm_cluster_topic-new-cdh  25  194313232
  194338154   24922
  af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
 
  You can see Lag is NOT == 0 for any of the partitions!
 
  And yet, look what JMX is showing now -- all Lag for all partitions is
  supposedly 0. Always:
 
 
 
 kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=17
Value java.lang.Object = 0
 
 
 kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=18
Value java.lang.Object = 0
 
 
 kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=18
Value java.lang.Object = 0
 
 
 kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=18
Value java.lang.Object = 0
 
 
 kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=18
Value java.lang.Object = 0
 
 
 kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=19
Value java.lang.Object = 0
 
 
 kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=19
Value java.lang.Object = 0
 
 
 kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=19
Value java.lang.Object = 0
 
 
 kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=19
Value java.lang.Object = 0
 
 
 kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=20
Value java.lang.Object = 0
 
 
 kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=20
Value java.lang.Object = 0
 
 
 kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=20
Value java.lang.Object = 0
 
 
 kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=20
Value java.lang.Object = 0
 
 
 

Re: Consumer lag lies - orphaned offsets?

2015-06-04 Thread Jason Rosenberg
I assume you are looking at a 'MaxLag' metric, which reports the worst case
lag over a set of partitions.  Are you consuming multiple partitions, and
maybe one of them is stuck?

On Tue, Jun 2, 2015 at 4:00 PM, Otis Gospodnetic otis.gospodne...@gmail.com
 wrote:

 Hi,

 I've noticed that when we restart our Kafka consumers our consumer lag
 metric sometimes looks weird.

 Here's an example: https://apps.sematext.com/spm-reports/s/0Hq5zNb4hH

 You can see lag go up around 15:00, when some consumers were restarted.
 The weird thing is that the lag remains flat!
 How could it remain flat if consumers are running? (they have enough juice
 to catch up!)

 What I think is happening is this:
 1) consumers are initially not really lagging
 2) consumers get stopped
 3) lag grows
 4) consumers get started again
 5) something shifts around...not sure what...
 6) consumers start consuming, and there is actually no lag, but the offsets
 written to ZK sometime during 3) don't get updated because after restart
 consumers are reading from somewhere else, not from partition(s) whose lag
 and offset delta jumped during 3)

 Oh, and:
 7) Kafka JMX still exposes all offsets, even those for partitions that are
 no longer being read, so the consumer lag metric remains constant/flat,
 even though consumers are actually not lagging on partitions from which
 they are now consuming.

 What bugs me is 7), because reading lag info from JMX looks like it's
 lying.

 Does this sound crazy or reasonable?

 If anyone has any comments/advice/suggestions for what one can do about
 this, I'm all ears!

 Thanks,
 Otis
 --
 Monitoring * Alerting * Anomaly Detection * Centralized Log Management
 Solr & Elasticsearch Support * http://sematext.com/



Re: Consumer lag lies - orphaned offsets?

2015-06-04 Thread Otis Gospodnetic
Hi Jason,

(note: Kafka 0.8.2 GA)
Got some new info below!  Could be a Kafka metrics bug

On Thu, Jun 4, 2015 at 2:11 PM, Jason Rosenberg j...@squareup.com wrote:

 I assume you are looking at a 'MaxLag' metric, which reports the worst case
 lag over a set of partitions.


No, we're looking at MBeans that look like this one:

kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=10
  Value java.lang.Object = 0
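
For reference, the same gauge can also be read programmatically over
remote JMX; a minimal sketch is below. The object name is the one quoted
above; the host and port ("consumer-host:9999") are assumptions, not from
this thread, and require the consumer JVM to be started with the usual
com.sun.management.jmxremote.* options.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ReadConsumerLag {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://consumer-host:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName lagBean = new ObjectName(
                    "kafka.server:type=FetcherLagMetrics,name=ConsumerLag,"
                    + "clientId=af_servers,"
                    + "topic=spm_cluster_free_system_topic-new-cdh,partition=10");
            Object lag = mbs.getAttribute(lagBean, "Value");
            System.out.println("ConsumerLag = " + lag);
        }
    }
}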


 Are you consuming multiple partitions, and maybe one of them is stuck?


Don't think so...  Maybe what we are seeing is a Kafka bug.

Here is what we just discovered:

Dumped JMX on the consumer and we see this:

kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=24
  Value java.lang.Object = 81560

This 81560 is also what we see in SPM - see the flat line here:
https://apps.sematext.com/spm-reports/s/eQ9WhLegW9
(you can hover over a datapoint on that 81K line to see server name, topic,
and partition)

This 81560 is just not going down.  If I look at JMX in 5 minutes, it will
show the same value - the ConsumerLag of 81560!

BUT, this gives different numbers:

 /usr/lib/kafka_2.8.0-0.8.1.1/bin/kafka-run-class.sh
kafka.tools.ConsumerOffsetChecker --zkconnect localhost:2181 --group
af_servers | grep spm_cluster_topic-new-cdh

af_servers  spm_cluster_topic-new-cdh  24  *355209634*
*355249858*   40224
af_servers_spm-afs-6.prod.sematext-1433430424202-e366dfdf-0

The delta between the bolded numbers is NOT 81560.  And if I run this
command N times the delta keeps going down, because the consumer is
catching up.  Just like you'd expect.

But the JMX number remains constant -- could this be a Kafka metrics/JMX
bug?

Thanks,
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/




 On Tue, Jun 2, 2015 at 4:00 PM, Otis Gospodnetic 
 otis.gospodne...@gmail.com
  wrote:

  Hi,
 
  I've noticed that when we restart our Kafka consumers our consumer lag
  metric sometimes looks weird.
 
  Here's an example: https://apps.sematext.com/spm-reports/s/0Hq5zNb4hH
 
  You can see lag go up around 15:00, when some consumers were restarted.
  The weird thing is that the lag remains flat!
  How could it remain flat if consumers are running? (they have enough
 juice
  to catch up!)
 
  What I think is happening is this:
  1) consumers are initially not really lagging
  2) consumers get stopped
  3) lag grows
  4) consumers get started again
  5) something shifts around...not sure what...
  6) consumers start consuming, and there is actually no lag, but the
 offsets
  written to ZK sometime during 3) don't get updated because after restart
  consumers are reading from somewhere else, not from partition(s) whose
 lag
  and offset delta jumped during 3)
 
  Oh, and:
  7) Kafka JMX still exposes all offsets, even those for partitions that
 are
  no longer being read, so the consumer lag metric remains constant/flat,
  even though consumers are actually not lagging on partitions from which
  they are now consuming.
 
  What bugs me is 7), because reading lag info from JMX looks like it's
  lying.
 
  Does this sound crazy or reasonable?
 
  If anyone has any comments/advice/suggestions for what one can do about
  this, I'm all ears!
 
  Thanks,
  Otis
  --
  Monitoring * Alerting * Anomaly Detection * Centralized Log Management
  Solr & Elasticsearch Support * http://sematext.com/
 



Re: Consumer lag lies - orphaned offsets?

2015-06-04 Thread Otis Gospodnetic
Hi,

Here's something potentially useful.

1) Before: https://apps.sematext.com/spm-reports/s/eQ9WhLegW9 - the flat
Lag situation

2) I restarted the consumer whose lag is shown in the above graph

3) After restart: https://apps.sematext.com/spm-reports/s/4YGkcUP9ms - NO
lag at all!?

So that 81560 Lag value that was stuck in JMX is gone.  Went down to 0.
Kind of makes sense - the whole consumer was restarted, consumer/java
process was restarted, everything that was in JMX got reset, and if there
is truly no consumer lag it makes sense that the values in JMX are 0.

HOWEVER, is the Lag *really* always *exactly* 0?  No way.  Look what Offset
Checker shows for this one consumer:

af_servers  spm_cluster_topic-new-cdh  18  220551962
220586078   34116
af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
af_servers  spm_cluster_topic-new-cdh  19  161936440
161960377   23937
af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
af_servers  spm_cluster_topic-new-cdh  20  248308642
248340350   31708
af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
af_servers  spm_cluster_topic-new-cdh  21  259901355
259934911   33556
af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
af_servers  spm_cluster_topic-new-cdh  22  205274547
205296950   22403
af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
af_servers  spm_cluster_topic-new-cdh  23  167756654
167780028   23374
af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
af_servers  spm_cluster_topic-new-cdh  24  357517989
357574627   56638
af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
af_servers  spm_cluster_topic-new-cdh  25  194313232
194338154   24922
af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0

You can see Lag is NOT == 0 for any of the partitions!

And yet, look what JMX is showing now -- all Lag for all partitions is
supposedly 0. Always:

kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=17
  Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=18
  Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=18
  Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=18
  Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=18
  Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=19
  Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=19
  Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=19
  Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=19
  Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=20
  Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=20
  Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=20
  Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=20
  Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=21
  Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=21
  Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=21
  Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=21
  Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=22
  Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=22
  Value java.lang.Object = 0