Re: Consumer lag lies - orphaned offsets?
On Fri, Jun 05, 2015 at 12:53:00AM -0400, Otis Gospodnetić wrote:
> Hi Joel,
>
> On Thu, Jun 4, 2015 at 8:52 PM, Joel Koshy jjkosh...@gmail.com wrote:
>> Hi Otis,
>>
>> Yes, this is a limitation in the old consumer, i.e., a number of
>> per-topic/partition mbeans remain registered even after a rebalance.
>> Those need to be de-registered. So if you stop consuming from some
>> partition after a rebalance, that lag mbean currently remains, which is
>> why it stays flat. This is a known issue.
>
> I see. Is / should this be considered a bug? Something worth fixing for
> 0.8.3?

Yes, I would call it a bug, but it hasn't been a high priority so far, mainly because (I think) most users monitor lag with committed offsets. This is what we do at LinkedIn, for instance, as Todd mentioned in his reply.

> Also, you say this is a limitation of the old consumer. Does that mean
> that this problem goes away completely if one uses the new consumer?

This is sort of n/a at the moment, as per-partition lag has not been added yet to the new consumer. It does have the equivalent of max-lag. If we add per-partition lag sensors, we would need to be able to remove those sensors, if applicable, after a rebalance.

>> On the restart, the lag goes down to zero because - well, the mbeans get
>> recreated and the consumer starts fetching. If the fetch request reads up
>> to the end of the log then the mbean will report zero. Your actual
>> committed offset may be behind, though, in which case your true lag is > 0.
>> The lag mbeans are useful, but have a number of limitations - they depend
>> on active fetches being in progress;
>
> What do you mean by this?

If the fetcher threads die for any reason then fetches stop, and the consumer continues to report lag off the last fetched offset and the last reported log end offset. So it will stay flat when it should be increasing (since the log end offset on the broker is increasing if producers are still sending to that partition). Also, the old consumer pre-fetches chunks and buffers these internally. If the chunk queue is full, fetches stop; and if the consumer is extremely slow in actually processing the messages off each chunk, then lag can stay flat (perhaps even at zero) until the next chunk, while the consumer is iterating over messages from the previous chunk.

>> it also does not exactly correspond with your actual processed (and
>> committed) offset.
>
> Right. Though it should be updated in near real-time, so it will
> approximately match the reality, no?

Yes - I think it is fair to say that in most cases the lag mbeans should be accurate within a small delta of the true lag. Although we are trying to avoid further non-critical development on the old consumer, it is convenient to have these mbeans. So I think it may be worth fixing this issue (i.e., de-registering the mbeans on a rebalance). Can you file a jira for this?

Thanks,

Joel

> [...]
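The fix Joel describes - de-registering orphaned ConsumerLag mbeans on a rebalance - might look roughly like the following if done as a workaround inside the consumer's own JVM. This is only an illustrative sketch, not Kafka code: the real fix would live in Kafka's metrics layer, and the ownedTopicPartitions set and its "topic-partition" key format are assumptions; the ObjectName pattern mirrors the mbean names quoted in this thread.

import java.lang.management.ManagementFactory;
import java.util.Set;

import javax.management.MBeanServer;
import javax.management.ObjectName;

public class StaleLagMbeanCleaner {

    // Unregister ConsumerLag mbeans for topic-partitions this consumer no
    // longer owns after a rebalance. ownedTopicPartitions is assumed to hold
    // "topic-partition" keys for the consumer's current assignment.
    public static void cleanUp(String clientId, Set<String> ownedTopicPartitions)
            throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName pattern = new ObjectName(
            "kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId="
                + clientId + ",topic=*,partition=*");
        for (ObjectName name : server.queryNames(pattern, null)) {
            String key = name.getKeyProperty("topic") + "-"
                + name.getKeyProperty("partition");
            if (!ownedTopicPartitions.contains(key)) {
                // Orphaned after the rebalance: this partition moved away,
                // so its gauge would otherwise stay flat forever.
                server.unregisterMBean(name);
            }
        }
    }
}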
Re: Consumer lag lies - orphaned offsets?
On Thu, Jun 4, 2015 at 1:55 PM, Otis Gospodnetić otis.gospodne...@gmail.com wrote:
> Hi,
>
> On Thu, Jun 4, 2015 at 4:26 PM, Scott Reynolds sreyno...@twilio.com wrote:
>> I believe the JMX metrics reflect the consumer PRIOR to committing
>> offsets to Kafka / Zookeeper. But when you query from the command line
>> using the kafka tools, you are just getting the committed offsets.
>
> Even if that were the case, and maybe it is, it doesn't explain why the
> ConsumerLag in JMX often remains *completely constant*, forever... until
> the consumer is restarted. You see what I mean?

Agreed, I was merely pointing out both are lying.

> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr Elasticsearch Support * http://sematext.com/
>
> [...]
Re: Consumer lag lies - orphaned offsets?
Hi Otis,

Yes, this is a limitation in the old consumer, i.e., a number of per-topic/partition mbeans remain registered even after a rebalance. Those need to be de-registered. So if you stop consuming from some partition after a rebalance, that lag mbean currently remains, which is why it stays flat. This is a known issue.

On the restart, the lag goes down to zero because - well, the mbeans get recreated and the consumer starts fetching. If the fetch request reads up to the end of the log then the mbean will report zero. Your actual committed offset may be behind, though, in which case your true lag is > 0.

The lag mbeans are useful, but have a number of limitations - they depend on active fetches being in progress; they also do not exactly correspond with your actual processed (and committed) offset. The most reliable way to monitor application lag is to use the committed offsets and the current log end offsets. Todd has been doing a lot of interesting work in making lag monitoring less painful and can comment more.

Joel

On Thu, Jun 04, 2015 at 04:55:44PM -0400, Otis Gospodnetić wrote:
> [...]
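Joel's recommendation - compare committed offsets against current log end offsets - can be sketched with the 0.8-era Java APIs. A minimal sketch, assuming Kafka 0.8.x with offsets stored in ZooKeeper (the old consumer's default), that localhost:9092 leads the partition, and reusing the group/topic/partition names from this thread; leader lookup and error handling are omitted:

import java.util.HashMap;
import java.util.Map;

import org.apache.zookeeper.ZooKeeper;

import kafka.api.PartitionOffsetRequestInfo;
import kafka.common.TopicAndPartition;
import kafka.javaapi.OffsetResponse;
import kafka.javaapi.consumer.SimpleConsumer;

public class TrueLagCheck {
    public static void main(String[] args) throws Exception {
        String group = "af_servers";
        String topic = "spm_cluster_topic-new-cdh";
        int partition = 24;

        // 1) Committed offset, as the old consumer stores it in ZooKeeper.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, null);
        String path = "/consumers/" + group + "/offsets/" + topic + "/" + partition;
        long committed = Long.parseLong(new String(zk.getData(path, false, null), "UTF-8"));
        zk.close();

        // 2) Current log end offset, via an OffsetRequest to the partition leader.
        SimpleConsumer consumer =
            new SimpleConsumer("localhost", 9092, 100000, 64 * 1024, "lag-check");
        Map<TopicAndPartition, PartitionOffsetRequestInfo> req =
            new HashMap<TopicAndPartition, PartitionOffsetRequestInfo>();
        req.put(new TopicAndPartition(topic, partition),
            new PartitionOffsetRequestInfo(kafka.api.OffsetRequest.LatestTime(), 1));
        OffsetResponse resp = consumer.getOffsetsBefore(new kafka.javaapi.OffsetRequest(
            req, kafka.api.OffsetRequest.CurrentVersion(), "lag-check"));
        long logEnd = resp.offsets(topic, partition)[0];
        consumer.close();

        // 3) True lag is the difference, independent of any fetcher state.
        System.out.println("true lag = " + (logEnd - committed));
    }
}

This is essentially what ConsumerOffsetChecker does per partition: its Lag column is the log end offset (logSize) minus the committed offset.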
Re: Consumer lag lies - orphaned offsets?
Hi,

On Thu, Jun 4, 2015 at 4:26 PM, Scott Reynolds sreyno...@twilio.com wrote:
> I believe the JMX metrics reflect the consumer PRIOR to committing offsets
> to Kafka / Zookeeper. But when you query from the command line using the
> kafka tools, you are just getting the committed offsets.

Even if that were the case, and maybe it is, it doesn't explain why the ConsumerLag in JMX often remains *completely constant*, forever... until the consumer is restarted. You see what I mean?

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr Elasticsearch Support * http://sematext.com/

On Thu, Jun 4, 2015 at 1:23 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:
> [...]
Re: Consumer lag lies - orphaned offsets?
I assume you are looking at a 'MaxLag' metric, which reports the worst-case lag over a set of partitions. Are you consuming multiple partitions, and maybe one of them is stuck?

On Tue, Jun 2, 2015 at 4:00 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:
> Hi,
>
> I've noticed that when we restart our Kafka consumers, our consumer lag
> metric sometimes looks weird. Here's an example:
> https://apps.sematext.com/spm-reports/s/0Hq5zNb4hH
>
> You can see lag go up around 15:00, when some consumers were restarted.
> The weird thing is that the lag remains flat! How could it remain flat if
> consumers are running? (they have enough juice to catch up!)
>
> What I think is happening is this:
> 1) consumers are initially not really lagging
> 2) consumers get stopped
> 3) lag grows
> 4) consumers get started again
> 5) something shifts around... not sure what...
> 6) consumers start consuming, and there is actually no lag, but the
> offsets written to ZK sometime during 3) don't get updated, because after
> restart consumers are reading from somewhere else, not from partition(s)
> whose lag and offset delta jumped during 3)
>
> Oh, and:
> 7) Kafka JMX still exposes all offsets, even those for partitions that are
> no longer being read, so the consumer lag metric remains constant/flat,
> even though consumers are actually not lagging on partitions from which
> they are now consuming.
>
> What bugs me is 7), because reading lag info from JMX looks like it's lying.
>
> Does this sound crazy or reasonable? If anyone has any
> comments/advice/suggestions for what one can do about this, I'm all ears!
>
> Thanks,
> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr Elasticsearch Support * http://sematext.com/
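For comparison with the per-partition ConsumerLag gauges discussed elsewhere in this thread, the old high-level consumer exposes the MaxLag metric Jason mentions as a single gauge. A minimal read of it over remote JMX; the object name follows the 0.8.x monitoring docs for the high-level consumer, but exact names are version-dependent, and the port and clientId are placeholders:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class MaxLagCheck {
    public static void main(String[] args) throws Exception {
        // Assumes the consumer JVM exposes remote JMX on port 9999.
        JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi"));
        try {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            // MaxLag gauge of the old high-level consumer: the maximum lag
            // across all partitions the fetchers are currently tracking.
            ObjectName maxLag = new ObjectName(
                "kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=af_servers");
            System.out.println("MaxLag = " + mbsc.getAttribute(maxLag, "Value"));
        } finally {
            connector.close();
        }
    }
}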
Re: Consumer lag lies - orphaned offsets?
Hi Jason,

(note: Kafka 0.8.2 GA)

Got some new info below! Could be a Kafka metrics bug.

On Thu, Jun 4, 2015 at 2:11 PM, Jason Rosenberg j...@squareup.com wrote:
> I assume you are looking at a 'MaxLag' metric, which reports the
> worst-case lag over a set of partitions.

No, we're looking at MBeans that look like this one:

kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=10 Value java.lang.Object = 0

> Are you consuming multiple partitions, and maybe one of them is stuck?

Don't think so... Maybe what we are seeing is a Kafka bug. Here is what we just discovered: we dumped JMX on the consumer and we see this:

kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=24 Value java.lang.Object = 81560

This 81560 is also what we see in SPM - see the flat line here: https://apps.sematext.com/spm-reports/s/eQ9WhLegW9 (you can hover over a datapoint on that 81K line to see the server name, topic, and partition). This 81560 is just not going down. If I look at JMX in 5 minutes, it will show the same value - a ConsumerLag of 81560!

BUT, this gives different numbers:

/usr/lib/kafka_2.8.0-0.8.1.1/bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zkconnect localhost:2181 --group af_servers | grep spm_cluster_topic-new-cdh

af_servers spm_cluster_topic-new-cdh 24 *355209634* *355249858* 40224 af_servers_spm-afs-6.prod.sematext-1433430424202-e366dfdf-0

The delta between the bolded numbers is NOT 81560. And if I run this command N times, the delta keeps going down, because the consumer is catching up - just like you'd expect. But the JMX number remains constant == could this be a Kafka metrics/JMX bug?

Thanks,
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr Elasticsearch Support * http://sematext.com/

On Tue, Jun 2, 2015 at 4:00 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:
> [...]
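For reference, the Lag column that ConsumerOffsetChecker prints is simply the log end offset (logSize) minus the committed offset, so the delta Otis computes above is:

355249858 - 355209634 = 40224

which matches the checker's own Lag column for that partition and keeps shrinking on each run, while the JMX gauge stays pinned at 81560.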
Re: Consumer lag lies - orphaned offsets?
Hi,

Here's something potentially useful.

1) Before: https://apps.sematext.com/spm-reports/s/eQ9WhLegW9 - the flat Lag situation
2) I restarted the consumer whose lag is shown in the above graph
3) After restart: https://apps.sematext.com/spm-reports/s/4YGkcUP9ms - NO lag at all!?

So that 81560 Lag value that was stuck in JMX is gone. Went down to 0. Kind of makes sense - the whole consumer/java process was restarted, everything that was in JMX got reset, and if there is truly no consumer lag it makes sense that the values in JMX are 0.

HOWEVER, is the Lag *really* always *exactly* 0? No way. Look what Offset Checker shows for this one consumer (columns: Group, Topic, Pid, Offset, logSize, Lag, Owner):

af_servers spm_cluster_topic-new-cdh 18 220551962 220586078 34116 af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
af_servers spm_cluster_topic-new-cdh 19 161936440 161960377 23937 af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
af_servers spm_cluster_topic-new-cdh 20 248308642 248340350 31708 af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
af_servers spm_cluster_topic-new-cdh 21 259901355 259934911 33556 af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
af_servers spm_cluster_topic-new-cdh 22 205274547 205296950 22403 af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
af_servers spm_cluster_topic-new-cdh 23 167756654 167780028 23374 af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
af_servers spm_cluster_topic-new-cdh 24 357517989 357574627 56638 af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
af_servers spm_cluster_topic-new-cdh 25 194313232 194338154 24922 af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0

You can see Lag is NOT == 0 for any of the partitions! And yet, look what JMX is showing now -- the Lag for all partitions is supposedly 0.
Always:

kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=17 Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=18 Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=18 Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=18 Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=18 Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=19 Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=19 Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=19 Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=19 Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=20 Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=20 Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=20 Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=20 Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=21 Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=21 Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=21 Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=21 Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=22 Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=22 Value java.lang.Object = 0
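A dump like the one above can also be pulled programmatically instead of from a JMX console. A minimal sketch, assuming the consumer JVM was started with remote JMX enabled on port 9999 (the URL and port are placeholders; the clientId and ObjectName pattern come from the mbean names above):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ConsumerLagDump {
    public static void main(String[] args) throws Exception {
        JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi"));
        try {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            // Same naming scheme as the mbeans dumped above.
            ObjectName pattern = new ObjectName(
                "kafka.server:type=FetcherLagMetrics,name=ConsumerLag,"
                    + "clientId=af_servers,topic=*,partition=*");
            for (ObjectName name : mbsc.queryNames(pattern, null)) {
                System.out.println(name + " = " + mbsc.getAttribute(name, "Value"));
            }
        } finally {
            connector.close();
        }
    }
}

Polling this periodically and alerting when a gauge stays bit-for-bit identical across samples, while the committed-offset delta keeps changing, would have flagged the stuck 81560 value discussed in this thread.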