Yeah, I've written dissertations at this point on why MaxLag is flawed. We also used to use the offset checker tool, and later something similar that was a little easier to slot into our monitoring systems. Problems with all of these are why I wrote Burrow (https://github.com/linkedin/Burrow).
For more details, you can also check out my blog post on the release:
https://engineering.linkedin.com/apache-kafka/burrow-kafka-consumer-monitoring-reinvented

-Todd

On Wednesday, July 6, 2016, Tom Dearman <tom.dear...@gmail.com> wrote:

> I recently had a problem on my production which I believe was a
> manifestation of the issue kafka-2978 (Topic partition is not sometimes
> consumed after rebalancing of consumer group), this is fixed in 0.9.0.1 and
> we will upgrade our client soon. However, it made me realise that I didn’t
> have any monitoring set up on this. The only thing I can find as a metric
> is the
> kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=([-.\w]+),
> which, if I understand correctly, is the max lag of any partition that that
> particular consumer is consuming.
> 1. If I had been monitoring this, and if my consumer was suffering from
> the issue in kafka-2978, would I actually have been alerted, i.e. since the
> consumer would think it is consuming correctly would it not have updated
> the metric.
> 2. There is another way to see offset lag using the command
> /usr/bin/kafka-consumer-groups --new-consumer --bootstrap-server
> 10.10.1.61:9092 --describe --group consumer_group_name and parsing the
> response. Is it safe or advisable to do this? I like the fact that it
> tells me each partition lag, although it is also not available if no
> consumer from the group is currently consuming.
> 3. Is there a better way of doing this?

--
*Todd Palino*
Staff Site Reliability Engineer
Data Infrastructure Streaming

linkedin.com/in/toddpalino