I dug around some more and see that there is a lag gauge (%s-%s-messages-behind-high-watermark) in KafkaSystemConsumerMetrics. The problem I'm seeing is that the gauge does not get updated very often, especially when you want it the most.
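To illustrate why such a gauge can go stale, here is a minimal sketch (hypothetical class and method names, not Samza's actual BrokerProxy or metrics code) of a lag gauge that is only refreshed when the consumer fetches. When fetches are rare, the gauge keeps reporting a value from the last fetch:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: a lag gauge refreshed only on fetch.
class LagGaugeSketch {
    private long highWatermark = 0;   // broker-side latest offset (simulated)
    private long nextOffset = 0;      // consumer's next offset to read
    private final AtomicLong messagesBehind = new AtomicLong(0);

    void produce(long n) {
        // New messages arrive at the broker; the gauge is NOT touched here.
        highWatermark += n;
    }

    void fetch() {
        // The gauge is refreshed only when the consumer asks for more
        // messages; a slow job rarely reaches this point.
        messagesBehind.set(highWatermark - nextOffset);
        nextOffset = highWatermark; // consume everything fetched
    }

    long gauge() { return messagesBehind.get(); }
}
```

Between fetches, `produce` can grow the real lag arbitrarily while `gauge()` keeps returning the last value set in `fetch()`, which matches the JMX behavior described above.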
Here's my test setup. I created a job that processes a single message and then sleeps for 5 seconds. In another shell, I have another process loading 1,000 messages every second into the input topic. Using the Kafka GetOffsetShell tool, I can see an ever-growing lag between the latest offset and the checkpointed offset. However, the %s-%s-messages-behind-high-watermark metric in JMX remains unchanged for long periods of time (tens of minutes at least). I think what's happening is that the BrokerProxy only updates the high watermark when a consumer is ready for more messages. When the job is this slow, that rarely happens, so the metric doesn't get updated.

On Sat, Jan 3, 2015 at 8:31 PM, Roger Hoover <[email protected]> wrote:

> Hi,
>
> I was wondering what the best practice is for monitoring lag in Samza
> jobs. It looks like only the committed offsets are stored in the
> TaskInstanceMetrics, so it seems there are two options for computing lag.
>
> a) Have a process consume the metrics topic. When it receives a message,
> it would query the Kafka brokers for the latest offset and compute the
> lag. The problem is that this process will not deal with old messages
> correctly (as the brokers have moved on) if it ever falls behind (because
> of maintenance or bugs).
>
> b) Poll metrics via JMX and compute lag at the same time. The problem is
> that the polling process has to know how, and have permission, to connect
> to all Samza containers. With this approach, we lose the advantage of the
> central metrics topic.
>
> Any suggestions? Would it be too much overhead to fetch and store the
> latest offset for each SSP during checkpoint? I think this would allow a
> lag metric to be exposed for each SSP.
>
> Thanks,
>
> Roger
>
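Either option (a) or (b) boils down to the same arithmetic once the two offset snapshots are in hand: for each system-stream-partition (SSP), lag is the brokers' latest offset minus the checkpointed offset. A minimal sketch of that calculation (the string SSP keys and map-based inputs are assumptions for illustration, not Samza's actual API):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: compute messages-behind lag per SSP from two
// offset snapshots, one queried from the brokers and one read from
// checkpoints (or the metrics topic).
class LagCalculator {
    static Map<String, Long> computeLag(Map<String, Long> latestOffsets,
                                        Map<String, Long> checkpointedOffsets) {
        Map<String, Long> lag = new HashMap<>();
        for (Map.Entry<String, Long> e : latestOffsets.entrySet()) {
            long checkpointed =
                checkpointedOffsets.getOrDefault(e.getKey(), 0L);
            // Clamp at zero: the two snapshots are not taken atomically,
            // so the checkpoint can briefly race ahead of the latest-offset
            // reading.
            lag.put(e.getKey(), Math.max(0L, e.getValue() - checkpointed));
        }
        return lag;
    }
}
```

The clamp is the main subtlety: because the latest offsets and the checkpointed offsets are fetched at different moments, a naive subtraction can go slightly negative, which option (a)'s "old messages" problem makes worse as the gap between snapshots grows.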
