Hey Roger,
Using that metric is the right way to check your lag.
The issue you raise about not updating the offset seems like something
that should be fixed. Can you open a JIRA for this? The fix should be
fairly straightforward. In the BrokerProxy, we currently just sleep if we
don't need to fetch messages. We could do a fetch simply to get the latest
offset, update the gauges, then sleep. This will still likely have a 1-2s
latency, but I think this is reasonable.
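
A rough sketch of the idea above, in Java for illustration only. None of
these names come from Samza: Gauge, lag, and idleIteration are hypothetical
stand-ins, and the broker fetch is reduced to a LongSupplier that is assumed
to return the latest offset for the partition.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

public class LagGaugeSketch {
    /** Hypothetical stand-in for the messages-behind-high-watermark gauge. */
    public static final class Gauge {
        private final AtomicLong value = new AtomicLong();
        public void set(long v) { value.set(v); }
        public long get() { return value.get(); }
    }

    /** Lag is the latest broker offset minus the consumer's next offset, floored at zero. */
    public static long lag(long latestOffset, long nextOffset) {
        return Math.max(0L, latestOffset - nextOffset);
    }

    /**
     * One idle iteration: instead of only sleeping when no messages are
     * needed, ask the broker for the latest offset, refresh the gauge,
     * and then sleep.
     */
    public static void idleIteration(LongSupplier latestOffset, long nextOffset,
                                     Gauge gauge, long sleepMs) {
        gauge.set(lag(latestOffset.getAsLong(), nextOffset));
        try {
            Thread.sleep(sleepMs);
        } catch (InterruptedException e) {
            // Preserve the interrupt so the owning thread can shut down.
            Thread.currentThread().interrupt();
        }
    }
}
```

With this shape the gauge is at most one sleep interval stale, even when the
consumer has not asked for more messages.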
The other thing to consider here is the reporting interval of the metrics
reporter. In the case of the MetricsSnapshotReporter, it appears that we
report metrics at a 60s interval.
    // TODO could make this configurable.
    executor.scheduleWithFixedDelay(this, 0, 60, TimeUnit.SECONDS)
In the case of the JmxReporter, there is no interval--it's always up to
date.
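
For the TODO on the reporting interval, a minimal sketch of what a
configurable interval could look like, again in Java and under stated
assumptions: the config key "metrics.reporter.interval.seconds" and the
helper names are hypothetical, not existing Samza settings. Only
scheduleWithFixedDelay is the real java.util.concurrent call.

```java
import java.util.Map;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class ConfigurableReporterSketch {
    /** Matches the 60s interval that is currently hard-coded. */
    public static final long DEFAULT_INTERVAL_SECONDS = 60L;

    /** Hypothetical config key; not an existing Samza property. */
    public static final String INTERVAL_KEY = "metrics.reporter.interval.seconds";

    /** Reads the interval from config, falling back to the default when unset. */
    public static long intervalSeconds(Map<String, String> config) {
        String raw = config.get(INTERVAL_KEY);
        if (raw == null) {
            return DEFAULT_INTERVAL_SECONDS;
        }
        long seconds = Long.parseLong(raw.trim());
        if (seconds <= 0) {
            throw new IllegalArgumentException(INTERVAL_KEY + " must be positive: " + raw);
        }
        return seconds;
    }

    /** Schedules the reporter immediately, then repeats with the configured delay. */
    public static ScheduledFuture<?> start(ScheduledExecutorService executor,
                                           Runnable reporter,
                                           Map<String, String> config) {
        return executor.scheduleWithFixedDelay(
                reporter, 0, intervalSeconds(config), TimeUnit.SECONDS);
    }
}
```

A shorter interval would make the lag value in the metrics topic fresher, at
the cost of more messages written to that topic.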
Cheers,
Chris
On 1/3/15 10:54 PM, "Roger Hoover" <[email protected]> wrote:
>I dug around some more and see that there is a lag gauge
>(%s-%s-messages-behind-high-watermark) in KafkaSystemConsumerMetrics. The
>problem I'm seeing is that the gauge does not get updated very often,
>especially when you want it the most.
>
>Here's my test setup. I created a job that processes a single message and
>then sleeps 5 seconds. In another shell, I have another process loading 1000
>messages every second to the input topic.
>
>Using the Kafka GetOffsetShell tool, I can see an ever growing lag between
>the latest offset and the checkpointed offset. However, the
>%s-%s-messages-behind-high-watermark metric in JMX remains unchanged for
>long periods of time (10s of minutes at least).
>
>I think what's happening is that the BrokerProxy only updates the high
>watermark when a consumer is ready for more messages. When the job is this
>slow, that rarely happens, so the metric doesn't get updated.
>
>On Sat, Jan 3, 2015 at 8:31 PM, Roger Hoover <[email protected]>
>wrote:
>
>> Hi,
>>
>> I was wondering what the best practice is for monitoring lag in Samza jobs.
>> It looks like only the committed offsets are stored in the
>> TaskInstanceMetrics, so it seems there are two options for computing lag.
>>
>> a) have a process consume the metrics topic. When it receives a message,
>> it would query the Kafka brokers for the latest offset and compute lag. The
>> problem is that this process will not deal with old messages correctly (as
>> the brokers have moved on) if it ever falls behind (because of maintenance
>> or bugs).
>>
>> b) poll metrics via JMX and compute lag at the same time. The problem is
>> that the polling process has to know how to connect to all Samza
>> containers and have permission to do so. With this approach, we lose the
>> advantage of the central metrics topic.
>>
>> Any suggestions? Would it be too much overhead to fetch and store the
>> latest offset for each SSP during checkpoint? I think this would allow a
>> lag metric to be exposed for each SSP.
>>
>> Thanks,
>>
>> Roger
>>