I dug around some more and see that there is a lag gauge
(%s-%s-messages-behind-high-watermark) in KafkaSystemConsumerMetrics.  The
problem I'm seeing is that the gauge does not get updated very often,
especially when you want it the most.

Here's my test setup.  I created a job that processes a single message and
then sleeps for 5 seconds.  In another shell, I have another process loading
1,000 messages per second into the input topic.
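
The job is essentially just this (a minimal sketch; the class name is mine
and the real job does a bit more, but the behavior is a single-threaded
process() that blocks for 5 seconds per message):

  import org.apache.samza.system.IncomingMessageEnvelope;
  import org.apache.samza.task.MessageCollector;
  import org.apache.samza.task.StreamTask;
  import org.apache.samza.task.TaskCoordinator;

  // Deliberately slow consumer: handle one message, then block for 5s.
  public class SlowTask implements StreamTask {
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) throws Exception {
      Thread.sleep(5000);  // simulate 5 seconds of work per message
    }
  }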

Using the Kafka GetOffsetShell tool, I can see an ever-growing lag between
the latest offset and the checkpointed offset.  However, the
%s-%s-messages-behind-high-watermark metric in JMX remains unchanged for
long periods of time (tens of minutes at least).
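
(For reference, I'm getting the latest offsets with something like the
command below; the broker address and topic name are placeholders.
--time -1 asks for the latest offset; -2 would return the earliest.)

  bin/kafka-run-class.sh kafka.tools.GetOffsetShell \
    --broker-list localhost:9092 \
    --topic my-input-topic \
    --time -1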

I think what's happening is that the BrokerProxy only updates the high
watermark when a consumer is ready for more messages.  When the job is this
slow, that rarely happens, so the metric doesn't get updated.
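
A toy simulation makes the effect concrete (this is not Samza code; the
fetch batch size is a made-up assumption, and the producer/consumer rates
match my test setup).  Because the gauge is refreshed only when a fetch
actually happens, and each fetch returns a batch that takes the slow task
minutes to drain, the gauge sits still while the true lag grows every
second:

  // Toy simulation of the suspected behavior (not Samza source).
  public class StaleGaugeDemo {
    public static void main(String[] args) {
      final int FETCH_BATCH = 500;  // messages per fetch (assumed)
      long highWatermark = 0, lastReadOffset = 0, buffered = 0, gauge = 0;
      for (int second = 0; second < 3600; second++) {
        highWatermark += 1000;                 // producer: 1000 msg/s
        if (second % 5 == 0) {                 // task finishes one msg per 5s
          if (buffered == 0) {                 // buffer drained -> fetch again
            buffered = FETCH_BATCH;
            gauge = highWatermark - lastReadOffset;  // gauge updated ONLY here
          }
          buffered--;
          lastReadOffset++;
        }
        if (second % 600 == 0)                 // report every 10 minutes
          System.out.printf("t=%4ds  true lag=%7d  gauge=%7d%n",
              second, highWatermark - lastReadOffset, gauge);
      }
    }
  }

If that's right, the gauge reports the lag as of the last fetch, not the
current lag, which is exactly the staleness I'm seeing in JMX.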

On Sat, Jan 3, 2015 at 8:31 PM, Roger Hoover <[email protected]> wrote:

> Hi,
>
> I was wondering what the best practice is for monitoring lag in Samza jobs.
> It looks like only the committed offsets are stored in the
> TaskInstanceMetrics so it seems there are two options for computing lag.
>
>  a) have a process consume the metrics topic.  When it receives a message,
> it would query the Kafka brokers for the latest offset and compute lag.  The
> problem is that this process will not deal with old messages correctly (as
> the brokers have moved on) if it ever falls behind (because of maintenance
> or bugs).
>
> b) poll metrics via JMX and compute lag at the same time.  The problem is
> that the polling process has to know how and have permission to connect to
> all Samza containers.  With this approach, we lose the advantage of the
> central metrics topic.
>
> Any suggestions?  Would it be too much overhead to fetch and store the
> latest offset for each SSP during checkpoint?  I think this would allow a
> lag metric to be exposed for each SSP.
>
> Thanks,
>
> Roger
>
