Thanks, Chris. Your suggestion sounds like a great idea to me. Here's the JIRA (including your comment). https://issues.apache.org/jira/browse/SAMZA-503
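For what it's worth, here's a rough sketch of the kind of minimal fetch I
think you're describing, written directly against the 0.8 SimpleConsumer API
(the broker address, topic, partition, and checkpointed offset are
placeholders, not the actual BrokerProxy wiring):

import kafka.api.FetchRequestBuilder
import kafka.consumer.SimpleConsumer

object LagCheck {
  def main(args: Array[String]): Unit = {
    // Placeholders: in the real job these come from config and checkpoints.
    val topic = "my-input-topic"
    val partition = 0
    val checkpointedOffset = 0L

    val consumer = new SimpleConsumer("localhost", 9092, 10000, 64 * 1024, "lag-check")
    try {
      // A fetch with fetchSize = 0 returns no messages, but the response
      // should still carry the partition's high watermark, which is all
      // we need to update the gauge without consuming anything.
      val request = new FetchRequestBuilder()
        .clientId("lag-check")
        .addFetch(topic, partition, checkpointedOffset, 0)
        .build()
      val response = consumer.fetch(request)
      val highWatermark = response.highWatermark(topic, partition)
      println("lag = " + (highWatermark - checkpointedOffset))
    } finally {
      consumer.close()
    }
  }
}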
On Mon, Jan 5, 2015 at 10:54 AM, Chris Riccomini <[email protected]> wrote:

> Hey Roger,
>
> Using that metric is the right way to check your lag.
>
> The issue you raise about not updating the offset seems like something
> that should be fixed. Can you open a JIRA for this? The fix should be
> fairly straightforward. In the BrokerProxy, we currently just sleep if we
> don't need to fetch messages. We could do a fetch simply to get the latest
> offset, update the gauges, then sleep. This will still likely have a 1-2s
> latency, but I think this is reasonable.
>
> The other thing to consider here is the reporting interval of the metrics
> reporter. In the case of the MetricsSnapshotReporter, it appears that we
> report metrics at a 60s interval.
>
> // TODO could make this configurable.
> executor.scheduleWithFixedDelay(this, 0, 60, TimeUnit.SECONDS)
>
> In the case of the JmxReporter, there is no interval--it's always up to
> date.
>
>
> Cheers,
> Chris
>
> On 1/3/15 10:54 PM, "Roger Hoover" <[email protected]> wrote:
>
> >I dug around some more and see that there is a lag gauge
> >(%s-%s-messages-behind-high-watermark) in KafkaSystemConsumerMetrics. The
> >problem I'm seeing is that the gauge does not get updated very often,
> >especially when you want it the most.
> >
> >Here's my test setup. I created a job that processes a single message and
> >then sleeps 5 seconds. In another shell, I have a process loading 1000
> >messages every second into the input topic.
> >
> >Using the Kafka GetOffsetShell tool, I can see an ever-growing lag between
> >the latest offset and the checkpointed offset. However, the
> >%s-%s-messages-behind-high-watermark metric in JMX remains unchanged for
> >long periods of time (tens of minutes at least).
> >
> >I think what's happening is that the BrokerProxy only updates the high
> >watermark when a consumer is ready for more messages. When the job is this
> >slow, that rarely happens, so the metric doesn't get updated.
> >
> >On Sat, Jan 3, 2015 at 8:31 PM, Roger Hoover <[email protected]>
> >wrote:
> >
> >> Hi,
> >>
> >> I was wondering what the best practice is for monitoring lag in Samza
> >> jobs. It looks like only the committed offsets are stored in the
> >> TaskInstanceMetrics, so it seems there are two options for computing
> >> lag.
> >>
> >> a) Have a process consume the metrics topic. When it receives a
> >> message, it would query the Kafka brokers for the latest offset and
> >> compute lag. The problem is that this process will not deal with old
> >> messages correctly (as the brokers have moved on) if it ever falls
> >> behind (because of maintenance or bugs).
> >>
> >> b) Poll metrics via JMX and compute lag at the same time. The problem
> >> is that the polling process has to know how, and have permission, to
> >> connect to all Samza containers. With this approach, we lose the
> >> advantage of the central metrics topic.
> >>
> >> Any suggestions? Would it be too much overhead to fetch and store the
> >> latest offset for each SSP during checkpoint? I think this would allow
> >> a lag metric to be exposed for each SSP.
> >>
> >> Thanks,
> >>
> >> Roger
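(For anyone reproducing the test setup above: the GetOffsetShell check
mentioned is along these lines, where --time -1 asks for the latest offset
and the broker and topic are placeholders:

bin/kafka-run-class.sh kafka.tools.GetOffsetShell \
  --broker-list localhost:9092 --topic my-input-topic --time -1

Comparing its output against the job's checkpointed offsets gives the lag.)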
