Thanks, Chris. Your suggestion sounds like a great idea to me. Here's the JIRA (including your comment). https://issues.apache.org/jira/browse/SAMZA-503
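For what it's worth, here's a rough sketch of the kind of minimal fetch I
think you're describing, written directly against the 0.8 SimpleConsumer API
(the broker address, topic, partition, and checkpointed offset are
placeholders, not the actual BrokerProxy wiring):

import kafka.api.FetchRequestBuilder
import kafka.consumer.SimpleConsumer

object LagCheck {
  def main(args: Array[String]): Unit = {
    // Placeholders: in the real job these come from config and checkpoints.
    val topic = "my-input-topic"
    val partition = 0
    val checkpointedOffset = 0L

    val consumer = new SimpleConsumer("localhost", 9092, 10000, 64 * 1024, "lag-check")
    try {
      // A fetch with fetchSize = 0 returns no messages, but the response
      // should still carry the partition's high watermark, which is all
      // we need to update the gauge without consuming anything.
      val request = new FetchRequestBuilder()
        .clientId("lag-check")
        .addFetch(topic, partition, checkpointedOffset, 0)
        .build()
      val response = consumer.fetch(request)
      val highWatermark = response.highWatermark(topic, partition)
      println("lag = " + (highWatermark - checkpointedOffset))
    } finally {
      consumer.close()
    }
  }
}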
On Mon, Jan 5, 2015 at 10:54 AM, Chris Riccomini <[email protected]> wrote:

> Hey Roger,
>
> Using that metric is the right way to check your lag.
>
> The issue you raise about not updating the offset seems like something
> that should be fixed. Can you open a JIRA for this? The fix should be
> fairly straightforward. In the BrokerProxy, we currently just sleep if we
> don't need to fetch messages. We could do a fetch simply to get the latest
> offset, update the gauges, then sleep. This will still likely have a 1-2s
> latency, but I think this is reasonable.
>
> The other thing to consider here is the reporting interval of the metrics
> reporter. In the case of the MetricsSnapshotReporter, it appears that we
> report metrics at a 60s interval.
>
> // TODO could make this configurable.
> executor.scheduleWithFixedDelay(this, 0, 60, TimeUnit.SECONDS)
>
> In the case of the JmxReporter, there is no interval--it's always up to
> date.
>
>
> Cheers,
> Chris
>
> On 1/3/15 10:54 PM, "Roger Hoover" <[email protected]> wrote:
>
> >I dug around some more and see that there is a lag gauge
> >(%s-%s-messages-behind-high-watermark) in KafkaSystemConsumerMetrics. The
> >problem I'm seeing is that the gauge does not get updated very often,
> >especially when you want it the most.
> >
> >Here's my test setup. I created a job that processes a single message and
> >then sleeps 5 seconds. In another shell, I have a process loading 1000
> >messages every second into the input topic.
> >
> >Using the Kafka GetOffsetShell tool, I can see an ever-growing lag between
> >the latest offset and the checkpointed offset. However, the
> >%s-%s-messages-behind-high-watermark metric in JMX remains unchanged for
> >long periods of time (tens of minutes at least).
> >
> >I think what's happening is that the BrokerProxy only updates the high
> >watermark when a consumer is ready for more messages. When the job is this
> >slow, that rarely happens, so the metric doesn't get updated.
> >
> >On Sat, Jan 3, 2015 at 8:31 PM, Roger Hoover <[email protected]>
> >wrote:
> >
> >> Hi,
> >>
> >> I was wondering what the best practice is for monitoring lag in Samza
> >> jobs. It looks like only the committed offsets are stored in the
> >> TaskInstanceMetrics, so it seems there are two options for computing
> >> lag.
> >>
> >> a) Have a process consume the metrics topic. When it receives a
> >> message, it would query the Kafka brokers for the latest offset and
> >> compute lag. The problem is that this process will not deal with old
> >> messages correctly (as the brokers have moved on) if it ever falls
> >> behind (because of maintenance or bugs).
> >>
> >> b) Poll metrics via JMX and compute lag at the same time. The problem
> >> is that the polling process has to know how, and have permission, to
> >> connect to all Samza containers. With this approach, we lose the
> >> advantage of the central metrics topic.
> >>
> >> Any suggestions? Would it be too much overhead to fetch and store the
> >> latest offset for each SSP during checkpoint? I think this would allow
> >> a lag metric to be exposed for each SSP.
> >>
> >> Thanks,
> >>
> >> Roger
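(For anyone reproducing the test setup above: the GetOffsetShell check
mentioned is along these lines, where --time -1 asks for the latest offset
and the broker and topic are placeholders:

bin/kafka-run-class.sh kafka.tools.GetOffsetShell \
  --broker-list localhost:9092 --topic my-input-topic --time -1

Comparing its output against the job's checkpointed offsets gives the lag.)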
