[
https://issues.apache.org/jira/browse/SAMZA-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279282#comment-14279282
]
Chris Riccomini commented on SAMZA-503:
---------------------------------------
Yea, I can see that there's some confusion here. Sorry about that.
First, I think there are several boundaries at play:
# The offset of the most recent message on the Kafka broker.
# The offset of the most recent message that the BrokerProxy has consumed. This
is roughly equivalent to the offset of the most recent message put into the
BlockingEnvelopeMap (the queue of messages to be processed by StreamTasks).
# The offset of the most recent message that the StreamTask has processed.
# The offset that as the checkpoint offset the last time a checkpoint was sent.
These all change independently of one another.
Now, I think what Roger really wants is the difference between (1) and (3).
That is, he wants to know the difference between the last offset his StreamTask
saw, and the most recent offset in the Kafka broker. This will give you an
absolute measure of how far behind you are.
"messages-behind-high-watermark" is giving you the difference between (1) and
(2). The update-frequency issue that Roger is seeing happens because the
BrokerProxy *only* updates its MBHW metric when it fetches new data from the
Kafka broker, and it only does this when the there are < 50000 messages left in
the BlockingEnvelopeMap (the queue of messages to process). In a case where the
StreamTask is slow (e.g. it's sleeping as described in Roger's example), it
takes a very long time for the queue to get below 50000 messages, so the
BrokerProxy thread just sits and sleeps. It can take minutes for the metric to
get updated in this case.
bq. So a possible solution is to leave the "-messages-behind-high-watermark" as
is but add a new metric to compare the committed offset and the latest offset.
I agree with Yan. MBHW is useful as is, but it'd also be useful to have a
metric that measures (1) against (3).
It seems like there are actually two issues here:
# The MBHW metric should get updated even when the BrokerProxy thread doesn't
need to fetch more messages.
# We should add a new metric to track the latency between (1) and (3).
The solution I described above (fetch to check MBHW even if BrokerProxy doesn't
need more messages) should solve the first issue.
The second issue is a bit tricker.
# We don't currently impose any ordering for offsets. They have no semantic
meaning in Samza, except to be used as a unique identifier for a message. That
is, subtracting one offset from another might or might not be meaningful. If
the offset is a random GUID, subtracting offsets is non-sensical. In Kafka's
case, the BrokerProxy and KafkaSystemConsumer know that the offsets are
actually ordered numbers.
# You also need to know the most-recently-processed message, which is stored in
the OffsetManager inside SamzaContainer right now. The samza-kafka code is
unaware of this.
# Samza isn't currently aware of the high watermark. Only the BrokerProxy. The
samza-core code is unaware of this.
I need to think about how to solve this. Code cleanliness is my main concern.
> Lag gauge very slow to update for slow jobs
> -------------------------------------------
>
> Key: SAMZA-503
> URL: https://issues.apache.org/jira/browse/SAMZA-503
> Project: Samza
> Issue Type: Bug
> Components: metrics
> Affects Versions: 0.8.0
> Environment: Mac OS X, Oracle Java 7, ProcessJobFactory
> Reporter: Roger Hoover
> Assignee: Yan Fang
> Fix For: 0.9.0
>
>
> For slow jobs, the
> KafkaSystemConsumerMetrics.%s-%s-messages-behind-high-watermark) gauge does
> not get updated very often.
> To reproduce:
> * Create a job that processes one message and sleeps for 5 seconds
> * Create it's input topic but do not populate it yet
> * Start the job
> * Load 1000s of messages to it's input topic. You can keep adding messages
> with a "wait -n 1 <kafka console producer command>"
> What happens:
> * Run jconsole to view the JMX metrics
> * The %s-%s-messages-behind-high-watermark gauge will stay at 0 for a LONG
> time (~10 minutes?) before finally updating.
> What should happen:
> * The gauge should get updated at a reasonable interval (a least every few
> seconds)
> I think what's happening is that the BrokerProxy only updates the high
> watermark when a consumer is ready for more messages. When the job is so
> slow, this rarely happens to the metric doesn't get updated.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)