Re: [DISCUSS] KIP-1322: Add metrics to Kraft that measure the number of fetch timeouts and election timeouts

Tony Tang via dev Wed, 13 May 2026 17:24:09 -0700

Hi Kevin,

Re KW3: That makes sense to me. In particular, if we only register the
election timeout metric during Unattached/Prospective/Candidate, those are
all short states. After a timeout the node changes status, and the metric
gets unregistered. Since nodes spend most of their time in Follower or
Leader, operators may never see this metric at all. So yeah, let's just
always report these metrics.


Re KW8:

Implementing this like case 1 could diverge from the meaning of the metric
> in some cases. For example, expiring the fetch timeout for a voter always
> causes a transition to Prospective, so this metric is not updated on the
> next poll(). It also stops getting reported.

Since we decide not registering/unregistering metrics based on state, the
above concerns disappear, right?

observers do not transition to another state
> when its fetch timeout expires, so every subsequent poll would increase
> this metric's value, which is technically wrong based on what we defined in
> the KIP.

Okay, I understand the observer concern. Let's assume this KIP will be
implemented after KAFKA-20514, so the observer fetch timeout behavior will
already be fixed by then.

In summary:
1. we always report the metrics.
2. we only increment the counter in the poll methods when isExpired() is
called, as we originally planned. Does that work for you?

Thanks,
Tony

On Wed, May 13, 2026 at 12:10 PM Kevin Wu <[email protected]> wrote:

> Hi Tony,
>
> KW3: Sorry for going back and forth on this, but one issue I see with not
> reporting these metrics when the node's current EpochState does not contain
> the timer is that downstream observability platforms may not be able to
> capture these values very well. For example, the fetch timeout can only
> expire once on a FollowerAsVoter before it transitions to Prospective, and
> that value would not be reported/visible to operators. Only upon returning
> to FollowerState would the operator see the fetch timeout metric again,
> which cannot increment without first not reporting any value. That is not
> intuitive metric behavior to me, as the operator would see the value "too
> late".
>
> In summary, I think it is fine to always report these metrics, and if it
> turns out that is wrong or confusing, that's an implementation detail we
> can fix later. I think I was trying to draw too many parallels with the
> KIP-853 leader metrics I implemented previously. Leader and Follower are
> "long-lived" states so I think it makes sense to report metrics which only
> change in these states, during these states. Timer expirations specifically
> cause transitions to and between more "intermittent" states (e.g.
> Prospective, Candidate, etc.), and therefore occur at the "end" of an
> EpochState, so registering and unregistering them does not seem like a good
> idea.
>
> KW8: Yeah, basically. Implementing this like case 1 could diverge from the
> meaning of the metric in some cases. For example, expiring the fetch
> timeout for a voter always causes a transition to Prospective, so this
> metric is not updated on the next poll(). It also stops getting reported.
> However, in the current code, observers do not transition to another state
> when its fetch timeout expires, so every subsequent poll would increase
> this metric's value, which is technically wrong based on what we defined in
> the KIP.
> https://urldefense.com/v3/__https://issues.apache.org/jira/browse/KAFKA-20514__;!!Ayb5sqE7!qUcUv6MZLg9t1tNcXiCJCDe1vw-YajXBZeS9xD-Un3niEylSMbS3gHZyccO2mgQha4feqEvYJSE5sbQTnAhaooA$
> fixes the issue
> for the fetch timer specifically.
>
> This comment specifically is just something I wanted to make you aware of
> during your implementation, since it was not super obvious to me that this
> interaction with poll(), EpochState transitions, and our proposed metrics
> occurred until a deeper look at the code.
>
> Best,
> Kevin Wu
>
> On Wed, May 13, 2026 at 8:51 AM Tony Tang via dev <[email protected]>
> wrote:
>
> > Hi Kevin,
> >
> > Thanks for the feedback.
> > Re KW 5,6,7: I agree with all three. I'll update the KIP
> > Re KW 8:
> > Case 1: The the poll methods in KafkaRaftClient actively calls
> isExpired()
> > on the timer, and when it finds the timer has expired, it increments the
> > counter. The counting logic lives in the caller
> > Case 2: The timer itself is aware of its expiration and automatically
> > increments the counter when it transitions from not expired to expired.
> The
> > counting logic lives inside the timer.
> > Is that the cases you're talking about?
> >
> > Best,
> > Tony
> >
> >
> > On Tue, May 12, 2026 at 5:42 PM Kevin Wu <[email protected]> wrote:
> >
> > > Hi Tony,
> > >
> > > Thanks for the updates to the KIP. I have a few minor comments:
> > >
> > > KW5: nit, but can we say “this metric is registered only…” instead of
> > > “Registered only…” in the description column?
> > >
> > > KW6: What do you think about renaming the metric to
> > “timeout-expirations”?
> > > This would be slightly less verbose than what we have now.
> > >
> > > KW7: Instead of having separate columns for the metric tag and name,
> can
> > we
> > > have one column with the entire MBean name? For example:
> > >
> > >
> >
> “kafka.server:type=node-metrics,name=maximum-supported-level,feature-name=X”.
> > >
> > > KW8: Something to consider in determining when these metric values
> should
> > > be incremented: should  the value be incremented when the raft client
> > reads
> > > the timer and finds it has expired? Or, should the value be incremented
> > > only when the timer goes from not expired to expired? The first is
> simple
> > > but means the metric value is slightly different than what we’ve
> > described
> > > in the KIP. I feel like implementing the second may be pretty complex…
> > >
> > > Best,
> > > Kevin Wu
> > >
> > > On Fri, May 8, 2026 at 10:06 AM Tony Tang via dev <
> [email protected]>
> > > wrote:
> > >
> > > > Hi Justine,
> > > >
> > > > Thanks for the feedback!
> > > > I looked through the Kafka codebase and found a couple of existing
> > > timeout
> > > > metrics:
> > > >
> > > > 1.worker-poll-timeout-count in Kafka Connect, which is a cumulative
> > count
> > > > of poll timeouts, implemented as a Gauge using an AtomicLong counter.
> > > This
> > > > is the closest to what we're proposing.
> > > > 2. AcquisitionLockTimeoutPerSec in Share Groups, which is a Meter
> that
> > > > tracks the rate of acquisition lock timeouts.
> > > >
> > > > I think our KIP follows a similar approach to
> worker-poll-timeout-count
> > > >
> > > > Thanks,
> > > > Tony
> > > >
> > > > On Thu, May 7, 2026 at 5:21 PM Justine Olshan via dev <
> > > > [email protected]>
> > > > wrote:
> > > >
> > > > > Hey Tony,
> > > > >
> > > > > Thanks for the KIP. Overall, I think the idea makes sense and it
> > seems
> > > > like
> > > > > you and Kevin are getting closer to agreement on the exact
> definition
> > > of
> > > > > the metrics.
> > > > > I was curious, are there any other timeout metrics you can find in
> > > Kafka
> > > > > and how are those defined? We don't necessarily need to do the
> same,
> > > but
> > > > > was curious if there was any precedent for this type of metric.
> > > > >
> > > > > Thanks,
> > > > > Justine
> > > > >
> > > > > On Mon, May 4, 2026 at 9:25 AM Tony Tang via dev <
> > [email protected]
> > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Kevin,
> > > > > >
> > > > > > Thanks for the reply.
> > > > > >
> > > > > > KW3: Ok, I agree that metric values at time X should be an
> accurate
> > > > > > snapshot of the node at time X, and that collecting historical
> > values
> > > > is
> > > > > > not the responsibility of kafka. The approach you described makes
> > > sense
> > > > > to
> > > > > > me: we keep the internal counter across state transitions, but
> only
> > > > > > register/unregister the metric based on whether the underlying
> > timer
> > > > > > exists. I'm on board with that approach
> > > > > >
> > > > > > KW4: That's a great suggestion. However, I think it's out of
> scope
> > > for
> > > > > this
> > > > > > KIP. I'd prefer to keep this KIP focused on timeout expiration
> > > metrics
> > > > > and
> > > > > > maybe we can discuss the state transition counting in the future.
> > > > > >
> > > > > > Best,
> > > > > > Tony
> > > > > >
> > > > > > On Fri, May 1, 2026 at 5:12 PM Kevin Wu <[email protected]>
> > > > wrote:
> > > > > >
> > > > > > > Hi Tony,
> > > > > > >
> > > > > > > Thanks for the reply.
> > > > > > >
> > > > > > > KW3: I don't think "count" metrics like the ones being
> discussed
> > in
> > > > > this
> > > > > > > KIP should report values when the objects they are associated
> > with
> > > do
> > > > > not
> > > > > > > exist. This would mean metric values at time X are not an
> > accurate
> > > > > > > "snapshot" of the Kafka node at time X. In my opinion,
> collecting
> > > the
> > > > > > > historic values for a metric, visualizing them through
> > dashboards,
> > > > and
> > > > > > > monitoring them to alert operators are the responsibilities of
> a
> > > > > > downstream
> > > > > > > observability software, not Kafka. Kafka does have the
> capability
> > > to
> > > > > > create
> > > > > > > "derivative" metrics (i.e. metrics whose values are based off
> of
> > > > > sampling
> > > > > > > something else over time) via Sensors, but I don't think that
> > fits
> > > > our
> > > > > > use
> > > > > > > case as previously discussed.
> > > > > > >
> > > > > > > Another way to think about it is that adding or removing a
> metric
> > > so
> > > > > > Kafka
> > > > > > > starts or stops reporting a value to an observability service
> > > > actually
> > > > > > > tells the user more information about Kafka compared to
> > > > unconditionally
> > > > > > > reporting said metric value. Additionally, just because you
> > remove
> > > a
> > > > > > metric
> > > > > > > does not mean you need to remove the corresponding counter
> within
> > > the
> > > > > > > KafkaRaftMetrics object. For example, a node fetch request
> times
> > > out
> > > > 5
> > > > > > > times as a follower, so the node reports 5 for the metric.
> Next,
> > > the
> > > > > node
> > > > > > > becomes the leader, so it stops reporting the metric (i.e. the
> > > metric
> > > > > has
> > > > > > > no value). Then it becomes a follower, and starts reporting 5
> > > again.
> > > > > When
> > > > > > > the next fetch timeout happens, the node reports 6 for the
> > metric.
> > > > What
> > > > > > do
> > > > > > > you think about this behavior?
> > > > > > >
> > > > > > > KW4: If the desire is for a metric that always reports a value,
> > > what
> > > > do
> > > > > > you
> > > > > > > think about a metric that counts the number of `EpochState`
> > > > > transitions?
> > > > > > I
> > > > > > > think this value makes sense to report this value for the
> > lifetime
> > > > of a
> > > > > > > process, and generally, frequent state transitions are an
> > > indication
> > > > > > > something is wrong with the cluster. This would be an
> additional
> > > > metric
> > > > > > to
> > > > > > > the ones we discussed above.
> > > > > > >
> > > > > > > Best,
> > > > > > > Kevin Wu
> > > > > > >
> > > > > > > On Fri, May 1, 2026 at 12:59 PM Tony Tang via dev <
> > > > > [email protected]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Kevin,
> > > > > > > >
> > > > > > > > Thanks for the reply. Very insightful points.
> > > > > > > >
> > > > > > > > KW1: Yes, using a single tagged metric makes sense. it's
> > cleaner
> > > > and
> > > > > > more
> > > > > > > > extensible. I'll adopt this approach.
> > > > > > > >
> > > > > > > > KW2: Yes, we don't need to use `CumulativeCount`. Already
> > updated
> > > > in
> > > > > > the
> > > > > > > > KIP
> > > > > > > >
> > > > > > > > KW3: I understand each timer is only meaningful in certain
> > > states,
> > > > > but
> > > > > > > the
> > > > > > > > metric value is still useful for operational monitoring
> > > regardless
> > > > of
> > > > > > the
> > > > > > > > current state. It tells you how many times a timeout has
> > expired
> > > > over
> > > > > > the
> > > > > > > > lifetime of the node. Hiding or clearing the metric when the
> > node
> > > > > isn't
> > > > > > > in
> > > > > > > > the relevant state could actually make it harder for users to
> > > > > diagnose
> > > > > > > > historical issues, since they'd need to catch the metric
> while
> > > the
> > > > > node
> > > > > > > > happens to be in the right state. For example, if a follower
> > had
> > > > > > repeated
> > > > > > > > fetch timeout expirations and then transitions to a
> > > > candidate/leader,
> > > > > > the
> > > > > > > > metrics would still be valuable for diagnosing why the leader
> > > > > election
> > > > > > > > happened in the first place, right? If we cleared the metric
> on
> > > > state
> > > > > > > > transition, that information would be lost. The question is :
> > Do
> > > we
> > > > > > only
> > > > > > > > want the metric to reflect only the latest state, or the
> > overall
> > > > > > timeout
> > > > > > > > behavior over the node's lifetime? I lean toward the latter,
> as
> > > it
> > > > > > > provides
> > > > > > > > more useful information for monitoring network issues. To
> avoid
> > > > > > > confusion,
> > > > > > > > maybe we can use the metric name lifetime-timeout-count + tag
> > > > > > > > timer-name=fetch/election? What do you think?
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Apr 30, 2026 at 3:03 PM Kevin Wu <
> > [email protected]
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Tony,
> > > > > > > > >
> > > > > > > > > Thanks for the KIP. I agree that having metrics for
> timeouts
> > in
> > > > > KRaft
> > > > > > > > would
> > > > > > > > > be a nice addition. I have a few high level comments about
> > the
> > > > KIP:
> > > > > > > > >
> > > > > > > > > KW1: Did you consider making a tagged metric like
> > > > > > `number-of-timeouts`
> > > > > > > > > instead of individual metrics? You could tag by the timer
> > name
> > > > > (e.g.
> > > > > > > > fetch,
> > > > > > > > > election, update-voter, check-quorum, and
> begin-quorum-epoch
> > > > etc.)
> > > > > > > since
> > > > > > > > > KRaft supports several kinds of timers, and may add more in
> > the
> > > > > > future.
> > > > > > > > You
> > > > > > > > > can look at `NodeMetrics.java` and
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/KAFKA/KIP-1180*3A*Add*generic*feature*level*metrics__;JSsrKysr!!Ayb5sqE7!qPjsZ_186iR3QjEak9hmexMYOhwzDGzvcLYwnVUujYlAy2wAAQchfvSKMr9oG7Mygg608Vz6zFCv5QDQFYUcvow$
> > > > > > > > > for an example of tagged metrics using Kafka's new metrics
> > > > > library. I
> > > > > > > > think
> > > > > > > > > there is an argument we should add timeout metrics for some
> > of
> > > > > these
> > > > > > > > other
> > > > > > > > > KRaft timers I mentioned, since reporting them could also
> > help
> > > > > > > operators
> > > > > > > > > diagnose network partitions or possible software bugs.
> > > > > > > > >
> > > > > > > > > KW2: I see the "Type" for each metric is
> `CumulativeCount`. I
> > > > think
> > > > > > > this
> > > > > > > > > might be overkill, and that we could just use Integer for
> the
> > > > data
> > > > > > > type,
> > > > > > > > > and expose an increment method for each metric. In general,
> > > > sensors
> > > > > > are
> > > > > > > > > used for when multiple metrics are associated with a
> specific
> > > > > concept
> > > > > > > > (e.g.
> > > > > > > > > `commit-latency-avg` and `commit-latency-max` are two
> > different
> > > > > > metrics
> > > > > > > > > associated with the same concept of "commit latency"). It
> is
> > > hard
> > > > > for
> > > > > > > me
> > > > > > > > to
> > > > > > > > > imagine that the number of timeouts occurring would have
> more
> > > > than
> > > > > > one
> > > > > > > > > metric associated with it.
> > > > > > > > >
> > > > > > > > > KW3: Each of these timers is associated with an EpochState
> > > (e.g.
> > > > > the
> > > > > > > > fetch
> > > > > > > > > timer with FollowerState, check quorum timer with
> > LeaderState,
> > > > > etc.).
> > > > > > > > What
> > > > > > > > > should the value of these metrics be when a node
> transitions
> > > > > between
> > > > > > > > > EpochStates? Should we stop reporting the metrics
> associated
> > > with
> > > > > the
> > > > > > > old
> > > > > > > > > EpochState, and start reporting the metrics associated with
> > the
> > > > new
> > > > > > > > > EpochState? I personally think it might be confusing if
> these
> > > > > metrics
> > > > > > > > > report values even if the underlying timer does not exist
> on
> > > the
> > > > > > node.
> > > > > > > > For
> > > > > > > > > example, the fetch timeout metric reporting a value when
> the
> > > > local
> > > > > > node
> > > > > > > > is
> > > > > > > > > the KRaft leader seems odd to me. When we added metrics for
> > > > KIP-853
> > > > > > > > > associated with the leader (e.g.
> `uncommitted-voter-change`),
> > > we
> > > > > > > decided
> > > > > > > > to
> > > > > > > > > only report values for those metrics when the local node
> was
> > > the
> > > > > > > leader.
> > > > > > > > It
> > > > > > > > > would be nice if we could follow that convention for these
> > > > metrics
> > > > > > too,
> > > > > > > > and
> > > > > > > > > document which states report which metrics in the KIP. What
> > do
> > > > you
> > > > > > > think?
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Kevin Wu
> > > > > > > > >
> > > > > > > > > On Tue, Apr 21, 2026 at 12:32 PM Tony Tang via dev <
> > > > > > > [email protected]
> > > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hello everyone,
> > > > > > > > > >
> > > > > > > > > > I'd like to start a discussion on KIP-1322: Add metrics
> to
> > > > Kraft
> > > > > > that
> > > > > > > > > > measure the number of fetch timeouts and election
> timeouts
> > <
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/KAFKA/KIP-1322*3A*Add*metrics*to*Kraft*that*measure*the*number*of*fetch*timeouts*and*election*timeouts__;JSsrKysrKysrKysrKysr!!Ayb5sqE7!qPjsZ_186iR3QjEak9hmexMYOhwzDGzvcLYwnVUujYlAy2wAAQchfvSKMr9oG7Mygg608Vz6zFCv5QDQLt1GBmw$
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > This proposal aims to add new metrics to KRaft that track
> > how
> > > > > often
> > > > > > > > fetch
> > > > > > > > > > timeouts and election timeouts occur.
> > > > > > > > > >
> > > > > > > > > > Best regards,
> > > > > > > > > > Tony Tang
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-1322: Add metrics to Kraft that measure the number of fetch timeouts and election timeouts

Reply via email to