Re: [DISCUSS] KIP-1322: Add metrics to Kraft that measure the number of fetch timeouts and election timeouts

Tony Tang via dev Wed, 13 May 2026 08:52:45 -0700

Hi Kevin,

Thanks for the feedback.
Re KW 5,6,7: I agree with all three. I'll update the KIP
Re KW 8:
Case 1: The the poll methods in KafkaRaftClient actively calls isExpired()
on the timer, and when it finds the timer has expired, it increments the
counter. The counting logic lives in the caller
Case 2: The timer itself is aware of its expiration and automatically
increments the counter when it transitions from not expired to expired. The
counting logic lives inside the timer.
Is that the cases you're talking about?


Best,
Tony


On Tue, May 12, 2026 at 5:42 PM Kevin Wu <[email protected]> wrote:

> Hi Tony,
>
> Thanks for the updates to the KIP. I have a few minor comments:
>
> KW5: nit, but can we say “this metric is registered only…” instead of
> “Registered only…” in the description column?
>
> KW6: What do you think about renaming the metric to “timeout-expirations”?
> This would be slightly less verbose than what we have now.
>
> KW7: Instead of having separate columns for the metric tag and name, can we
> have one column with the entire MBean name? For example:
>
> “kafka.server:type=node-metrics,name=maximum-supported-level,feature-name=X”.
>
> KW8: Something to consider in determining when these metric values should
> be incremented: should  the value be incremented when the raft client reads
> the timer and finds it has expired? Or, should the value be incremented
> only when the timer goes from not expired to expired? The first is simple
> but means the metric value is slightly different than what we’ve described
> in the KIP. I feel like implementing the second may be pretty complex…
>
> Best,
> Kevin Wu
>
> On Fri, May 8, 2026 at 10:06 AM Tony Tang via dev <[email protected]>
> wrote:
>
> > Hi Justine,
> >
> > Thanks for the feedback!
> > I looked through the Kafka codebase and found a couple of existing
> timeout
> > metrics:
> >
> > 1.worker-poll-timeout-count in Kafka Connect, which is a cumulative count
> > of poll timeouts, implemented as a Gauge using an AtomicLong counter.
> This
> > is the closest to what we're proposing.
> > 2. AcquisitionLockTimeoutPerSec in Share Groups, which is a Meter that
> > tracks the rate of acquisition lock timeouts.
> >
> > I think our KIP follows a similar approach to worker-poll-timeout-count
> >
> > Thanks,
> > Tony
> >
> > On Thu, May 7, 2026 at 5:21 PM Justine Olshan via dev <
> > [email protected]>
> > wrote:
> >
> > > Hey Tony,
> > >
> > > Thanks for the KIP. Overall, I think the idea makes sense and it seems
> > like
> > > you and Kevin are getting closer to agreement on the exact definition
> of
> > > the metrics.
> > > I was curious, are there any other timeout metrics you can find in
> Kafka
> > > and how are those defined? We don't necessarily need to do the same,
> but
> > > was curious if there was any precedent for this type of metric.
> > >
> > > Thanks,
> > > Justine
> > >
> > > On Mon, May 4, 2026 at 9:25 AM Tony Tang via dev <[email protected]
> >
> > > wrote:
> > >
> > > > Hi Kevin,
> > > >
> > > > Thanks for the reply.
> > > >
> > > > KW3: Ok, I agree that metric values at time X should be an accurate
> > > > snapshot of the node at time X, and that collecting historical values
> > is
> > > > not the responsibility of kafka. The approach you described makes
> sense
> > > to
> > > > me: we keep the internal counter across state transitions, but only
> > > > register/unregister the metric based on whether the underlying timer
> > > > exists. I'm on board with that approach
> > > >
> > > > KW4: That's a great suggestion. However, I think it's out of scope
> for
> > > this
> > > > KIP. I'd prefer to keep this KIP focused on timeout expiration
> metrics
> > > and
> > > > maybe we can discuss the state transition counting in the future.
> > > >
> > > > Best,
> > > > Tony
> > > >
> > > > On Fri, May 1, 2026 at 5:12 PM Kevin Wu <[email protected]>
> > wrote:
> > > >
> > > > > Hi Tony,
> > > > >
> > > > > Thanks for the reply.
> > > > >
> > > > > KW3: I don't think "count" metrics like the ones being discussed in
> > > this
> > > > > KIP should report values when the objects they are associated with
> do
> > > not
> > > > > exist. This would mean metric values at time X are not an accurate
> > > > > "snapshot" of the Kafka node at time X. In my opinion, collecting
> the
> > > > > historic values for a metric, visualizing them through dashboards,
> > and
> > > > > monitoring them to alert operators are the responsibilities of a
> > > > downstream
> > > > > observability software, not Kafka. Kafka does have the capability
> to
> > > > create
> > > > > "derivative" metrics (i.e. metrics whose values are based off of
> > > sampling
> > > > > something else over time) via Sensors, but I don't think that fits
> > our
> > > > use
> > > > > case as previously discussed.
> > > > >
> > > > > Another way to think about it is that adding or removing a metric
> so
> > > > Kafka
> > > > > starts or stops reporting a value to an observability service
> > actually
> > > > > tells the user more information about Kafka compared to
> > unconditionally
> > > > > reporting said metric value. Additionally, just because you remove
> a
> > > > metric
> > > > > does not mean you need to remove the corresponding counter within
> the
> > > > > KafkaRaftMetrics object. For example, a node fetch request times
> out
> > 5
> > > > > times as a follower, so the node reports 5 for the metric. Next,
> the
> > > node
> > > > > becomes the leader, so it stops reporting the metric (i.e. the
> metric
> > > has
> > > > > no value). Then it becomes a follower, and starts reporting 5
> again.
> > > When
> > > > > the next fetch timeout happens, the node reports 6 for the metric.
> > What
> > > > do
> > > > > you think about this behavior?
> > > > >
> > > > > KW4: If the desire is for a metric that always reports a value,
> what
> > do
> > > > you
> > > > > think about a metric that counts the number of `EpochState`
> > > transitions?
> > > > I
> > > > > think this value makes sense to report this value for the lifetime
> > of a
> > > > > process, and generally, frequent state transitions are an
> indication
> > > > > something is wrong with the cluster. This would be an additional
> > metric
> > > > to
> > > > > the ones we discussed above.
> > > > >
> > > > > Best,
> > > > > Kevin Wu
> > > > >
> > > > > On Fri, May 1, 2026 at 12:59 PM Tony Tang via dev <
> > > [email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hi Kevin,
> > > > > >
> > > > > > Thanks for the reply. Very insightful points.
> > > > > >
> > > > > > KW1: Yes, using a single tagged metric makes sense. it's cleaner
> > and
> > > > more
> > > > > > extensible. I'll adopt this approach.
> > > > > >
> > > > > > KW2: Yes, we don't need to use `CumulativeCount`. Already updated
> > in
> > > > the
> > > > > > KIP
> > > > > >
> > > > > > KW3: I understand each timer is only meaningful in certain
> states,
> > > but
> > > > > the
> > > > > > metric value is still useful for operational monitoring
> regardless
> > of
> > > > the
> > > > > > current state. It tells you how many times a timeout has expired
> > over
> > > > the
> > > > > > lifetime of the node. Hiding or clearing the metric when the node
> > > isn't
> > > > > in
> > > > > > the relevant state could actually make it harder for users to
> > > diagnose
> > > > > > historical issues, since they'd need to catch the metric while
> the
> > > node
> > > > > > happens to be in the right state. For example, if a follower had
> > > > repeated
> > > > > > fetch timeout expirations and then transitions to a
> > candidate/leader,
> > > > the
> > > > > > metrics would still be valuable for diagnosing why the leader
> > > election
> > > > > > happened in the first place, right? If we cleared the metric on
> > state
> > > > > > transition, that information would be lost. The question is : Do
> we
> > > > only
> > > > > > want the metric to reflect only the latest state, or the overall
> > > > timeout
> > > > > > behavior over the node's lifetime? I lean toward the latter, as
> it
> > > > > provides
> > > > > > more useful information for monitoring network issues. To avoid
> > > > > confusion,
> > > > > > maybe we can use the metric name lifetime-timeout-count + tag
> > > > > > timer-name=fetch/election? What do you think?
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Thu, Apr 30, 2026 at 3:03 PM Kevin Wu <[email protected]
> >
> > > > wrote:
> > > > > >
> > > > > > > Hi Tony,
> > > > > > >
> > > > > > > Thanks for the KIP. I agree that having metrics for timeouts in
> > > KRaft
> > > > > > would
> > > > > > > be a nice addition. I have a few high level comments about the
> > KIP:
> > > > > > >
> > > > > > > KW1: Did you consider making a tagged metric like
> > > > `number-of-timeouts`
> > > > > > > instead of individual metrics? You could tag by the timer name
> > > (e.g.
> > > > > > fetch,
> > > > > > > election, update-voter, check-quorum, and begin-quorum-epoch
> > etc.)
> > > > > since
> > > > > > > KRaft supports several kinds of timers, and may add more in the
> > > > future.
> > > > > > You
> > > > > > > can look at `NodeMetrics.java` and
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/KAFKA/KIP-1180*3A*Add*generic*feature*level*metrics__;JSsrKysr!!Ayb5sqE7!qPjsZ_186iR3QjEak9hmexMYOhwzDGzvcLYwnVUujYlAy2wAAQchfvSKMr9oG7Mygg608Vz6zFCv5QDQFYUcvow$
> > > > > > > for an example of tagged metrics using Kafka's new metrics
> > > library. I
> > > > > > think
> > > > > > > there is an argument we should add timeout metrics for some of
> > > these
> > > > > > other
> > > > > > > KRaft timers I mentioned, since reporting them could also help
> > > > > operators
> > > > > > > diagnose network partitions or possible software bugs.
> > > > > > >
> > > > > > > KW2: I see the "Type" for each metric is `CumulativeCount`. I
> > think
> > > > > this
> > > > > > > might be overkill, and that we could just use Integer for the
> > data
> > > > > type,
> > > > > > > and expose an increment method for each metric. In general,
> > sensors
> > > > are
> > > > > > > used for when multiple metrics are associated with a specific
> > > concept
> > > > > > (e.g.
> > > > > > > `commit-latency-avg` and `commit-latency-max` are two different
> > > > metrics
> > > > > > > associated with the same concept of "commit latency"). It is
> hard
> > > for
> > > > > me
> > > > > > to
> > > > > > > imagine that the number of timeouts occurring would have more
> > than
> > > > one
> > > > > > > metric associated with it.
> > > > > > >
> > > > > > > KW3: Each of these timers is associated with an EpochState
> (e.g.
> > > the
> > > > > > fetch
> > > > > > > timer with FollowerState, check quorum timer with LeaderState,
> > > etc.).
> > > > > > What
> > > > > > > should the value of these metrics be when a node transitions
> > > between
> > > > > > > EpochStates? Should we stop reporting the metrics associated
> with
> > > the
> > > > > old
> > > > > > > EpochState, and start reporting the metrics associated with the
> > new
> > > > > > > EpochState? I personally think it might be confusing if these
> > > metrics
> > > > > > > report values even if the underlying timer does not exist on
> the
> > > > node.
> > > > > > For
> > > > > > > example, the fetch timeout metric reporting a value when the
> > local
> > > > node
> > > > > > is
> > > > > > > the KRaft leader seems odd to me. When we added metrics for
> > KIP-853
> > > > > > > associated with the leader (e.g. `uncommitted-voter-change`),
> we
> > > > > decided
> > > > > > to
> > > > > > > only report values for those metrics when the local node was
> the
> > > > > leader.
> > > > > > It
> > > > > > > would be nice if we could follow that convention for these
> > metrics
> > > > too,
> > > > > > and
> > > > > > > document which states report which metrics in the KIP. What do
> > you
> > > > > think?
> > > > > > >
> > > > > > > Best,
> > > > > > > Kevin Wu
> > > > > > >
> > > > > > > On Tue, Apr 21, 2026 at 12:32 PM Tony Tang via dev <
> > > > > [email protected]
> > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hello everyone,
> > > > > > > >
> > > > > > > > I'd like to start a discussion on KIP-1322: Add metrics to
> > Kraft
> > > > that
> > > > > > > > measure the number of fetch timeouts and election timeouts <
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/KAFKA/KIP-1322*3A*Add*metrics*to*Kraft*that*measure*the*number*of*fetch*timeouts*and*election*timeouts__;JSsrKysrKysrKysrKysr!!Ayb5sqE7!qPjsZ_186iR3QjEak9hmexMYOhwzDGzvcLYwnVUujYlAy2wAAQchfvSKMr9oG7Mygg608Vz6zFCv5QDQLt1GBmw$
> > > > > > > > >
> > > > > > > >
> > > > > > > > This proposal aims to add new metrics to KRaft that track how
> > > often
> > > > > > fetch
> > > > > > > > timeouts and election timeouts occur.
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Tony Tang
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-1322: Add metrics to Kraft that measure the number of fetch timeouts and election timeouts

Reply via email to