Hey Lucas,

Thinking about the original intent of the KIP, I agree it would be better
to drop the DEBUG metrics suggestion and treat it as follow-on work.

-Bill

On Tue, Sep 30, 2025 at 4:01 AM Lucas Brutschy
<[email protected]> wrote:

> Hi Matthias / Bill,
>
> It's a good point that there is an overlap between the debug metrics
> and task-created-rate/task-created-total.
>
> I wonder if we are overloading this KIP with the DEBUG metrics that
> Bill suggested. The main point of the KIP is to capture the latency of
> revoking and assigning tasks; understanding that latency inside
> Streams is important for understanding the duration of rebalances.
>
> Capturing the number of tasks assigned / revoked / lost, deprecating
> the task-closed-rate/task-closed-total metrics, and potentially
> including revoked / added task IDs in metrics all sound useful, but a
> bit orthogonal to the original point of the KIP. Could we leave this
> to future work?
>
> Cheers,
> Lucas
>
> On Tue, Sep 30, 2025 at 2:02 AM Kirk True <[email protected]> wrote:
> >
> > Hi Travis,
> >
> > Thanks for the KIP!
> >
> > No comments on the KIP, per se, but I'm glad I read it because I
> > don't remember ever hearing about metrics recording levels before.
> > I'll definitely plan to put those into use for some of the metrics
> > we added recently to the consumer.
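> >
> > For anyone else who hadn't seen them: a minimal sketch of how I'd
> > expect to enable the DEBUG-level metrics, assuming the standard
> > `metrics.recording.level` client config (INFO / DEBUG / TRACE):
> >
> >     import java.util.Properties;
> >     import org.apache.kafka.clients.consumer.ConsumerConfig;
> >
> >     Properties props = new Properties();
> >     props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
> >     // Sensors registered at DEBUG level only record values when the
> >     // recording level is DEBUG (or TRACE); at the default INFO level
> >     // they are cheap no-ops.
> >     props.put(ConsumerConfig.METRICS_RECORDING_LEVEL_CONFIG, "DEBUG");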
> >
> > Thanks,
> > Kirk
> >
> > On Mon, Sep 29, 2025, at 11:24 AM, Matthias J. Sax wrote:
> > > Thanks for the KIP Travis.
> > >
> > > The 3 new latency metrics sound very useful.
> > >
> > > For the 4 new debug metrics: they sound somewhat redundant to existing
> > > metrics:
> > >   - task-created-rate
> > >   - task-created-total
> > >   - task-closed-rate
> > >   - task-closed-total
> > >
> > > Are you aware that these metrics already exist? I don't see why they
> > > would not work when the "streams" rebalance protocol is used.
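> > >
> > > For example, they should already be visible via
> > > `KafkaStreams#metrics()` -- a quick sketch (assuming the thread-level
> > > group name "stream-thread-metrics"):
> > >
> > >     import org.apache.kafka.streams.KafkaStreams;
> > >
> > >     static void printTaskCreatedMetrics(final KafkaStreams streams) {
> > >         streams.metrics().forEach((name, metric) -> {
> > >             // Thread-level metrics are tagged per StreamThread.
> > >             if ("stream-thread-metrics".equals(name.group())
> > >                     && name.name().startsWith("task-created")) {
> > >                 System.out.println(name.name() + " " + name.tags()
> > >                     + " = " + metric.metricValue());
> > >             }
> > >         });
> > >     }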
> > >
> > > Btw: I have always wondered about the usefulness of the two `-total`
> > > metrics. How is it useful to know, for a long-running application,
> > > how many tasks were created or closed during the whole lifetime of a
> > > StreamThread?
> > >
> > > It could be useful, though, to know the number of created/revoked/lost
> > > tasks from the last rebalance, i.e., we would use a gauge instead of a
> > > sum metric?
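> > >
> > > As a sketch against the client `Metrics` registry (the metric name,
> > > group, and the counter variable are made up for illustration):
> > >
> > >     import java.util.concurrent.atomic.AtomicInteger;
> > >     import org.apache.kafka.common.MetricName;
> > >     import org.apache.kafka.common.metrics.Gauge;
> > >     import org.apache.kafka.common.metrics.Metrics;
> > >
> > >     static void registerLastRebalanceGauge(final Metrics metrics,
> > >                                            final AtomicInteger lastCreated) {
> > >         final MetricName name = metrics.metricName(
> > >             "last-rebalance-tasks-created", "stream-thread-metrics",
> > >             "Tasks created during the last rebalance");
> > >         // A gauge reports the current value instead of accumulating
> > >         // a lifetime total.
> > >         metrics.addMetric(name, (Gauge<Integer>) (config, now) -> lastCreated.get());
> > >     }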
> > >
> > > Splitting out active/standby/warmup as proposed by Lucas sounds useful,
> > > too. So maybe we could deprecate the existing metrics and replace them
> > > with better ones?
> > >
> > > What is the reason to split out active/standby (and warmup) for the
> > > "assigned" case, but not for the revoked or lost cases?
> > >
> > >
> > > Personally, I don't think we should add task IDs to metrics. If users
> > > need to access this information, it might be better to add some
> > > callback/listener they can register on `KafkaStreams` -- but even for
> > > this, I am not sure how useful it would be. Did any user report that
> > > it would be useful?
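> > >
> > > If we did go the listener route, I'd imagine a shape roughly like this
> > > (purely hypothetical; the interface and setter names are made up):
> > >
> > >     import java.util.Set;
> > >     import org.apache.kafka.streams.processor.TaskId;
> > >
> > >     // Hypothetical: not part of the KIP or the current codebase.
> > >     public interface TaskAssignmentListener {
> > >         void onAssignmentChanged(Set<TaskId> assigned,
> > >                                  Set<TaskId> revoked,
> > >                                  Set<TaskId> lost);
> > >     }
> > >
> > >     // Hypothetical registration point:
> > >     // streams.setTaskAssignmentListener(listener);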
> > >
> > >
> > > -Matthias
> > >
> > > On 9/16/25 2:14 AM, Lucas Brutschy wrote:
> > > > Hi Travis,
> > > >
> > > > thanks for the KIP!
> > > >
> > > > Looks good to me. I'm not sure we need the DEBUG metrics, but we can
> > > > add them. I would, however, also include warm-up tasks in the
> > > > metrics if you are including active / standby ones. Furthermore, I
> > > > wasn't sure if Bill wanted to add the number of tasks or the actual
> > > > task IDs to the DEBUG metrics. Bill, maybe you can comment on that.
> > > >
> > > > I think after hashing out these finer points about the DEBUG metrics,
> > > > we can open a vote thread.
> > > >
> > > > Cheers,
> > > > Lucas
> > > >
> > > > On Mon, Sep 15, 2025 at 6:38 AM Travis Zhang
> > > > <[email protected]> wrote:
> > > >>
> > > >> Hi Bill,
> > > >>
> > > >> Thanks for your feedback. It does make sense to me. I've added the
> > > >> total task count metrics to the KIP at DEBUG level!
> > > >>
> > > >> Best,
> > > >> Travis
> > > >>
> > > >>> On Fri, Sep 12, 2025 at 11:44 AM Bill Bejeck <[email protected]>
> > > >>> wrote:
> > > >>>
> > > >>> Hi Travis,
> > > >>>
> > > >>> Thanks for the KIP! It looks like a useful addition in support of
> > > >>> KIP-1017.
> > > >>> Overall the KIP LGTM, but I have a follow-up question.
> > > >>>
> > > >>> Would we want to consider an additional metric displaying the tasks
> > > >>> involved in each of the revoked, assigned, and lost events?
> > > >>> This would probably be best at the DEBUG level.
> > > >>> Certainly this is an optional suggestion, but I do feel it would be
> > > >>> a valuable aid to operators of KS applications.
> > > >>>
> > > >>> Regards,
> > > >>> Bill
> > > >>>
> > > >>> On Fri, Sep 12, 2025 at 12:04 AM Travis Zhang <[email protected]>
> > > >>> wrote:
> > > >>>
> > > >>>> Hey Alieh,
> > > >>>>
> > > >>>> Thanks for the great questions and the thoughtful feedback on the
> > > >>>> KIP!
> > > >>>>
> > > >>>> Good call on adding the code snippets—I'll get the key class
> > > >>>> structures into the KIP to make it fully self-contained.
> > > >>>>
> > > >>>> You raised some excellent points on the metrics strategy. Here’s
> > > >>>> my thinking on them:
> > > >>>>
> > > >>>> 1. Why Thread-Level Metrics:
> > > >>>>
> > > >>>> We opted for thread-level reporting for two main reasons:
> > > >>>> debuggability and consistency. When a rebalance gets stuck,
> > > >>>> operators need to pinpoint exactly which StreamThread is the
> > > >>>> bottleneck, as each one can have a very different workload. This
> > > >>>> approach also aligns with all other core metrics (like
> > > >>>> process-latency), which are already scoped to the thread.
> > > >>>>
> > > >>>> While it is possible to add application-level aggregates, they
> > > >>>> wouldn't offer new insights since any application-wide issue will
> > > >>>> always show up in one or more threads. I felt this gives operators
> > > >>>> the most diagnostic power without adding noise.
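> > > >>>>
> > > >>>> And if someone really wants an application-level view, it can be
> > > >>>> derived externally from the per-thread values -- a sketch (the
> > > >>>> metric name here is just an assumed example):
> > > >>>>
> > > >>>>     import java.util.OptionalDouble;
> > > >>>>     import org.apache.kafka.streams.KafkaStreams;
> > > >>>>
> > > >>>>     static OptionalDouble worstThread(final KafkaStreams streams,
> > > >>>>                                       final String metricName) {
> > > >>>>         // Max across all StreamThreads, e.g. for a metric
> > > >>>>         // named "tasks-revoked-latency-max".
> > > >>>>         return streams.metrics().entrySet().stream()
> > > >>>>             .filter(e -> metricName.equals(e.getKey().name()))
> > > >>>>             .mapToDouble(e -> ((Number) e.getValue().metricValue()).doubleValue())
> > > >>>>             .max();
> > > >>>>     }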
> > > >>>>
> > > >>>> 2. Avg/Max vs. Percentiles:
> > > >>>>
> > > >>>> On avg/max: I think it's good for now, mainly because of the
> > > >>>> nature of rebalances. They're infrequent but high-impact events.
> > > >>>> Unlike a constant stream of processing operations, a single slow
> > > >>>> rebalance is the production issue, making max latency the most
> > > >>>> critical signal for an operator.
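> > > >>>>
> > > >>>> For concreteness, the avg/max pair follows the usual sensor
> > > >>>> pattern -- a sketch using the client metrics classes (the metric
> > > >>>> and sensor names are assumed examples):
> > > >>>>
> > > >>>>     import org.apache.kafka.common.metrics.Metrics;
> > > >>>>     import org.apache.kafka.common.metrics.Sensor;
> > > >>>>     import org.apache.kafka.common.metrics.stats.Avg;
> > > >>>>     import org.apache.kafka.common.metrics.stats.Max;
> > > >>>>
> > > >>>>     final Metrics metrics = new Metrics();
> > > >>>>     final Sensor sensor = metrics.sensor("tasks-revoked-latency");
> > > >>>>     sensor.add(metrics.metricName("tasks-revoked-latency-avg",
> > > >>>>         "stream-thread-metrics"), new Avg());
> > > >>>>     sensor.add(metrics.metricName("tasks-revoked-latency-max",
> > > >>>>         "stream-thread-metrics"), new Max());
> > > >>>>
> > > >>>>     final long start = System.currentTimeMillis();
> > > >>>>     // ... run the revocation callback ...
> > > >>>>     sensor.record(System.currentTimeMillis() - start);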
> > > >>>>
> > > >>>> Percentiles are less statistically meaningful for such
> > > >>>> low-frequency events and introduce a memory overhead we'd like to
> > > >>>> avoid initially.
> > > >>>>
> > > >>>> We can definitely consider adding percentiles in a future KIP if
> > > >>>> we find avg/max isn't sufficient once this is in production.
> > > >>>>
> > > >>>> Let me know if this reasoning makes sense. Happy to discuss it
> > > >>>> more!
> > > >>>>
> > > >>>> Best,
> > > >>>> Travis
> > > >>>>
> > > >>>>
> > > >>>> On Thu, Sep 11, 2025 at 8:53 AM Alieh Saeedi
> > > >>>> <[email protected]> wrote:
> > > >>>>>
> > > >>>>> Hey Travis
> > > >>>>>
> > > >>>>> Thanks for sharing the KIP.
> > > >>>>>
> > > >>>>> One suggestion (not essential): would it be possible to include
> > > >>>>> the relevant code snippet and the new class directly in the KIP,
> > > >>>>> in the `Proposed Changes` section? That way, everything is
> > > >>>>> self-contained and there’s no need to switch between the KIP and
> > > >>>>> the codebase.
> > > >>>>>
> > > >>>>> I understand that you’re incorporating the existing metrics from
> > > >>>>> the old protocol into the new one, with the goal of maintaining
> > > >>>>> consistency in the metrics provided. However, I still have a few
> > > >>>>> questions that might be best addressed here, as this seems like
> > > >>>>> the ideal time to raise them and reconsider our approach.
> > > >>>>> 1. Why are the new metrics being recorded at the thread level
> > > >>>>> exclusively? Would there be value in exposing these metrics at
> > > >>>>> additional levels (such as application), especially for operators
> > > >>>>> managing large topologies?
> > > >>>>>
> > > >>>>> 2. Are the chosen latency metrics—average and max—sufficient for
> > > >>>>> diagnosing issues in production, or should more granular
> > > >>>>> statistics (e.g., percentile latencies) be considered to improve
> > > >>>>> observability?
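> > > >>>>>
> > > >>>>> For concreteness, the client metrics library already ships a
> > > >>>>> Percentiles stat that could back this -- a sketch (bucket sizing
> > > >>>>> and names are made-up examples):
> > > >>>>>
> > > >>>>>     import org.apache.kafka.common.metrics.Metrics;
> > > >>>>>     import org.apache.kafka.common.metrics.Sensor;
> > > >>>>>     import org.apache.kafka.common.metrics.stats.Percentile;
> > > >>>>>     import org.apache.kafka.common.metrics.stats.Percentiles;
> > > >>>>>
> > > >>>>>     final Metrics metrics = new Metrics();
> > > >>>>>     final Sensor sensor = metrics.sensor("rebalance-latency");
> > > >>>>>     // 4 KB of buckets, values up to 60s, linear bucket sizing;
> > > >>>>>     // this buffer is the memory overhead being discussed.
> > > >>>>>     sensor.add(new Percentiles(4096, 60_000, Percentiles.BucketSizing.LINEAR,
> > > >>>>>         new Percentile(metrics.metricName("rebalance-latency-p99",
> > > >>>>>             "stream-thread-metrics"), 99)));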
> > > >>>>>
> > > >>>>> Let me know your thoughts!
> > > >>>>>
> > > >>>>>
> > > >>>>> Bests,
> > > >>>>> Alieh
> > > >>>>>
> > > >>>>> On Wed, Sep 10, 2025 at 7:38 PM Travis Zhang <[email protected]>
> > > >>>>> wrote:
> > > >>>>>
> > > >>>>>> Hi,
> > > >>>>>>
> > > >>>>>> I'd like to start a discussion on KIP-1216:
> > > >>>>>>
> > > >>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1216%3A+Add+rebalance+listener+metrics+for+Kafka+Streams
> > > >>>>>>
> > > >>>>>> This KIP proposes adding latency metrics for each rebalance
> > > >>>>>> callback to provide operators with the observability needed to
> > > >>>>>> effectively monitor and optimize Kafka Streams applications in
> > > >>>>>> production environments.
> > > >>>>>>
> > > >>>>>> Thanks,
> > > >>>>>> Travis
> > > >>>>>>
> > > >>>>
> > >
> > >
>
