Hey Lucas,

Thinking about the original intent of the KIP, I agree it would be
better to drop the DEBUG metrics suggestion and have it as follow-on
work.

-Bill
On Tue, Sep 30, 2025 at 4:01 AM Lucas Brutschy <[email protected]> wrote:
> Hi Matthias / Bill,
>
> it's a good point that there is an overlap between the debug metrics
> and task-created-rate/task-created-total.
>
> I wonder if we are overloading this KIP with the DEBUG metrics that
> Bill suggested. The main point of the KIP is to capture the latency of
> revoking tasks and handling assignment of tasks. Understanding the
> latency of revoking / assigning tasks inside streams is important to
> understand the duration of rebalances.
>
> Capturing the number of tasks assigned / revoked / lost, deprecating
> the task-closed-rate/task-closed-total metrics, etc., and potentially
> including revoked / added task IDs in metrics all sound useful, but a
> bit orthogonal to the original point of the KIP. Could we leave this
> to future work?
>
> Cheers,
> Lucas
>
> On Tue, Sep 30, 2025 at 2:02 AM Kirk True <[email protected]> wrote:
> >
> > Hi Travis,
> >
> > Thanks for the KIP!
> >
> > No comments on the KIP, per se, but I'm glad I read it, because I
> > don't remember ever hearing about metrics recording levels before.
> > I'll definitely plan to put those into use for some of the metrics
> > we added recently to the consumer.
> >
> > Thanks,
> > Kirk
> >
> > On Mon, Sep 29, 2025, at 11:24 AM, Matthias J. Sax wrote:
> > > Thanks for the KIP Travis.
> > >
> > > The 3 new latency metrics sound very useful.
> > >
> > > For the 4 new debug metrics: they sound somewhat redundant to
> > > existing metrics:
> > > - task-created-rate
> > > - task-created-total
> > > - task-closed-rate
> > > - task-closed-total
> > >
> > > Are you aware that these metrics already exist? I don't see why
> > > they would not work if the "streams" rebalance protocol gets used.
> > >
> > > Btw: I was always wondering about the usefulness of the two
> > > `-total` metrics. How is it useful to know, for a long-running
> > > application, how many tasks got created or closed during the whole
> > > lifetime of a StreamThread?
> > >
> > > It could be useful, though, to know the number of
> > > created/revoked/lost tasks of the last rebalance, i.e., we would
> > > use a gauge instead of a sum metric?
> > >
> > > Splitting out active/standby/warmup as proposed by Lucas sounds
> > > useful, too. So maybe we could deprecate the existing metrics and
> > > replace them with better ones?
> > >
> > > What is the reason to split out active/standby (and warmup) for
> > > the "assigned" case, but not the revoked or lost case?
> > >
> > > Personally, I don't think we should add task IDs to metrics. If
> > > users need to access this information, it might be better to add
> > > some callback/listener they can register on `KafkaStreams` -- but
> > > even for this, I am not sure how useful it would be. Was any user
> > > reporting that it would be useful?
> > >
> > > -Matthias
> > >
> > > On 9/16/25 2:14 AM, Lucas Brutschy wrote:
> > > > Hi Travis,
> > > >
> > > > thanks for the KIP!
> > > >
> > > > Looks good to me. I'm not sure we need the DEBUG metrics, but we
> > > > can add them. I would, however, also include warm-up tasks in
> > > > the metrics if you are including active / standby. Furthermore,
> > > > I also wasn't sure if Bill wanted to add the number of tasks or
> > > > the actual task IDs to the DEBUG metrics. Bill, maybe you can
> > > > comment on that.
> > > >
> > > > I think after hashing out these finer points about the DEBUG
> > > > metrics, we can open a vote thread.
> > > >
> > > > Cheers,
> > > > Lucas
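
A note on the recording levels Kirk mentions above: metrics tagged
DEBUG are only collected once `metrics.recording.level` is raised from
its default of INFO, so the DEBUG metrics debated in this thread would
be strictly opt-in. A minimal sketch of that opt-in; the application id
and bootstrap servers are placeholders:

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class DebugMetricsConfigExample {
        public static Properties streamsProps() {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");    // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            // DEBUG-tagged sensors only record when the recording level
            // is raised from its default of INFO.
            props.put(StreamsConfig.METRICS_RECORDING_LEVEL_CONFIG, "DEBUG");
            return props;
        }
    }

At the default INFO level, DEBUG-tagged sensors skip recording, which
is why such metrics are cheap to leave disabled in production.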
> > > >
> > > > On Mon, Sep 15, 2025 at 6:38 AM Travis Zhang
> > > > <[email protected]> wrote:
> > > >>
> > > >> Hi Bill,
> > > >>
> > > >> Thanks for your feedback. It does make sense to me. I've added
> > > >> the total task count metrics to the KIP at DEBUG level!
> > > >>
> > > >> Best,
> > > >> Travis
> > > >>
> > > >> On Fri, Sep 12, 2025 at 11:44 AM Bill Bejeck <[email protected]> wrote:
> > > >>>
> > > >>> Hi Travis,
> > > >>>
> > > >>> Thanks for the KIP! It looks like a useful addition in support
> > > >>> of KIP-1017. Overall the KIP LGTM, but I have a follow-up
> > > >>> question.
> > > >>>
> > > >>> Would we want to consider an additional metric displaying the
> > > >>> tasks involved in each of the revoked, assigned, and lost
> > > >>> events? This would probably be best at the DEBUG level.
> > > >>> Certainly this is an optional suggestion, but I do feel it
> > > >>> would be a valuable aid to operators of KS applications.
> > > >>>
> > > >>> Regards,
> > > >>> Bill
> > > >>>
> > > >>> On Fri, Sep 12, 2025 at 12:04 AM Travis Zhang <[email protected]>
> > > >>> wrote:
> > > >>>
> > > >>>> Hey Alieh,
> > > >>>>
> > > >>>> Thanks for the great questions and the thoughtful feedback on
> > > >>>> the KIP!
> > > >>>>
> > > >>>> Good call on adding the code snippets -- I'll get the key
> > > >>>> class structures into the KIP to make it fully self-contained.
> > > >>>>
> > > >>>> You raised some excellent points on the metrics strategy.
> > > >>>> Here's my thinking on them:
> > > >>>>
> > > >>>> 1. Why thread-level metrics:
> > > >>>>
> > > >>>> We opted for thread-level reporting for two main reasons:
> > > >>>> debuggability and consistency. When a rebalance gets stuck,
> > > >>>> operators need to pinpoint exactly which StreamThread is the
> > > >>>> bottleneck, as each one can have a very different workload.
> > > >>>> This approach also aligns with all other core metrics (like
> > > >>>> process-latency), which are already scoped to the thread.
> > > >>>>
> > > >>>> While it is possible to add application-level aggregates,
> > > >>>> they wouldn't offer new insights, since any application-wide
> > > >>>> issue will always show up in one or more threads. I felt this
> > > >>>> gives operators the most diagnostic power without adding
> > > >>>> noise.
> > > >>>>
> > > >>>> 2. Avg/max vs. percentiles:
> > > >>>>
> > > >>>> I think avg/max is good for now, mainly because of the nature
> > > >>>> of rebalances. They're infrequent but high-impact events.
> > > >>>> Unlike a constant stream of processing operations, a single
> > > >>>> slow rebalance is the production issue, making max latency
> > > >>>> the most critical signal for an operator.
> > > >>>>
> > > >>>> Percentiles are less statistically meaningful for such
> > > >>>> low-frequency events and introduce a memory overhead we'd
> > > >>>> like to avoid initially.
> > > >>>>
> > > >>>> We can definitely consider adding percentiles in a future KIP
> > > >>>> if we find avg/max isn't sufficient once this is in
> > > >>>> production.
> > > >>>>
> > > >>>> Let me know if this reasoning makes sense. Happy to discuss
> > > >>>> it more!
> > > >>>>
> > > >>>> Best,
> > > >>>> Travis
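
To make the thread-level scoping above concrete: every thread-level
metric carries a `thread-id` tag, so an operator can pinpoint the slow
StreamThread directly from `KafkaStreams#metrics()`. A sketch, assuming
purely for illustration that the KIP's revocation-latency metric lands
in the existing `stream-thread-metrics` group under a name like
`onTasksRevoked-latency-max` (the final names belong to KIP-1216):

    import java.util.Map;
    import org.apache.kafka.common.Metric;
    import org.apache.kafka.common.MetricName;
    import org.apache.kafka.streams.KafkaStreams;

    public class RebalanceLatencyProbe {
        // Prints the max revocation-callback latency per StreamThread.
        // "onTasksRevoked-latency-max" is a hypothetical stand-in for
        // whatever metric names the KIP finally adopts.
        public static void printRevocationLatencies(KafkaStreams streams) {
            for (Map.Entry<MetricName, ? extends Metric> e : streams.metrics().entrySet()) {
                MetricName name = e.getKey();
                if (name.group().equals("stream-thread-metrics")
                        && name.name().equals("onTasksRevoked-latency-max")) {
                    // The thread-id tag identifies the reporting StreamThread.
                    System.out.printf("%s -> %s%n",
                        name.tags().get("thread-id"), e.getValue().metricValue());
                }
            }
        }
    }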
> > > >>>>
> > > >>>> On Thu, Sep 11, 2025 at 8:53 AM Alieh Saeedi
> > > >>>> <[email protected]> wrote:
> > > >>>>>
> > > >>>>> Hey Travis,
> > > >>>>>
> > > >>>>> Thanks for sharing the KIP.
> > > >>>>>
> > > >>>>> One suggestion (not essential): would it be possible to
> > > >>>>> include the relevant code snippets and the new class
> > > >>>>> directly in the KIP, in the `Proposed Changes` section? That
> > > >>>>> way, everything is self-contained and there's no need to
> > > >>>>> switch between the KIP and the codebase.
> > > >>>>>
> > > >>>>> I understand that you're incorporating the existing metrics
> > > >>>>> from the old protocol into the new one, with the goal of
> > > >>>>> maintaining consistency in the metrics provided. However, I
> > > >>>>> still have a few questions that might be best addressed
> > > >>>>> here, as this seems like the ideal time to raise them and
> > > >>>>> reconsider our approach.
> > > >>>>>
> > > >>>>> 1. Why are the new metrics being recorded at the thread
> > > >>>>> level exclusively? Would there be value in exposing these
> > > >>>>> metrics at additional levels (such as application),
> > > >>>>> especially for operators managing large topologies?
> > > >>>>>
> > > >>>>> 2. Are the chosen latency metrics (average and max)
> > > >>>>> sufficient for diagnosing issues in production, or should
> > > >>>>> more granular statistics (e.g., percentile latencies) be
> > > >>>>> considered to improve observability?
> > > >>>>>
> > > >>>>> Let me know your thoughts!
> > > >>>>>
> > > >>>>> Bests,
> > > >>>>> Alieh
> > > >>>>>
> > > >>>>> On Wed, Sep 10, 2025 at 7:38 PM Travis Zhang <[email protected]>
> > > >>>>> wrote:
> > > >>>>>
> > > >>>>>> Hi,
> > > >>>>>>
> > > >>>>>> I'd like to start a discussion on KIP-1216:
> > > >>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1216%3A+Add+rebalance+listener+metrics+for+Kafka+Streams
> > > >>>>>>
> > > >>>>>> This KIP proposes adding latency metrics for each rebalance
> > > >>>>>> callback to provide operators with the observability needed
> > > >>>>>> to effectively monitor and optimize Kafka Streams
> > > >>>>>> applications in production environments.
> > > >>>>>>
> > > >>>>>> Thanks,
> > > >>>>>> Travis
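
And on Matthias's gauge-instead-of-sum idea earlier in the thread: the
common metrics library already supports gauges that report a current
value on demand, so "tasks created by the last rebalance" would not
need a cumulative counter. A rough sketch against
org.apache.kafka.common.metrics; the metric name and tag value are
illustrative only and not part of the KIP:

    import java.util.Collections;
    import java.util.concurrent.atomic.AtomicInteger;
    import org.apache.kafka.common.MetricName;
    import org.apache.kafka.common.metrics.Gauge;
    import org.apache.kafka.common.metrics.Metrics;

    public class LastRebalanceGauge {
        private final AtomicInteger lastCreatedTasks = new AtomicInteger(0);

        public void register(Metrics metrics) {
            MetricName name = metrics.metricName(
                "last-rebalance-tasks-created",            // illustrative name
                "stream-thread-metrics",                   // existing thread-level group
                "Tasks created by the most recent rebalance",
                Collections.singletonMap("thread-id", "stream-thread-1")); // placeholder tag
            // A gauge is sampled for its current value instead of
            // accumulating, so it reflects only the last rebalance.
            metrics.addMetric(name, (Gauge<Integer>) (config, now) -> lastCreatedTasks.get());
        }

        // Would be called from the rebalance-handling code after each assignment.
        public void onRebalanceComplete(int createdTasks) {
            lastCreatedTasks.set(createdTasks);
        }
    }

Unlike the existing task-created-total sum, a gauge like this always
reflects the most recent rebalance only, which is closer to the
question operators actually ask.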
