Hi, Andrew,

Thanks for the KIP. +1 from me too.
Jun

On Wed, Oct 11, 2023 at 4:00 PM Sophie Blee-Goldman <sop...@responsive.dev> wrote:

> This looks great! +1 (binding)
>
> Sophie
>
> On Wed, Oct 11, 2023 at 1:46 PM Matthias J. Sax <mj...@apache.org> wrote:
>
> > +1 (binding)
> >
> > On 9/13/23 5:48 PM, Jason Gustafson wrote:
> > > Hey Andrew,
> > >
> > > +1 on the KIP. For many users of Kafka, it may not be fully understood how
> > > much of a challenge client monitoring is. With tens of clients in a
> > > cluster, it is already difficult to coordinate metrics collection. When
> > > there are thousands of clients, and when the cluster operator has no
> > > control over them, it is essentially impossible. For the fat clients that
> > > we have, the lack of useful telemetry is a huge operational gap.
> > > Consistency between clients has also been a major challenge. I think the
> > > effort toward standardization in this KIP will have some positive impact
> > > even in deployments which have effective client-side monitoring. Overall, I
> > > think this proposal will provide a lot of value across the board.
> > >
> > > Best,
> > > Jason
> > >
> > > On Wed, Sep 13, 2023 at 9:50 AM Philip Nee <philip...@gmail.com> wrote:
> > >
> > >> Hey Andrew -
> > >>
> > >> Thank you for taking the time to reply to my questions. I'm just adding
> > >> some notes to this discussion.
> > >>
> > >> 1. epoch: It can be helpful to know the delta of the client side and the
> > >> actual leader epoch. It is helpful to understand why sometimes commit
> > >> fails/client not making progress.
> > >> 2. Client connection: If the client selects the "wrong" connection to push
> > >> out the data, I assume the request would timeout; which should lead to
> > >> disconnecting from the node and reselecting another node as you mentioned,
> > >> via the least loaded node.
> > >>
> > >> Cheers,
> > >> P
> > >>
> > >>
> > >> On Tue, Sep 12, 2023 at 10:40 AM Andrew Schofield <andrew_schofield_j...@outlook.com> wrote:
> > >>
> > >>> Hi Philip,
> > >>> Thanks for your vote and interest in the KIP.
> > >>>
> > >>> KIP-714 does not introduce any new client metrics, and that’s intentional.
> > >>> It does tell how all of the client metrics can have their names transformed
> > >>> into equivalent “telemetry metric names”, and then potentially used in
> > >>> metrics subscriptions.
> > >>>
> > >>> I am interested in the idea of client’s leader epoch in this context, but
> > >>> I don’t have an immediate plan for how best to do this, and it would take
> > >>> another KIP to enhance existing metrics or introduce some new ones. Those
> > >>> would then naturally be applicable to the metrics push introduced in KIP-714.
> > >>>
> > >>> In a similar vein, there are no existing client metrics specifically for
> > >>> auto-commit. We could add them to Kafka, but I really think this is just an
> > >>> example of asynchronous commit in which the application has decided not to
> > >>> specify when the commit should begin.
> > >>>
> > >>> It is possible to increase the cadence of pushing by modifying the
> > >>> interval.ms configuration property of the CLIENT_METRICS resource.
> > >>>
> > >>> There is an “assigned-partitions” metric for each consumer, but not one for
> > >>> active partitions. We could add one, again as a follow-on KIP.
> > >>>
> > >>> I take your point about holding on to a connection in a channel which might
> > >>> experience congestion. Do you have a suggestion for how to improve on this?
> > >>> For example, the client does have the concept of a least-loaded node.
> > >>> Maybe this is something we should investigate in the implementation and
> > >>> decide on the best approach. In general, I think sticking with the same
> > >>> node for consecutive pushes is best, but if you choose the “wrong” node to
> > >>> start with, it’s not ideal.
> > >>>
> > >>> Thanks,
> > >>> Andrew
> > >>>
> > >>>> On 8 Sep 2023, at 19:29, Philip Nee <philip...@gmail.com> wrote:
> > >>>>
> > >>>> Hey Andrew -
> > >>>>
> > >>>> +1 but I don't have a binding vote!
> > >>>>
> > >>>> It took me a while to go through the KIP. Here are some of my notes during
> > >>>> the reading:
> > >>>>
> > >>>> *Metrics*
> > >>>> - Should we care about the client's leader epoch? There is a case where the
> > >>>> user recreates the topic, but the consumer thinks it is still the same
> > >>>> topic and therefore, attempts to start from an offset that doesn't exist.
> > >>>> KIP-848 addresses this issue, but I can still see some potential benefits
> > >>>> from knowing the client's epoch information.
> > >>>> - I assume poll idle is similar to poll interval: I needed to read the
> > >>>> description a few times.
> > >>>> - I don't have a clear use case in mind for the commit latency, but I do
> > >>>> think sometimes people lack clarity about how much progress was tracked by
> > >>>> the auto-commit. Would tracking auto-commit-related metrics be useful? I
> > >>>> was thinking: the last offset committed or the actual cadence in ms.
> > >>>> - Are there cases when we need to increase the cadence of telemetry data
> > >>>> push? i.e. variable interval.
> > >>>> - Thanks for implementing the randomized initial metric push; I think it is
> > >>>> really important.
> > >>>> - Is there a potential use case for tracking the number of active
> > >>>> partitions?
> > >>>> The consumer can pause partitions via API, during revocation,
> > >>>> or during offset reset for the stream.
> > >>>>
> > >>>> *Connections*:
> > >>>> - The KIP stated that it will keep the same connection until the connection
> > >>>> is disconnected. I wonder if that could potentially cause congestion if it
> > >>>> is already a busy channel, which leads to connection timeout and
> > >>>> subsequently disconnection.
> > >>>>
> > >>>> Thanks,
> > >>>> P
> > >>>>
> > >>>> On Fri, Sep 8, 2023 at 4:15 AM Andrew Schofield <andrew_schofield_j...@outlook.com> wrote:
> > >>>>
> > >>>>> Bumping the voting thread for KIP-714.
> > >>>>>
> > >>>>> So far, we have:
> > >>>>> Non-binding +2 (Milind and Kirk), non-binding -1 (Ryanne)
> > >>>>>
> > >>>>> Thanks,
> > >>>>> Andrew
> > >>>>>
> > >>>>>> On 4 Aug 2023, at 09:45, Andrew Schofield <andrew_schofi...@live.com> wrote:
> > >>>>>>
> > >>>>>> Hi,
> > >>>>>> After almost 2 1/2 years in the making, I would like to call a vote for
> > >>>>>> KIP-714 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability).
> > >>>>>>
> > >>>>>> This KIP aims to improve monitoring and troubleshooting of client
> > >>>>>> performance by enabling clients to push metrics to brokers.
> > >>>>>>
> > >>>>>> I’d like to thank everyone that participated in the discussion,
> > >>>>>> especially the librdkafka team since one of the aims of the KIP is to
> > >>>>>> enable any client to participate, not just the Apache Kafka project’s Java
> > >>>>>> clients.
> > >>>>>>
> > >>>>>> Thanks,
> > >>>>>> Andrew