Re: [VOTE] KIP-714: Client metrics and observability

Sophie Blee-Goldman Wed, 11 Oct 2023 16:00:17 -0700

This looks great! +1 (binding)

Sophie


On Wed, Oct 11, 2023 at 1:46 PM Matthias J. Sax <mj...@apache.org> wrote:

> +1 (binding)
>
> On 9/13/23 5:48 PM, Jason Gustafson wrote:
> > Hey Andrew,
> >
> > +1 on the KIP. For many users of Kafka, it may not be fully understood how
> > much of a challenge client monitoring is. With tens of clients in a
> > cluster, it is already difficult to coordinate metrics collection. When
> > there are thousands of clients, and when the cluster operator has no
> > control over them, it is essentially impossible. For the fat clients that
> > we have, the lack of useful telemetry is a huge operational gap.
> > Consistency between clients has also been a major challenge. I think the
> > effort toward standardization in this KIP will have some positive impact
> > even in deployments which have effective client-side monitoring. Overall, I
> > think this proposal will provide a lot of value across the board.
> >
> > Best,
> > Jason
> >
> > On Wed, Sep 13, 2023 at 9:50 AM Philip Nee <philip...@gmail.com> wrote:
> >
> >> Hey Andrew -
> >>
> >> Thank you for taking the time to reply to my questions. I'm just adding
> >> some notes to this discussion.
> >>
> >> 1. epoch: It can be helpful to know the delta between the client-side epoch
> >> and the actual leader epoch. It helps explain why a commit sometimes fails
> >> or why a client is not making progress.
> >> 2. Client connection: If the client selects the "wrong" connection to push
> >> out the data, I assume the request would time out, which should lead to
> >> disconnecting from the node and reselecting another node, as you mentioned,
> >> via the least-loaded node.
> >>
> >> Cheers,
> >> P
> >>
> >>
> >> On Tue, Sep 12, 2023 at 10:40 AM Andrew Schofield <
> >> andrew_schofield_j...@outlook.com> wrote:
> >>
> >>> Hi Philip,
> >>> Thanks for your vote and interest in the KIP.
> >>>
> >>> KIP-714 does not introduce any new client metrics, and that’s intentional.
> >>> It does describe how all of the client metrics can have their names
> >>> transformed into equivalent “telemetry metric names”, which can then
> >>> potentially be used in metrics subscriptions.
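
As a rough illustration of the name transformation mentioned above, here is a
minimal Python sketch. It assumes the rule is: take a fixed
"org.apache.kafka" prefix, drop the "-metrics" suffix from the metric group,
and replace hyphens with dots in both the group and the metric name. The
function name and the example metric are illustrative only; the KIP text is
the authority on the exact rule.

```python
# Assumed prefix for telemetry metric names (check the KIP for the exact value).
PREFIX = "org.apache.kafka"


def telemetry_metric_name(group: str, name: str) -> str:
    """Map a Kafka client metric (group, name) to a dotted telemetry name.

    Hypothetical helper: the suffix-stripping and hyphen-to-dot rules are
    assumptions made for this sketch.
    """
    # Drop the conventional "-metrics" suffix from the group, if present.
    if group.endswith("-metrics"):
        group = group[: -len("-metrics")]
    # Replace hyphens with dots so the result forms a dotted hierarchy.
    parts = [PREFIX, group.replace("-", "."), name.replace("-", ".")]
    return ".".join(parts)


if __name__ == "__main__":
    # prints "org.apache.kafka.producer.connection.creation.rate"
    print(telemetry_metric_name("producer-metrics", "connection-creation-rate"))
```

A subscription could then match on a prefix of the resulting name, for
example "org.apache.kafka.producer.".
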
> >>>
> >>> I am interested in the idea of the client’s leader epoch in this context,
> >>> but I don’t have an immediate plan for how best to do this, and it would
> >>> take another KIP to enhance existing metrics or introduce some new ones.
> >>> Those would then naturally be applicable to the metrics push introduced in
> >>> KIP-714.
> >>>
> >>> In a similar vein, there are no existing client metrics specifically for
> >>> auto-commit. We could add them to Kafka, but I really think this is just an
> >>> example of asynchronous commit in which the application has decided not to
> >>> specify when the commit should begin.
> >>>
> >>> It is possible to increase the cadence of pushing by modifying the
> >>> interval.ms
> >>> configuration property of the CLIENT_METRICS resource.
> >>>
> >>> There is an “assigned-partitions” metric for each consumer, but not one for
> >>> active partitions. We could add one, again as a follow-on KIP.
> >>>
> >>> I take your point about holding on to a connection in a channel which might
> >>> experience congestion. Do you have a suggestion for how to improve on this?
> >>> For example, the client does have the concept of a least-loaded node. Maybe
> >>> this is something we should investigate in the implementation and decide
> >>> on the best approach. In general, I think sticking with the same node for
> >>> consecutive pushes is best, but if you choose the “wrong” node to start
> >>> with, it’s not ideal.
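
The "stick with the same node, fall back to the least-loaded node" idea
discussed above could be sketched as follows. This is a toy model in Python,
not the Java client's implementation: the function name, the set of connected
node ids, and the in-flight request counts are all assumptions made for
illustration.

```python
from typing import Dict, Optional, Set


def choose_push_node(
    current: Optional[str],
    connected: Set[str],
    in_flight: Dict[str, int],
) -> Optional[str]:
    """Pick the node for the next telemetry push (hypothetical helper).

    Keep using the current node while its connection is alive; otherwise
    fall back to the connected node with the fewest in-flight requests.
    """
    # Sticky behaviour: reuse the current node if it is still connected.
    if current is not None and current in connected:
        return current
    # Nothing to choose from.
    if not connected:
        return None
    # Least-loaded fallback; sorting first makes ties deterministic.
    return min(sorted(connected), key=lambda node: in_flight.get(node, 0))


if __name__ == "__main__":
    # The current node "b" has disconnected, so "c" (1 in-flight) is chosen.
    print(choose_push_node("b", {"a", "c"}, {"a": 3, "c": 1}))
```

In this model a request timeout on a congested node would show up as that
node leaving the connected set, at which point the next push moves elsewhere.
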
> >>>
> >>> Thanks,
> >>> Andrew
> >>>
> >>>> On 8 Sep 2023, at 19:29, Philip Nee <philip...@gmail.com> wrote:
> >>>>
> >>>> Hey Andrew -
> >>>>
> >>>> +1 but I don't have a binding vote!
> >>>>
> >>>> It took me a while to go through the KIP. Here are some of my notes during
> >>>> the reading:
> >>>>
> >>>> *Metrics*
> >>>> - Should we care about the client's leader epoch? There is a case where the
> >>>> user recreates the topic, but the consumer thinks it is still the same
> >>>> topic and therefore attempts to start from an offset that doesn't exist.
> >>>> KIP-848 addresses this issue, but I can still see some potential benefits
> >>>> from knowing the client's epoch information.
> >>>> - I assume poll idle is similar to poll interval: I needed to read the
> >>>> description a few times.
> >>>> - I don't have a clear use case in mind for the commit latency, but I do
> >>>> think sometimes people lack clarity about how much progress was tracked by
> >>>> the auto-commit. Would tracking auto-commit-related metrics be useful? I
> >>>> was thinking: the last offset committed or the actual cadence in ms.
> >>>> - Are there cases when we need to increase the cadence of telemetry data
> >>>> push? i.e. variable interval.
> >>>> - Thanks for implementing the randomized initial metric push; I think it is
> >>>> really important.
> >>>> - Is there a potential use case for tracking the number of active
> >>>> partitions? The consumer can pause partitions via API, during revocation,
> >>>> or during offset reset for the stream.
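
On the randomized initial metric push mentioned in the list above: the point
is to keep a large fleet of clients that start together from all pushing at
the same instant. A minimal Python sketch, assuming the first delay is drawn
uniformly between 0.5x and 1.5x the configured push interval (the bounds and
the function name are assumptions here, not quoted from the KIP):

```python
import random


def first_push_delay_ms(push_interval_ms: int, rng: random.Random) -> float:
    """Return a jittered delay before the first telemetry push.

    Hypothetical helper: draws uniformly from
    [0.5 * push_interval_ms, 1.5 * push_interval_ms].
    """
    return rng.uniform(0.5 * push_interval_ms, 1.5 * push_interval_ms)


if __name__ == "__main__":
    rng = random.Random()
    # With a 60 s interval, each client waits somewhere between 30 s and 90 s.
    print(first_push_delay_ms(60_000, rng))
```

Subsequent pushes would then follow the regular interval, so the initial
spread between clients is preserved.
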
> >>>>
> >>>> *Connections*:
> >>>> - The KIP stated that it will keep the same connection until the connection
> >>>> is disconnected. I wonder if that could potentially cause congestion if it
> >>>> is already a busy channel, which leads to connection timeout and
> >>>> subsequently disconnection.
> >>>>
> >>>> Thanks,
> >>>> P
> >>>>
> >>>> On Fri, Sep 8, 2023 at 4:15 AM Andrew Schofield <
> >>>> andrew_schofield_j...@outlook.com> wrote:
> >>>>
> >>>>> Bumping the voting thread for KIP-714.
> >>>>>
> >>>>> So far, we have:
> >>>>> Non-binding +2 (Milind and Kirk), non-binding -1 (Ryanne)
> >>>>>
> >>>>> Thanks,
> >>>>> Andrew
> >>>>>
> >>>>>> On 4 Aug 2023, at 09:45, Andrew Schofield <
> andrew_schofi...@live.com
> >>>
> >>>>> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>> After almost 2 1/2 years in the making, I would like to call a vote for
> >>>>>> KIP-714 (
> >>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> >>>>>> ).
> >>>>>>
> >>>>>> This KIP aims to improve monitoring and troubleshooting of client
> >>>>>> performance by enabling clients to push metrics to brokers.
> >>>>>>
> >>>>>> I’d like to thank everyone that participated in the discussion,
> >>>>>> especially the librdkafka team since one of the aims of the KIP is to
> >>>>>> enable any client to participate, not just the Apache Kafka project’s Java
> >>>>>> clients.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Andrew
> >>>
> >>>
> >>>
> >>
> >
>
