Re: [VOTE] KIP-714: Client metrics and observability

Jason Gustafson Wed, 13 Sep 2023 17:49:11 -0700

Hey Andrew,

+1 on the KIP. For many users of Kafka, it may not be fully understood how
much of a challenge client monitoring is. With tens of clients in a
cluster, it is already difficult to coordinate metrics collection. When
there are thousands of clients, and when the cluster operator has no
control over them, it is essentially impossible. For the fat clients that
we have, the lack of useful telemetry is a huge operational gap.
Consistency between clients has also been a major challenge. I think the
effort toward standardization in this KIP will have some positive impact
even in deployments which have effective client-side monitoring. Overall, I
think this proposal will provide a lot of value across the board.


Best,
Jason

On Wed, Sep 13, 2023 at 9:50 AM Philip Nee <[email protected]> wrote:

> Hey Andrew -
>
> Thank you for taking the time to reply to my questions. I'm just adding
> some notes to this discussion.
>
> 1. epoch: It can be helpful to know the delta between the client-side epoch
> and the actual leader epoch. It helps explain why a commit sometimes fails
> or why the client is not making progress.
> 2. Client connection: If the client selects the "wrong" connection to push
> out the data, I assume the request would time out, which should lead to
> disconnecting from the node and reselecting another node as you mentioned,
> via the least loaded node.
>
> Cheers,
> P
>
>
> On Tue, Sep 12, 2023 at 10:40 AM Andrew Schofield <
> [email protected]> wrote:
>
> > Hi Philip,
> > Thanks for your vote and interest in the KIP.
> >
> > KIP-714 does not introduce any new client metrics, and that’s intentional.
> > It does tell how all of the client metrics can have their names transformed
> > into equivalent “telemetry metric names”, and then potentially used in
> > metrics subscriptions.
> >
> > I am interested in the idea of the client’s leader epoch in this context,
> > but I don’t have an immediate plan for how best to do this, and it would
> > take another KIP to enhance existing metrics or introduce some new ones.
> > Those would then naturally be applicable to the metrics push introduced in
> > KIP-714.
> >
> > In a similar vein, there are no existing client metrics specifically for
> > auto-commit. We could add them to Kafka, but I really think this is just an
> > example of asynchronous commit in which the application has decided not to
> > specify when the commit should begin.
> >
> > It is possible to increase the cadence of pushing by modifying the
> > interval.ms configuration property of the CLIENT_METRICS resource.
> >
> > There is an “assigned-partitions” metric for each consumer, but not one for
> > active partitions. We could add one, again as a follow-on KIP.
> >
> > I take your point about holding on to a connection in a channel which might
> > experience congestion. Do you have a suggestion for how to improve on this?
> > For example, the client does have the concept of a least-loaded node. Maybe
> > this is something we should investigate in the implementation and decide on
> > the best approach. In general, I think sticking with the same node for
> > consecutive pushes is best, but if you choose the “wrong” node to start
> > with, it’s not ideal.
> >
> > Thanks,
> > Andrew
> >
> > > On 8 Sep 2023, at 19:29, Philip Nee <[email protected]> wrote:
> > >
> > > Hey Andrew -
> > >
> > > +1 but I don't have a binding vote!
> > >
> > > It took me a while to go through the KIP. Here are some of my notes from
> > > reading it:
> > >
> > > *Metrics*
> > > - Should we care about the client's leader epoch? There is a case where the
> > > user recreates the topic, but the consumer thinks it is still the same
> > > topic and therefore, attempts to start from an offset that doesn't exist.
> > > KIP-848 addresses this issue, but I can still see some potential benefits
> > > from knowing the client's epoch information.
> > > - I assume poll idle is similar to poll interval: I needed to read the
> > > description a few times.
> > > - I don't have a clear use case in mind for the commit latency, but I do
> > > think sometimes people lack clarity about how much progress was tracked by
> > > the auto-commit. Would tracking auto-commit-related metrics be useful? I
> > > was thinking: the last offset committed or the actual cadence in ms.
> > > - Are there cases when we need to increase the cadence of telemetry data
> > > push? i.e. variable interval.
> > > - Thanks for implementing the randomized initial metric push; I think it is
> > > really important.
> > > - Is there a potential use case for tracking the number of active
> > > partitions? The consumer can pause partitions via API, during revocation,
> > > or during offset reset for the stream.
> > >
> > > *Connections*:
> > > - The KIP stated that it will keep the same connection until the connection
> > > is disconnected. I wonder if that could potentially cause congestion if it
> > > is already a busy channel, which leads to connection timeout and
> > > subsequently disconnection.
> > >
> > > Thanks,
> > > P
> > >
> > > On Fri, Sep 8, 2023 at 4:15 AM Andrew Schofield <
> > > [email protected]> wrote:
> > >
> > >> Bumping the voting thread for KIP-714.
> > >>
> > >> So far, we have:
> > >> Non-binding +2 (Milind and Kirk), non-binding -1 (Ryanne)
> > >>
> > >> Thanks,
> > >> Andrew
> > >>
> > >>> On 4 Aug 2023, at 09:45, Andrew Schofield <[email protected]> wrote:
> > >>>
> > >>> Hi,
> > >>> After almost 2 1/2 years in the making, I would like to call a vote for
> > >>> KIP-714
> > >>> (https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability).
> > >>>
> > >>> This KIP aims to improve monitoring and troubleshooting of client
> > >>> performance by enabling clients to push metrics to brokers.
> > >>>
> > >>> I’d like to thank everyone that participated in the discussion,
> > >>> especially the librdkafka team since one of the aims of the KIP is to
> > >>> enable any client to participate, not just the Apache Kafka project’s
> > >>> Java clients.
> > >>>
> > >>> Thanks,
> > >>> Andrew
> >
> >
> >
>
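
For readers following the thread, the transformation of existing client metric names into "telemetry metric names" that Andrew mentions could look roughly like the sketch below. It assumes the rule is: prefix "org.apache.kafka.", drop a trailing "-metrics" from the metric group, append the metric name, and replace hyphens with dots. That rule is an assumption based on a reading of the KIP, not something stated in this thread, and the class and method names are hypothetical. The KIP text remains the authority.

```java
public class TelemetryNameSketch {
    // Assumed prefix for all telemetry metric names (an assumption, see note above).
    private static final String PREFIX = "org.apache.kafka.";
    private static final String GROUP_SUFFIX = "-metrics";

    // Builds a telemetry-style name from an existing metric group and metric name.
    static String telemetryName(String group, String name) {
        String trimmedGroup = group.endsWith(GROUP_SUFFIX)
                ? group.substring(0, group.length() - GROUP_SUFFIX.length())
                : group;
        // Hyphens in both the group and the name become dots.
        return (PREFIX + trimmedGroup + "." + name).replace('-', '.');
    }

    public static void main(String[] args) {
        // The "assigned-partitions" consumer metric discussed in the thread.
        System.out.println(telemetryName("consumer-coordinator-metrics", "assigned-partitions"));
    }
}
```

Under these assumptions, the main method prints org.apache.kafka.consumer.coordinator.assigned.partitions, which is the kind of name a metrics subscription would then match against.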

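The connection question discussed above (stick with the same node for consecutive pushes, and reselect via the least-loaded node after a disconnect) can be illustrated with a small sketch. This is not the Kafka client's implementation: the class name, the method, and the use of an in-flight request count as the "load" measure are all hypothetical, chosen only to show the selection policy.

```java
import java.util.Map;

public class StickyNodeSketch {
    // Node used for the previous push, or null if none has been chosen yet.
    private Integer current;

    // Picks the node for the next push. The argument maps each currently
    // connected node id to its number of in-flight requests (hypothetical load measure).
    int choose(Map<Integer, Integer> inFlightByConnectedNode) {
        boolean stillConnected = current != null && inFlightByConnectedNode.containsKey(current);
        if (!stillConnected) {
            // First push, or the previous node disconnected: pick the least-loaded node.
            current = inFlightByConnectedNode.entrySet().stream()
                    .min(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey)
                    .orElseThrow(() -> new IllegalStateException("no connected nodes"));
        }
        return current;
    }

    public static void main(String[] args) {
        StickyNodeSketch selector = new StickyNodeSketch();
        System.out.println(selector.choose(Map.of(1, 5, 2, 0, 3, 9))); // least-loaded node
        System.out.println(selector.choose(Map.of(1, 0, 2, 7, 3, 9))); // same node, still connected
        System.out.println(selector.choose(Map.of(1, 4, 3, 1)));       // previous node gone, reselect
    }
}
```

The sketch also shows the trade-off raised in the thread: once a node is chosen it is kept even if it later becomes the busiest one, so a poor initial choice is only corrected after a disconnect.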