Re: [VOTE] KIP-714: Client metrics and observability

Jun Rao Fri, 13 Oct 2023 17:52:32 -0700

Hi, Andrew,

Thanks for the KIP. +1 from me too.


Jun

On Wed, Oct 11, 2023 at 4:00 PM Sophie Blee-Goldman <sop...@responsive.dev>
wrote:

> This looks great! +1 (binding)
>
> Sophie
>
> On Wed, Oct 11, 2023 at 1:46 PM Matthias J. Sax <mj...@apache.org> wrote:
>
> > +1 (binding)
> >
> > On 9/13/23 5:48 PM, Jason Gustafson wrote:
> > > Hey Andrew,
> > >
> > > +1 on the KIP. For many users of Kafka, it may not be fully understood
> > how
> > > much of a challenge client monitoring is. With tens of clients in a
> > > cluster, it is already difficult to coordinate metrics collection. When
> > > there are thousands of clients, and when the cluster operator has no
> > > control over them, it is essentially impossible. For the fat clients
> that
> > > we have, the lack of useful telemetry is a huge operational gap.
> > > Consistency between clients has also been a major challenge. I think
> the
> > > effort toward standardization in this KIP will have some positive
> impact
> > > even in deployments which have effective client-side monitoring.
> > Overall, I
> > > think this proposal will provide a lot of value across the board.
> > >
> > > Best,
> > > Jason
> > >
> > > On Wed, Sep 13, 2023 at 9:50 AM Philip Nee <philip...@gmail.com>
> wrote:
> > >
> > >> Hey Andrew -
> > >>
> > >> Thank you for taking the time to reply to my questions. I'm just
> adding
> > >> some notes to this discussion.
> > >>
> > >> 1. epoch: It can be helpful to know the delta between the client-side
> > >> epoch and the actual leader epoch. It helps explain why a commit
> > >> sometimes fails or why the client is not making progress.
> > >> 2. Client connection: If the client selects the "wrong" connection to
> > >> push out the data, I assume the request would time out, which should
> > >> lead to disconnecting from the node and selecting another node, as you
> > >> mentioned, via the least-loaded node.
> > >>
> > >> Cheers,
> > >> P
> > >>
> > >>
> > >> On Tue, Sep 12, 2023 at 10:40 AM Andrew Schofield <
> > >> andrew_schofield_j...@outlook.com> wrote:
> > >>
> > >>> Hi Philip,
> > >>> Thanks for your vote and interest in the KIP.
> > >>>
> > >>> KIP-714 does not introduce any new client metrics, and that’s
> > >> intentional.
> > >>> It does
> > >>> describe how all of the client metrics can have their names
> > transformed
> > >>> into
> > >>> equivalent “telemetry metric names”, and then potentially used in
> > metrics
> > >>> subscriptions.
> > >>>
> > >>> I am interested in the idea of client’s leader epoch in this context,
> > but
> > >>> I don’t have
> > >>> an immediate plan for how best to do this, and it would take another
> > KIP
> > >>> to enhance
> > >>> existing metrics or introduce some new ones. Those would then
> naturally
> > >> be
> > >>> applicable to the metrics push introduced in KIP-714.
> > >>>
> > >>> In a similar vein, there are no existing client metrics specifically
> > for
> > >>> auto-commit.
> > >>> We could add them to Kafka, but I really think this is just an
> example
> > of
> > >>> asynchronous
> > >>> commit in which the application has decided not to specify when the
> > >> commit
> > >>> should
> > >>> begin.
> > >>>
> > >>> It is possible to increase the cadence of pushing by modifying the
> > >>> interval.ms
> > >>> configuration property of the CLIENT_METRICS resource.
> > >>>
> > >>> There is an “assigned-partitions” metric for each consumer, but not
> one
> > >> for
> > >>> active partitions. We could add one, again as a follow-on KIP.
> > >>>
> > >>> I take your point about holding on to a connection in a channel which
> > >> might
> > >>> experience congestion. Do you have a suggestion for how to improve on
> > >> this?
> > >>> For example, the client does have the concept of a least-loaded node.
> > >> Maybe
> > >>> this is something we should investigate in the implementation and
> > decide
> > >>> on the
> > >>> best approach. In general, I think sticking with the same node for
> > >>> consecutive
> > >>> pushes is best, but if you choose the “wrong” node to start with,
> it’s
> > >> not
> > >>> ideal.
> > >>>
> > >>> Thanks,
> > >>> Andrew
> > >>>
> > >>>> On 8 Sep 2023, at 19:29, Philip Nee <philip...@gmail.com> wrote:
> > >>>>
> > >>>> Hey Andrew -
> > >>>>
> > >>>> +1 but I don't have a binding vote!
> > >>>>
> > >>>> It took me a while to go through the KIP. Here are some of my notes
> > >>> during
> > >>>> the reading:
> > >>>>
> > >>>> *Metrics*
> > >>>> - Should we care about the client's leader epoch? There is a case
> > where
> > >>> the
> > >>>> user recreates the topic, but the consumer thinks it is still the
> same
> > >>>> topic and therefore, attempts to start from an offset that doesn't
> > >> exist.
> > >>>> KIP-848 addresses this issue, but I can still see some potential
> > >> benefits
> > >>>> from knowing the client's epoch information.
> > >>>> - I assume poll idle is similar to poll interval: I needed to read
> the
> > >>>> description a few times.
> > >>>> - I don't have a clear use case in mind for the commit latency, but
> I
> > >> do
> > >>>> think sometimes people lack clarity about how much progress was
> > tracked
> > >>> by
> > >>>> the auto-commit.  Would tracking auto-commit-related metrics be
> > >> useful? I
> > >>>> was thinking: the last offset committed or the actual cadence in ms.
> > >>>> - Are there cases when we need to increase the cadence of telemetry
> > >> data
> > >>>> push? i.e. variable interval.
> > >>>> - Thanks for implementing the randomized initial metric push; I
> think
> > >> it
> > >>> is
> > >>>> really important.
> > >>>> - Is there a potential use case for tracking the number of active
> > >>>> partitions? The consumer can pause partitions via API, during
> > >> revocation,
> > >>>> or during offset reset for the stream.
> > >>>>
> > >>>> *Connections*:
> > >>>> - The KIP stated that it will keep the same connection until the
> > >>> connection
> > >>>> is disconnected. I wonder if that could potentially cause congestion
> > if
> > >>> it
> > >>>> is already a busy channel, which leads to connection timeout and
> > >>>> subsequently disconnection.
> > >>>>
> > >>>> Thanks,
> > >>>> P
> > >>>>
> > >>>> On Fri, Sep 8, 2023 at 4:15 AM Andrew Schofield <
> > >>>> andrew_schofield_j...@outlook.com> wrote:
> > >>>>
> > >>>>> Bumping the voting thread for KIP-714.
> > >>>>>
> > >>>>> So far, we have:
> > >>>>> Non-binding +2 (Milind and Kirk), non-binding -1 (Ryanne)
> > >>>>>
> > >>>>> Thanks,
> > >>>>> Andrew
> > >>>>>
> > >>>>>> On 4 Aug 2023, at 09:45, Andrew Schofield <
> > andrew_schofi...@live.com
> > >>>
> > >>>>> wrote:
> > >>>>>>
> > >>>>>> Hi,
> > >>>>>> After almost 2 1/2 years in the making, I would like to call a
> vote
> > >> for
> > >>>>> KIP-714 (
> > >>>>>
> > >>>
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> > >>>>> ).
> > >>>>>>
> > >>>>>> This KIP aims to improve monitoring and troubleshooting of client
> > >>>>> performance by enabling clients to push metrics to brokers.
> > >>>>>>
> > >>>>>> I’d like to thank everyone that participated in the discussion,
> > >>>>> especially the librdkafka team since one of the aims of the KIP is
> to
> > >>>>> enable any client to participate, not just the Apache Kafka
> project’s
> > >>> Java
> > >>>>> clients.
> > >>>>>>
> > >>>>>> Thanks,
> > >>>>>> Andrew
> > >>>
> > >>>
> > >>>
> > >>
> > >
> >
>
