Re: [VOTE] KIP-714: Client metrics and observability

Andrew Schofield Mon, 16 Oct 2023 01:18:59 -0700

The vote for KIP-714 has now concluded and the KIP is APPROVED.

The votes are:
Binding:
   +4 (Jason, Matthias, Sophie, Jun)
Non-binding:
   +3 (Milind, Kirk, Philip)
   -1 (Ryanne)


This KIP aims to improve monitoring and troubleshooting of client
performance by enabling clients to push metrics to brokers. The lack of
consistent telemetry across clients is an operational gap, and many cluster
operators do not have control over the clients. Often, asking the client owner
to change the configuration or even application code in order to troubleshoot
problems is not workable. This is why the KIP enables the broker to request
metrics from clients, giving a consistent, cross-platform mechanism.

The feature is enabled by configuring a metrics plugin on the brokers which
implements the ClientTelemetry interface. In the absence of a plugin with this
interface, the brokers do not even support the new RPCs in this KIP and the
clients will not attempt or be able to push metrics. So, a vanilla Apache Kafka
broker will not collect metrics.
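
As a rough illustration of that gating, here is a minimal Java sketch. It is
not the broker implementation: the ClientTelemetry interface below is a
simplified, hypothetical stand-in (the real interface and its method
signatures are defined by the KIP), and MetricsPlugin, LoggingTelemetryPlugin
and BrokerSketch are names invented for this example only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for a broker-side metrics plugin (assumption, not a Kafka type).
interface MetricsPlugin {}

// Simplified, hypothetical stand-in for the ClientTelemetry interface named above.
// The real interface and its signatures are defined by the KIP, not by this sketch.
interface ClientTelemetry {
    void exportMetrics(String clientInstanceId, byte[] otlpPayload);
}

// Example plugin that opts in to client telemetry by implementing both interfaces.
class LoggingTelemetryPlugin implements MetricsPlugin, ClientTelemetry {
    final List<String> received = new ArrayList<>();

    @Override
    public void exportMetrics(String clientInstanceId, byte[] otlpPayload) {
        // Record only the client id and payload size; a real plugin would forward the data.
        received.add(clientInstanceId + ":" + otlpPayload.length);
    }
}

// Models the behaviour described above: without a ClientTelemetry plugin,
// the broker does not support the push path at all.
class BrokerSketch {
    private final Optional<ClientTelemetry> receiver;

    BrokerSketch(List<MetricsPlugin> plugins) {
        this.receiver = plugins.stream()
                .filter(p -> p instanceof ClientTelemetry)
                .map(p -> (ClientTelemetry) p)
                .findFirst();
    }

    boolean supportsTelemetryRpcs() {
        return receiver.isPresent();
    }

    // Returns false when no plugin is configured, so nothing is collected.
    boolean handlePush(String clientInstanceId, byte[] otlpPayload) {
        if (!receiver.isPresent()) {
            return false;
        }
        receiver.get().exportMetrics(clientInstanceId, otlpPayload);
        return true;
    }
}
```

A BrokerSketch built with an empty plugin list reports no telemetry support
and rejects pushes, which mirrors the "vanilla broker" case; adding a plugin
that implements the interface turns the push path on.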

I would like to make available an open-source implementation of the
ClientTelemetry interface that works with an open-source monitoring solution.

The KIP does put support for OTLP serialisation into the client, so there are
new dependencies in the Java client, which are bundled and relocated (shaded).
OTLP also opens up future use cases involving OpenTelemetry, which is emerging
as the de facto standard for telemetry and observability in general.

Thanks to everyone who has contributed to KIP-714 since Magnus Edenhill
kicked it all off in February 2021.

Andrew

> On 14 Oct 2023, at 01:52, Jun Rao <[email protected]> wrote:
>
> Hi, Andrew,
>
> Thanks for the KIP. +1 from me too.
>
> Jun
>
> On Wed, Oct 11, 2023 at 4:00 PM Sophie Blee-Goldman <[email protected]>
> wrote:
>
>> This looks great! +1 (binding)
>>
>> Sophie
>>
>> On Wed, Oct 11, 2023 at 1:46 PM Matthias J. Sax <[email protected]> wrote:
>>
>>> +1 (binding)
>>>
>>> On 9/13/23 5:48 PM, Jason Gustafson wrote:
>>>> Hey Andrew,
>>>>
>>>> +1 on the KIP. For many users of Kafka, it may not be fully understood how
>>>> much of a challenge client monitoring is. With tens of clients in a
>>>> cluster, it is already difficult to coordinate metrics collection. When
>>>> there are thousands of clients, and when the cluster operator has no
>>>> control over them, it is essentially impossible. For the fat clients that
>>>> we have, the lack of useful telemetry is a huge operational gap.
>>>> Consistency between clients has also been a major challenge. I think the
>>>> effort toward standardization in this KIP will have some positive impact
>>>> even in deployments which have effective client-side monitoring. Overall, I
>>>> think this proposal will provide a lot of value across the board.
>>>>
>>>> Best,
>>>> Jason
>>>>
>>>> On Wed, Sep 13, 2023 at 9:50 AM Philip Nee <[email protected]>
>> wrote:
>>>>
>>>>> Hey Andrew -
>>>>>
>>>>> Thank you for taking the time to reply to my questions. I'm just adding
>>>>> some notes to this discussion.
>>>>>
>>>>> 1. epoch: It can be helpful to know the delta between the client-side epoch
>>>>> and the actual leader epoch. It helps explain why commits sometimes fail
>>>>> or why the client is not making progress.
>>>>> 2. Client connection: If the client selects the "wrong" connection to push
>>>>> out the data, I assume the request would time out, which should lead to
>>>>> disconnecting from the node and reselecting another node, as you mentioned,
>>>>> via the least loaded node.
>>>>>
>>>>> Cheers,
>>>>> P
>>>>>
>>>>>
>>>>> On Tue, Sep 12, 2023 at 10:40 AM Andrew Schofield <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi Philip,
>>>>>> Thanks for your vote and interest in the KIP.
>>>>>>
>>>>>> KIP-714 does not introduce any new client metrics, and that’s intentional.
>>>>>> It does describe how all of the client metrics can have their names
>>>>>> transformed into equivalent “telemetry metric names”, which can then
>>>>>> potentially be used in metrics subscriptions.
>>>>>>
>>>>>> I am interested in the idea of the client’s leader epoch in this context,
>>>>>> but I don’t have an immediate plan for how best to do this, and it would
>>>>>> take another KIP to enhance existing metrics or introduce some new ones.
>>>>>> Those would then naturally be applicable to the metrics push introduced in
>>>>>> KIP-714.
>>>>>>
>>>>>> In a similar vein, there are no existing client metrics specifically for
>>>>>> auto-commit. We could add them to Kafka, but I really think this is just an
>>>>>> example of asynchronous commit in which the application has decided not to
>>>>>> specify when the commit should begin.
>>>>>>
>>>>>> It is possible to increase the cadence of pushing by modifying the
>>>>>> interval.ms
>>>>>> configuration property of the CLIENT_METRICS resource.
>>>>>>
>>>>>> There is an “assigned-partitions” metric for each consumer, but not one for
>>>>>> active partitions. We could add one, again as a follow-on KIP.
>>>>>>
>>>>>> I take your point about holding on to a connection in a channel which might
>>>>>> experience congestion. Do you have a suggestion for how to improve on this?
>>>>>> For example, the client does have the concept of a least-loaded node. Maybe
>>>>>> this is something we should investigate in the implementation and decide
>>>>>> on the best approach. In general, I think sticking with the same node for
>>>>>> consecutive pushes is best, but if you choose the “wrong” node to start
>>>>>> with, it’s not ideal.
>>>>>>
>>>>>> Thanks,
>>>>>> Andrew
>>>>>>
>>>>>>> On 8 Sep 2023, at 19:29, Philip Nee <[email protected]> wrote:
>>>>>>>
>>>>>>> Hey Andrew -
>>>>>>>
>>>>>>> +1 but I don't have a binding vote!
>>>>>>>
>>>>>>> It took me a while to go through the KIP. Here are some of my notes during
>>>>>>> the reading:
>>>>>>>
>>>>>>> *Metrics*
>>>>>>> - Should we care about the client's leader epoch? There is a case where the
>>>>>>> user recreates the topic, but the consumer thinks it is still the same
>>>>>>> topic and therefore, attempts to start from an offset that doesn't exist.
>>>>>>> KIP-848 addresses this issue, but I can still see some potential benefits
>>>>>>> from knowing the client's epoch information.
>>>>>>> - I assume poll idle is similar to poll interval: I needed to read the
>>>>>>> description a few times.
>>>>>>> - I don't have a clear use case in mind for the commit latency, but I do
>>>>>>> think sometimes people lack clarity about how much progress was tracked by
>>>>>>> the auto-commit.  Would tracking auto-commit-related metrics be useful? I
>>>>>>> was thinking: the last offset committed or the actual cadence in ms.
>>>>>>> - Are there cases when we need to increase the cadence of telemetry data
>>>>>>> push? i.e. variable interval.
>>>>>>> - Thanks for implementing the randomized initial metric push; I think it is
>>>>>>> really important.
>>>>>>> - Is there a potential use case for tracking the number of active
>>>>>>> partitions? The consumer can pause partitions via API, during revocation,
>>>>>>> or during offset reset for the stream.
>>>>>>>
>>>>>>> *Connections*:
>>>>>>> - The KIP stated that it will keep the same connection until the connection
>>>>>>> is disconnected. I wonder if that could potentially cause congestion if it
>>>>>>> is already a busy channel, which leads to connection timeout and
>>>>>>> subsequently disconnection.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> P
>>>>>>>
>>>>>>> On Fri, Sep 8, 2023 at 4:15 AM Andrew Schofield <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Bumping the voting thread for KIP-714.
>>>>>>>>
>>>>>>>> So far, we have:
>>>>>>>> Non-binding +2 (Milind and Kirk), non-binding -1 (Ryanne)
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Andrew
>>>>>>>>
>>>>>>>>> On 4 Aug 2023, at 09:45, Andrew Schofield <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> After almost 2 1/2 years in the making, I would like to call a vote for
>>>>>>>>> KIP-714 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability).
>>>>>>>>>
>>>>>>>>> This KIP aims to improve monitoring and troubleshooting of client
>>>>>>>>> performance by enabling clients to push metrics to brokers.
>>>>>>>>>
>>>>>>>>> I’d like to thank everyone who participated in the discussion,
>>>>>>>>> especially the librdkafka team, since one of the aims of the KIP is to
>>>>>>>>> enable any client to participate, not just the Apache Kafka project’s
>>>>>>>>> Java clients.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Andrew
