+1 (binding)
On 9/13/23 5:48 PM, Jason Gustafson wrote:
Hey Andrew,
+1 on the KIP. For many users of Kafka, it may not be fully understood how
much of a challenge client monitoring is. With tens of clients in a
cluster, it is already difficult to coordinate metrics collection. When
there are thousands of clients, and when the cluster operator has no
control over them, it is essentially impossible. For the fat clients that
we have, the lack of useful telemetry is a huge operational gap.
Consistency between clients has also been a major challenge. I think the
effort toward standardization in this KIP will have some positive impact
even in deployments which have effective client-side monitoring. Overall, I
think this proposal will provide a lot of value across the board.
Best,
Jason
On Wed, Sep 13, 2023 at 9:50 AM Philip Nee <philip...@gmail.com> wrote:
Hey Andrew -
Thank you for taking the time to reply to my questions. I'm just adding
some notes to this discussion.
1. epoch: It can be helpful to know the delta between the client-side epoch and the actual leader epoch. It helps to understand why a commit sometimes fails or the client is not making progress.
2. Client connection: If the client selects the "wrong" connection to push out the data, I assume the request would time out, which should lead to disconnecting from the node and reselecting another node, as you mentioned, via the least-loaded node.
Cheers,
P
On Tue, Sep 12, 2023 at 10:40 AM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:
Hi Philip,
Thanks for your vote and interest in the KIP.
KIP-714 does not introduce any new client metrics, and that’s intentional. It does describe how all of the existing client metrics can have their names transformed into equivalent “telemetry metric names”, which can then potentially be used in metrics subscriptions.
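To give a flavour of the naming scheme (the exact transformation rules and the full list of resulting names are defined in the KIP, so please treat this as a rough, hypothetical sketch rather than the definitive mapping):

// Hypothetical sketch only -- the authoritative naming rules are in KIP-714.
// Assumes the general idea of prefixing "org.apache.kafka.", dropping the
// "-metrics" suffix from the Kafka metric group, and turning '-' into '.'.
public final class TelemetryMetricNames {
    private TelemetryMetricNames() { }

    public static String toTelemetryName(String group, String name) {
        String groupPart = group.endsWith("-metrics")
                ? group.substring(0, group.length() - "-metrics".length())
                : group;
        return "org.apache.kafka." + groupPart.replace('-', '.') + "." + name.replace('-', '.');
    }

    public static void main(String[] args) {
        // e.g. producer-metrics / connection-count -> org.apache.kafka.producer.connection.count
        System.out.println(toTelemetryName("producer-metrics", "connection-count"));
    }
}

A metrics subscription can then refer to these telemetry names, or prefixes of them.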
I am interested in the idea of the client’s leader epoch in this context, but I don’t have an immediate plan for how best to do this, and it would take another KIP to enhance existing metrics or introduce some new ones. Those would then naturally be applicable to the metrics push introduced in KIP-714.
In a similar vein, there are no existing client metrics specifically for auto-commit. We could add them to Kafka, but I really think this is just an example of asynchronous commit in which the application has decided not to specify when the commit should begin.
It is possible to increase the cadence of pushing by modifying the interval.ms configuration property of the CLIENT_METRICS resource.
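For example, something along these lines using the Admin client might do it, assuming the CLIENT_METRICS resource type and the "metrics"/"interval.ms" subscription properties are surfaced through the Admin API as described in the KIP (the subscription name and values below are purely illustrative):

// Sketch only: assumes KIP-714's CLIENT_METRICS resource type and the
// "metrics" and "interval.ms" subscription properties are exposed through
// the Admin API as the KIP describes. Names and values are illustrative.
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class ClientMetricsSubscriptionExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // "consumer-metrics-subscription" is just an illustrative name.
            ConfigResource resource = new ConfigResource(
                    ConfigResource.Type.CLIENT_METRICS, "consumer-metrics-subscription");

            Collection<AlterConfigOp> ops = List.of(
                    // Subscribe to all consumer telemetry metrics by prefix.
                    new AlterConfigOp(new ConfigEntry("metrics", "org.apache.kafka.consumer."),
                                      AlterConfigOp.OpType.SET),
                    // Push every 30 seconds instead of the default interval.
                    new AlterConfigOp(new ConfigEntry("interval.ms", "30000"),
                                      AlterConfigOp.OpType.SET));

            admin.incrementalAlterConfigs(Map.of(resource, ops)).all().get();
        }
    }
}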
There is an “assigned-partitions” metric for each consumer, but not one for active partitions. We could add one, again as a follow-on KIP.
I take your point about holding on to a connection in a channel which might experience congestion. Do you have a suggestion for how to improve on this? For example, the client does have the concept of a least-loaded node. Maybe this is something we should investigate in the implementation and decide on the best approach.
In general, I think sticking with the same node for consecutive pushes is best, but if you choose the “wrong” node to start with, it’s not ideal.
Thanks,
Andrew
On 8 Sep 2023, at 19:29, Philip Nee <philip...@gmail.com> wrote:
Hey Andrew -
+1 but I don't have a binding vote!
It took me a while to go through the KIP. Here are some of my notes from reading it:
*Metrics*
- Should we care about the client's leader epoch? There is a case where the user recreates the topic, but the consumer thinks it is still the same topic and therefore attempts to start from an offset that doesn't exist. KIP-848 addresses this issue, but I can still see some potential benefits from knowing the client's epoch information.
- I assume poll idle is similar to poll interval: I needed to read the
description a few times.
- I don't have a clear use case in mind for the commit latency, but I do think people sometimes lack clarity about how much progress was tracked by the auto-commit. Would tracking auto-commit-related metrics be useful? I was thinking of the last offset committed or the actual cadence in ms.
- Are there cases where we need to increase the cadence of telemetry data pushes, i.e. a variable interval?
- Thanks for implementing the randomized initial metric push; I think it is really important.
- Is there a potential use case for tracking the number of active partitions? The consumer can pause partitions via the API, during revocation, or during an offset reset for streams.
*Connections*:
- The KIP states that the client will keep the same connection until it is disconnected. I wonder if that could potentially cause congestion if it is already a busy channel, which could lead to a connection timeout and subsequently disconnection.
Thanks,
P
On Fri, Sep 8, 2023 at 4:15 AM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:
Bumping the voting thread for KIP-714.
So far, we have:
Non-binding +2 (Milind and Kirk), non-binding -1 (Ryanne)
Thanks,
Andrew
On 4 Aug 2023, at 09:45, Andrew Schofield <andrew_schofi...@live.com> wrote:
Hi,
After almost 2 1/2 years in the making, I would like to call a vote for KIP-714 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability).
This KIP aims to improve monitoring and troubleshooting of client
performance by enabling clients to push metrics to brokers.
I’d like to thank everyone who participated in the discussion, especially the librdkafka team, since one of the aims of the KIP is to enable any client to participate, not just the Apache Kafka project’s Java clients.
Thanks,
Andrew