Thanks Sophie and Jun.

`clientInstanceIds()` is fine with me -- was not sure about the double plural myself.

Sorry if my comment was confusing. I was trying to say that adding an overload to `KafkaStreams` that does not take a timeout parameter does not make sense, because there is no `default.api.timeout.ms` config for Kafka Streams, so users always need to pass in a timeout. (Same for the producer.)

For the implementation, I think KS would always call `client.clientInstanceId(timeout)` and never rely on `default.api.timeout.ms`, so we stay in control -- if a timeout is passed by the user, it always overrides `default.api.timeout.ms` on the consumer/admin, and thus we should follow the same semantics in Kafka Streams and pass the timeout explicitly when calling `client.clientInstanceId()`.
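
To sketch what I mean (the bookkeeping is illustrative only, not proposing any new API; assuming the `String` return type currently in the KIP):

    // inside `KafkaStreams#clientInstanceIds(Duration timeout)` we would track the
    // remaining time ourselves and pass it down explicitly, so the clients'
    // `default.api.timeout.ms` never kicks in
    final long deadlineMs = System.currentTimeMillis() + timeout.toMillis();

    final Duration remaining =
        Duration.ofMillis(Math.max(0L, deadlineMs - System.currentTimeMillis()));
    final String adminInstanceId = adminClient.clientInstanceId(remaining);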

The proposed API also makes sense to me. I was just wondering if we want to extend it for client users -- for KS we won't need/use the timeout-less overloads.



130) My intent was to throw a TimeoutException if we cannot get all instance ids, because that's the standard contract for timeouts. It would also be hard for a user to tell whether a full or partial result was returned (unless we add a method like `boolean isPartialResult()` to make it easier).

If there are concerns/objections, I am also ok with returning a partial result -- it would require a change to the newly added `ClientInstanceIds` return type: for `adminInstanceId` we only return a `String` right now, and we might need to change this to `Optional<String>` to be able to return a partial result.
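
For illustration, it could look like this (just a sketch, using the renamed `ClientInstanceIds` type):

    public class ClientInstanceIds {
      // empty if the admin client's instance id could not be retrieved
      // before the timeout expired
      Optional<String> adminInstanceId();

      // already Optional because there might be no global thread; with a
      // partial result, empty could also mean "not retrieved in time"
      Optional<String> globalConsumerInstanceId();

      // clients whose instance id was not retrieved in time would simply
      // be missing from these maps
      Map<String, String> mainConsumerInstanceIds();
      Map<String, String> restoreConsumerInstanceIds();
      Map<String, String> producerInstanceIds();
    }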


131) Of course we could, but I am not sure what we would gain. In the end, implementation details would leak anyway, because if we change the number of consumers we use, we would return different keys in the `Map`. At the moment, the proposal implies that the same key might be used for the "main" and "restore" consumer of the same thread -- but we can make the keys unique by adding a `-restore` suffix to the restore-consumer key if we merge both maps (see the sketch below). Curious to hear what others think; I am very open to doing it differently than currently proposed.
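
To make the `-restore` suffix idea concrete (keys follow the `[Stream|StateUpdater]Thread-<threadIdx>` pattern from the proposal; values are the assigned instance ids):

    // single map covering main and restore consumers
    Map<String, String> consumerInstanceIds();

    //   "StreamThread-1"          -> instance id of thread 1's main consumer
    //   "StreamThread-1-restore"  -> instance id of thread 1's restore consumer
    //   "StreamThread-2"          -> ...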


-Matthias


On 10/12/23 8:39 AM, Jun Rao wrote:
Hi, Matthias,

Thanks for the reply.

130. What would be the semantics? If the timeout has expired and only some
of the client instance ids have been retrieved, does the call return the
partial result or throw an exception?

131. Could we group all consumer instances in a single method since we are
returning the key for each instance already? This probably also avoids
exposing implementation details that could change over time.

Thanks,

Jun

On Thu, Oct 12, 2023 at 12:00 AM Sophie Blee-Goldman <sop...@responsive.dev>
wrote:

Regarding the naming, I personally think `clientInstanceId` makes sense for
the plain clients
  -- especially if we might later introduce the notion of an
`applicationInstanceId`.

I'm not a huge fan of `clientsInstanceIds` for the Kafka Streams API,
though -- can we use `clientInstanceIds` instead? (The difference being
the placement of the plural 's'.)
I would similarly rename the class to just ClientInstanceIds

we can also not have a timeout-less overload,  because `KafkaStreams` does
not have a `default.api.timeout.ms` config either

With respect to the timeout for the Kafka Streams API, I'm a bit confused
by the double (or triple) negative of Matthias' comment here, but I was
thinking about this
earlier and this was my take: with the current proposal, we would allow
users to pass
in an absolute timeout as a parameter that would apply to the method as a
whole.
Meanwhile within the method we would issue separate calls to each of the
clients using
the default or user-configured value of their  `default.api.timeout.ms` as
the timeout
parameter.

So the API as proposed makes sense to me.


On Wed, Oct 11, 2023 at 6:48 PM Matthias J. Sax <mj...@apache.org> wrote:

I can answer 130 and 131.

130) We cannot guarantee that all clients are already initialized due to
race conditions. We plan to not allow calling
`KafkaStreams#clientsInstanceIds()` when the state is not RUNNING (or
REBALANCING) though -- I guess this slipped off the KIP and should be
added? But because StreamThreads can be added dynamically (and producers
might be created dynamically at runtime; cf. below), we still cannot
guarantee that all clients are already initialized when the method is
called. Of course, we assume that all clients are most likely initialized
on the happy path, and blocking calls to `client.clientInstanceId()`
should be rare.

To address the worst case, we won't do a naive implementation and just
loop over all clients, but fan out the call to the different
StreamThreads (and GlobalStreamThread if it exists), and use Futures to
gather the results.
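
A rough sketch of what I have in mind (implementation detail only; `clientInstanceIdFutures()` is a made-up helper each thread would expose):

    final long deadlineMs = System.currentTimeMillis() + timeout.toMillis();

    // each thread hands back futures for its own clients' instance ids
    final Map<String, Future<String>> futures = new HashMap<>();
    for (final StreamThread thread : threads) {
        futures.putAll(thread.clientInstanceIdFutures());
    }

    // gather the results within the remaining time; a TimeoutException from
    // get() would bubble up to the caller (cf. 130 above)
    final Map<String, String> instanceIds = new HashMap<>();
    for (final Map.Entry<String, Future<String>> entry : futures.entrySet()) {
        final long remainingMs = Math.max(0L, deadlineMs - System.currentTimeMillis());
        instanceIds.put(entry.getKey(), entry.getValue().get(remainingMs, TimeUnit.MILLISECONDS));
    }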

Currently, each `StreamThread` has 3 clients (if ALOS or EOSv2 is used), so
we might do 3 blocking calls in the worst case (for EOSv1 we get a
producer per task, and we might end up doing more blocking calls if the
producers are not initialized yet). Note that EOSv1 is already
deprecated, and we are also working on thread refactoring that will
reduce the number of clients per StreamThread to 2 -- and we have more
refactoring planned to reduce the number of clients even further.

Inside `KafkaStreams#clientsInstanceIds()` we might only do a single
blocking call for the admin client (i.e., `admin.clientInstanceId()`).

I agree that we need to do some clever timeout management, but it seems
to be more of an implementation detail?

Do you have any particular concerns, or does the proposed implementation
as sketched above address your question?


131) If the Topology does not have a global state store, there won't be
a GlobalStreamThread and thus no global consumer. Thus, we return an Optional.



Three related questions for Andrew:

(1) Why is the method called `clientInstanceId()` and not just plain
`instanceId()`?

(2) Why do we return a `String` and not a UUID type? The added
protocol request/response classes use UUIDs.

(3) Would it make sense to have an overloaded `clientInstanceId()`
method that does not take any parameter but uses the `default.api.timeout.ms`
config (this config does not exist on the producer though, so we could
only have it for consumer and admin at this point)? We could of course
also add overloads like this later if users request them (and/or add
`default.api.timeout.ms` to the producer, too).

Btw: For KafkaStreams, I think `clientsInstanceIds` still makes sense as
a method name though, as `KafkaStreams` itself does not have an
`instanceId` -- we can also not have a timeout-less overload, because
`KafkaStreams` does not have a `default.api.timeout.ms` config either
(and I don't think it makes sense to add one).



-Matthias

On 10/11/23 5:07 PM, Jun Rao wrote:
Hi, Andrew,

Thanks for the updated KIP. Just a few more minor comments.

130. KafkaStreams.clientsInstanceId(Duration timeout): Does it wait for
all
consumer/producer/adminClient instances to be initialized? Are all
those
instances created during KafkaStreams initialization?

131. Why does globalConsumerInstanceId() return Optional<String> while
other consumer instances don't return Optional?

132. ClientMetricsSubscriptionRequestCount: Do we need this since we
have a
set of generic metrics
(kafka.network:type=RequestMetrics,name=RequestsPerSec,request=*) that
report Request rate for every request type?

Thanks,

Jun

On Wed, Oct 11, 2023 at 1:47 PM Matthias J. Sax <mj...@apache.org>
wrote:

Thanks!

On 10/10/23 11:31 PM, Andrew Schofield wrote:
Matthias,
Yes, I think that’s a sensible way forward and the interface you
propose
looks good. I’ll update the KIP accordingly.

Thanks,
Andrew

On 10 Oct 2023, at 23:01, Matthias J. Sax <mj...@apache.org> wrote:

Andrew,

yes, I would like to get this change into KIP-714 right away. Seems to
be
important, as we don't know if/when a follow-up KIP for Kafka Streams
would
land.

I was also thinking (and discussed with a few others) how to expose
it,
and we would propose the following:

We add a new method to `KafkaStreams` class:

      public ClientsInstanceIds clientsInstanceIds(Duration timeout);

The returned object is like below:

    public class ClientsInstanceIds {
      // we only have a single admin client per KS instance
      String adminInstanceId();

      // we only have a single global consumer per KS instance (if any)
      // Optional<> because we might not have global-thread
      Optional<String> globalConsumerInstanceId();

      // return a <threadKey -> ClientInstanceId> mapping
      // for the underlying (restore-)consumers/producers
      Map<String, String> mainConsumerInstanceIds();
      Map<String, String> restoreConsumerInstanceIds();
      Map<String, String> producerInstanceIds();
    }

For the `threadKey`, we would use some pattern like this:

    [Stream|StateUpdater]Thread-<threadIdx>


Would this work from your POV?



-Matthias


On 10/9/23 2:15 AM, Andrew Schofield wrote:
Hi Matthias,
Good point. Makes sense to me.
Is this something that can also be included in the proposed Kafka
Streams follow-on KIP, or would you prefer that I add it to KIP-714?
I have a slight preference for the former to put all of the KS
enhancements into a separate KIP.
Thanks,
Andrew
On 7 Oct 2023, at 02:12, Matthias J. Sax <mj...@apache.org>
wrote:

Thanks Andrew. SGTM.

One point you did not address is the idea to add a method to
`KafkaStreams` similar to the proposed `clientInstanceId()` that will
be
added to consumer/producer/admin clients.

Without addressing this, Kafka Streams users won't have a way to
get
the assigned `instanceId` of the internally created clients, and thus
it
would be very difficult for them to know which metrics the broker
receives belong to a Kafka Streams app. It seems they would only find
the
`instanceIds` in the log4j output if they enable client logging?

Of course, because there are multiple clients inside Kafka Streams,
the return type cannot be a single "String", but must be some
complex
data structure -- we could either add a new class, or return a
Map<String,String> using a client key that maps to the `instanceId`.

For example we could use the following key:


  [Global]StreamThread[-<threadIndex>][-restore][consumer|producer]

(Of course, only the valid combinations.)

Or maybe even better, we might want to return a `Future`, because
collecting all the `instanceIds` might be a blocking call on each client?
I already have a few ideas how it could be implemented, but I don't think
it must be discussed on the KIP, as it's an implementation detail.

Thoughts?


-Matthias

On 10/6/23 4:21 AM, Andrew Schofield wrote:
Hi Matthias,
Thanks for your comments. I agree that a follow-up KIP for Kafka
Streams makes sense. This KIP currently has made a bit
of an effort to embrace KS, but it’s not enough by a long way.
I have removed `application.id`. This
should be done properly in the follow-up KIP. I don’t believe there’s
a
downside to
removing it from this KIP.
I have reworded the statement about temporality. In practice, the
implementation of this KIP that’s going on while the voting
progresses happens to use delta temporality, but that’s an
implementation detail. Supporting clients must support both
temporalities.
I thought about exposing the client instance ID as a metric, but
non-numeric metrics are not usual practice and tools
do not universally support them. I don’t think the KIP is
improved
by adding one now.
I have also added constants for the various Config classes for
ENABLE_METRICS_PUSH_CONFIG, including to
StreamsConfig. It’s best to be explicit about this.
Thanks,
Andrew
On 2 Oct 2023, at 23:47, Matthias J. Sax <mj...@apache.org>
wrote:

Hi,

I did not pay attention to this KIP in the past; seems it was
on-hold for a while.

Overall it sounds very useful, and I think we should extend this
with a follow up KIP for Kafka Streams. What is unclear to me at this
point
is the statement:

Kafka Streams applications have an application.id configured
and
this identifier should be included as the application_id metrics
label.

The `application.id` is currently only used as the (main)
consumer's `group.id` (and is part of an auto-generated `client.id`
if
the user does not set one).

This comment relates to:

The following labels should be added by the client as
appropriate
before metrics are pushed.

Given that Kafka Streams uses the consumer/producer/admin client
as
"black boxes", a client does at this point not know that it's part of
a
Kafka Streams application, and thus, it won't be able to attach any
such
label to the metrics it sends. (Also producer and admin don't even
know
the
value of `application.id` -- only the (main) consumer, indirectly
via `
group.id`, but also restore and global consumer don't know it,
because
they don't have `group.id` set).

While I am totally in favor of the proposal, I am wondering how we
intend to implement it in a clean way? Or would it be ok to have some
internal client APIs that KS can use to "register" itself with the
client?



While clients must support both temporalities, the broker will
initially only send
GetTelemetrySubscriptionsResponse.DeltaTemporality=True

Not sure if I can follow. How do we make the decision about DELTA or
CUMULATIVE metrics? Should the broker-side plugin not decide what
metrics it wants to receive in which form? So what does "initially" mean --
the broker won't ship with a default plugin implementation?



The following method is added to the Producer, Consumer, and
Admin
client interfaces:

Should we add anything to Kafka Streams to expose the underlying
clients' assigned client-instance-ids programmatically? I am also
wondering
if clients should report their assigned client-instance-ids as metrics
themselves (in this case, Kafka Streams won't need to do anything,
because
we
already expose all client metrics).

If we add anything programmatic, we need to make it simple,
given
that Kafka Streams has many clients per `StreamThread` and may have
multiple threads.



enable.metrics.push
It might be worth adding this to `StreamsConfig`, too? If set via
StreamsConfig, we would forward it to all clients automatically.
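
For example, what I have in mind (just a sketch; the forwarding would happen inside Kafka Streams, and `topology` is defined elsewhere):

    // setting the config once at the Streams level...
    final Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put("enable.metrics.push", "false");  // proposed client config from the KIP

    // ...would be forwarded to all internally created consumers/producers/admin clients
    final KafkaStreams streams = new KafkaStreams(topology, props);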




-Matthias


On 9/29/23 5:45 PM, David Jacot wrote:
Hi Andrew,
Thanks for driving this one. I haven't read all the KIP yet
but I
already
have an initial question. In the Threading section, it is
written
"KafkaConsumer: the "background" thread (based on the consumer
threading
refactor which is underway)". If I understand this correctly,
it
means
that KIP-714 won't work if the "old consumer" is used. Am I
correct?
Cheers,
David
On Fri, Sep 22, 2023 at 12:18 PM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:
Hi Philip,
No, I do not think it should actively search for a broker that
supports
the new
RPCs. In general, either all of the brokers or none of the
brokers will
support it.
In the window, where the cluster is being upgraded or client
telemetry is
being
enabled, there might be a mixed situation. I wouldn’t put too
much effort
into
this mixed scenario. As the client finds brokers which support
the new
RPCs,
it can begin to follow the KIP-714 mechanism.

Thanks,
Andrew

On 22 Sep 2023, at 20:01, Philip Nee <philip...@gmail.com>
wrote:

Hi Andrew -

Question on top of your answers: Do you think the client
should
actively
search for a broker that supports this RPC? As previously
mentioned, the
broker uses the leastLoadedNode to find its first connection
(am
I correct?), and what if that broker doesn't support the
metric
push?

P

On Fri, Sep 22, 2023 at 10:20 AM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:

Hi Kirk,
Thanks for your question. You are correct that the presence
or
absence
of
the new RPCs in the
ApiVersionsResponse tells the client whether to request the
telemetry
subscriptions and push
metrics.

This is of course tricky in practice. It would be
conceivable,
as a
cluster is upgraded to AK 3.7
or as a client metrics receiver plugin is deployed across
the
cluster,
that a client connects to some
brokers that support the new RPCs and some that do not.

Here’s my suggestion:
* If a client is not connected to any brokers that support the new
RPCs, it cannot push metrics.
* If a client is only connected to brokers that support the
new
RPCs, it
will use the new RPCs in
accordance with the KIP.
* If a client is connected to some brokers that support the
new
RPCs and
some that do not, it will
use the new RPCs with the supporting subset of brokers in
accordance
with
the KIP.

Comments?

Thanks,
Andrew

On 22 Sep 2023, at 16:01, Kirk True <k...@kirktrue.pro>
wrote:

Hi Andrew/Jun,

I want to make sure I understand question/comment #119… In
the
case
where a cluster without a metrics client receiver is later
reconfigured
and
restarted to include a metrics client receiver, do we want
the
client to
thereafter begin pushing metrics to the cluster? From
Andrew’s
response
to
question #119, it sounds like we’re using the
presence/absence
of the
relevant RPCs in ApiVersionsResponse as the
to-push-or-not-to-push
indicator. Do I have that correct?

Thanks,
Kirk

On Sep 21, 2023, at 7:42 AM, Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:

Hi Jun,
Thanks for your comments. I’ve updated the KIP to clarify
where
necessary.

110. Yes, agree. The motivation section mentions this.

111. The replacement of ‘-‘ with ‘.’ for metric names and
the
replacement of
‘-‘ with ‘_’ for attribute keys is following the OTLP
guidelines. I
think it’s a bit
of a debatable point. OTLP makes a distinction between a
namespace
and a
multi-word component. If it was “client.id” then “client”
would be a
namespace with
an attribute key “id”. But “client_id” is just a key. So,
it
was
intentional, but debatable.

112. Thanks. The link target moved. Fixed.

113. Thanks. Fixed.

114.1. If a standard metric makes sense for a client, it
should use
the
exact same
name. If a standard metric doesn’t make sense for a
client,
then it
can
omit that metric.

For a required metric, the situation is stronger. All
clients
must
implement these
metrics with these names in order to implement the KIP.
But
the
required metrics
are essentially the number of connections and the request
latency,
which do not
reference the underlying implementation of the client
(which
producer.record.queue.time.max
of course does).

I suppose someone might build a producer-only client that
didn’t have
consumer metrics.
In this case, the consumer metrics would conceptually have
the value 0
and would not
need to be sent to the broker.

114.2. If a client does not implement some metrics, they
will
not be
available for
analysis and troubleshooting. It just makes the ability to
combine
metrics from lots
different clients less complete.

115. I think it was probably a mistake to be so specific
about
threading in this KIP.
When the consumer threading refactor is complete, of
course,
it would
do the appropriate
equivalent. I’ve added a clarification and massively
simplified this
section.

116. I removed “client.terminating”.

117. Yes. Horrid. Fixed.

118. The Terminating flag just indicates that this is the
final
PushTelemetryRequest
from this client. Any subsequent request will be
rejected. I
think
this
flag should remain.

119. Good catch. This was actually contradicting another
part
of the
KIP. The current behaviour
is indeed preserved. If the broker doesn’t have a client
metrics
receiver plugin, the new RPCs
in this KIP are “turned off” and not reported in
ApiVersionsResponse.
The client will not
attempt to push metrics.

120. The error handling table lists the error codes for
PushTelemetryResponse. I’ve added one
but it looked good to me. GetTelemetrySubscriptions
doesn’t
have any
error codes, since the
situation in which the client telemetry is not supported
is
handled by
the RPCs not being offered
by the broker.

121. Again, I think it’s probably a mistake to be specific
about
threading. Removed.

122. Good catch. For DescribeConfigs, the ACL operation
should be
“DESCRIBE_CONFIGS”. For AlterConfigs, the ACL operation
should be
“ALTER” (not “WRITE” as it said). The checks are made on
the
CLUSTER
resource.

Thanks for the detailed review.

Thanks,
Andrew


110. Another potential motivation is the multiple clients
support.
Some of
the places may not have good monitoring support for
non-java
clients.

111. OpenTelemetry Naming: We replace '-' with '.' for
metric name
and
replace '-' with '_' for attributes. Why the
inconsistency?

112. OTLP specification: Page is not found from the link.

113. "Defining standard and required metrics makes the
monitoring and
troubleshooting of clients from various client types ":
Incomplete
sentence.

114. standard/required metrics
114.1 Do other clients need to implement those metrics
with
the exact
same
names?
114.2 What happens if some of those metrics are missing
from
a
client?

115. "KafkaConsumer: both the "heart beat" and
application
threads":
We
have an ongoing effort to refactor the consumer threading
model (





https://cwiki.apache.org/confluence/display/KAFKA/Consumer+threading+refactor+design
).
Once this is done, RPC requests will only be made from
the
background
thread. Should this KIP follow the new model only?

116. 'The metrics should contain the reason for the
client
termination
by
including the client.terminating metric with the label
“reason” ...'.
Hmm,
are we introducing a new metric client.terminating? If
so,
that needs
to be
explicitly listed.

117. "As the metrics plugin may need to add additional
metrics on top
of
this the generic metrics receiver in the broker will not
add
these
labels
but rely on the plugins to do so," The sentence doesn't
read
well.

118. "it is possible for the client to send at most one
accepted
out-of-profile per connection before the rate-limiter
kicks
in": If
we
do
this, do we still need the Terminating flag in
PushTelemetryRequestV0?

119. "If there is no client metrics receiver plugin
configured on the
broker, it will respond to
GetTelemetrySubscriptionsRequest
with
RequestedMetrics set to Null and a -1 SubscriptionId. The
client
should
send a new GetTelemetrySubscriptionsRequest after the
PushIntervalMs
has
expired. This allows the metrics receiver to be enabled
or
disabled
without
having to restart the broker or reset the client
connection."
"no client metrics receiver plugin configured" is defined
by
no
metric
reporter implementing the ClientTelemetry interface,
right?
In that
case,
it would be useful to avoid the clients sending
GetTelemetrySubscriptionsRequest periodically to preserve
the current
behavior.

120. GetTelemetrySubscriptionsResponseV0 and
PushTelemetryRequestV0:
Could
we list error codes for each?

121. "ClientTelemetryReceiver.ClientTelemetryReceiver
This
method may
be
called from the request handling thread": Where else can
this method
be
called?

122. DescribeConfigs/AlterConfigs already exist. Are we
changing the
ACL?

Thanks,

Jun

On Mon, Jul 31, 2023 at 4:33 AM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:

Hi Milind,
Thanks for your question.

On reflection, I agree that INVALID_RECORD is most
likely
to be
caused by a
problem in the serialization in the client. I have
changed
the
client
action in this case
to “Log an error and stop pushing metrics”.

I have updated the KIP text accordingly.

Thanks,
Andrew

On 31 Jul 2023, at 12:09, Milind Luthra
<milut...@confluent.io.INVALID>
wrote:

Hi Andrew,
Thanks for the clarifications.

About 2b:
In case a client has a bug while serializing, it might
be
difficult
for
the
client to recover from that without code changes. In
that,
it might
be
good
to just log the INVALID_RECORD as an error, and treat
the
error as
fatal
for the client (only fatal in terms of sending the
metrics, the
client
can
keep functioning otherwise). What do you think?

Thanks
Milind

On Mon, Jul 24, 2023 at 8:18 PM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:

Hi Milind,
Thanks for your questions about the KIP.

1) I did some archaeology and looked at historical
versions of the
KIP.
I
think this is
just a mistake. 5 minutes is the default metric push
interval. 30
minutes
is a mystery
to me. I’ve updated the KIP.

2) I think there are two situations in which
INVALID_RECORD might
occur.
a) The client might perhaps be using a content-type
that
the
broker
does
not support.
The KIP mentions content-type as a future extension,
but
there’s
only
one
supported
to start with. Until we have multiple content-types,
this
seems
out
of
scope. I think a
future KIP would add another error code for this.
b) The client might perhaps have a bug which means the
metrics
payload
is
malformed.
Logging a warning and attempting the next metrics push
on
the push
interval seems
appropriate.

UNKNOWN_SUBSCRIPTION_ID would indeed be handled by
making
an
immediate
GetTelemetrySubscriptionsRequest.

UNSUPPORTED_COMPRESSION_TYPE seems like either a
client
bug or
perhaps
a situation in which a broker sends a compression type
in
a
GetTelemetrySubscriptionsResponse
which is subsequently not supported when it's used
with a
PushTelemetryRequest.
We do want the client to have the opportunity to get
an
up-to-date
list
of
supported
compression types. I think an immediate
GetTelemetrySubscriptionsRequest
is appropriate.

3) If a client attempts a subsequent handshake with a
Null
ClientInstanceId, the
receiving broker may not already know the client's
existing
ClientInstanceId. If the
receiving broker knows the existing ClientInstanceId,
it
simply
responds with the existing
value to the client. If it does not know the existing
existing
ClientInstanceId, it will create
a new client instance ID and respond with that.

I will update the KIP with these clarifications.

Thanks,
Andrew

On 17 Jul 2023, at 14:21, Milind Luthra
<milut...@confluent.io.INVALID

wrote:

Hi Andrew, thanks for this KIP.

I had a few questions regarding the "Error handling"
section.

1. It mentions that "The 5 and 30 minute retries are
to
eventually
trigger
a retry and avoid having to restart clients if the
cluster
metrics
configuration is disabled temporarily, e.g., by
operator
error,
rolling
upgrades, etc."
But this 30 min interval isn't mentioned anywhere
else.
What is
it
referring to?

2. For the actual errors:
INVALID_RECORD : The action required is to "Log a
warning to the
application and schedule the next
GetTelemetrySubscriptionsRequest
to 5
minutes". Why is this 5 minutes, and not something
like
PushIntervalMs?
And
also, why are we scheduling a
GetTelemetrySubscriptionsRequest in
this
case, if the serialization is broken?
UNKNOWN_SUBSCRIPTION_ID ,
UNSUPPORTED_COMPRESSION_TYPE
:
just to
confirm,
the GetTelemetrySubscriptionsRequest needs to be
scheduled
immediately
after the PushTelemetry response, correct?

3. For "Subsequent GetTelemetrySubscriptionsRequests
must include
the
ClientInstanceId returned in the first response,
regardless of
broker":
Will a broker error be returned in case some
implementation of
this KIP
violates this accidentally and sends a request with
ClientInstanceId =
Null
even when it's been obtained already? Or will a new
ClientInstanceId be
returned without an error?

Thanks!

On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:

Hi,
I would like to start a new discussion thread on
KIP-714: Client
metrics
and observability.









https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability

I have edited the proposal significantly to reduce
the
scope.
The
overall
mechanism for client metric subscriptions is
unchanged,
but the
KIP is now based on the existing client metrics,
rather
than
introducing
new metrics. The purpose remains helping cluster
operators
investigate performance problems experienced by
clients
without
requiring
changes to the client application code or
configuration.

Thanks,
Andrew







