Re: [DISCUSS] KIP-714: Client metrics and observability

Matthias J. Sax Wed, 11 Oct 2023 13:46:27 -0700

Thanks!

On 10/10/23 11:31 PM, Andrew Schofield wrote:

Matthias,
Yes, I think that’s a sensible way forward and the interface you propose looks 
good. I’ll update the KIP accordingly.


Thanks,
Andrew

On 10 Oct 2023, at 23:01, Matthias J. Sax <mj...@apache.org> wrote:

Andrew,

yes, I would like to get this change into KIP-714 right away. Seems to be 
important, as we don't know if/when a follow-up KIP for Kafka Streams would 
land.

I was also thinking (and discussed with a few others) how to expose it, and we 
would propose the following:

We add a new method to `KafkaStreams` class:

    public ClientsInstanceIds clientsInstanceIds(Duration timeout);

The returned object is like below:

  public class ClientsInstanceIds {
    // we only have a single admin client per KS instance
    String adminInstanceId();

    // we only have a single global consumer per KS instance (if any)
    // Optional<> because we might not have global-thread
    Optional<String> globalConsumerInstanceId();

    // return a <threadKey -> ClientInstanceId> mapping
    // for the underlying (restore-)consumers/producers
    Map<String, String> mainConsumerInstanceIds();
    Map<String, String> restoreConsumerInstanceIds();
    Map<String, String> producerInstanceIds();
  }

For the `threadKey`, we would use some pattern like this:

  [Stream|StateUpdater]Thread-<threadIdx>


Would this work from your POV?
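
To make the proposed shape concrete, here is a minimal, hypothetical sketch of
such a holder and of the `threadKey` pattern. The class name
`ClientsInstanceIdsSketch`, the `addMainConsumer` helper and the sample ids are
assumptions for illustration only, not the API proposed for `KafkaStreams`.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical value holder mirroring part of the proposed accessors.
final class ClientsInstanceIdsSketch {
    private final String adminInstanceId;
    private final Optional<String> globalConsumerInstanceId;
    private final Map<String, String> mainConsumerInstanceIds = new TreeMap<>();

    ClientsInstanceIdsSketch(String adminInstanceId,
                             Optional<String> globalConsumerInstanceId) {
        this.adminInstanceId = adminInstanceId;
        this.globalConsumerInstanceId = globalConsumerInstanceId;
    }

    // Builds a key following [Stream|StateUpdater]Thread-<threadIdx>.
    static String threadKey(String threadType, int threadIdx) {
        if (!threadType.equals("Stream") && !threadType.equals("StateUpdater")) {
            throw new IllegalArgumentException("unknown thread type: " + threadType);
        }
        return threadType + "Thread-" + threadIdx;
    }

    // Assumed helper: records the instance id of one thread's main consumer.
    void addMainConsumer(String threadKey, String instanceId) {
        mainConsumerInstanceIds.put(threadKey, instanceId);
    }

    String adminInstanceId() {
        return adminInstanceId;
    }

    Optional<String> globalConsumerInstanceId() {
        return globalConsumerInstanceId;
    }

    Map<String, String> mainConsumerInstanceIds() {
        return Collections.unmodifiableMap(mainConsumerInstanceIds);
    }
}
```

A caller could then look up `mainConsumerInstanceIds().get("StreamThread-1")`
to correlate broker-side metrics with a particular stream thread.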



-Matthias


On 10/9/23 2:15 AM, Andrew Schofield wrote:

Hi Matthias,
Good point. Makes sense to me.
Is this something that can also be included in the proposed Kafka Streams 
follow-on KIP, or would you prefer that I add it to KIP-714?
I have a slight preference for the former to put all of the KS enhancements 
into a separate KIP.
Thanks,
Andrew

On 7 Oct 2023, at 02:12, Matthias J. Sax <mj...@apache.org> wrote:

Thanks Andrew. SGTM.

One point you did not address is the idea to add a method to `KafkaStreams` 
similar to the proposed `clientInstanceId()` that will be added to 
consumer/producer/admin clients.

Without addressing this, Kafka Streams users won't have a way to get the 
assigned `instanceId` of the internally created clients, and thus it would be 
very difficult for them to know which metrics that the broker receives belong 
to a Kafka Streams app. It seems they would only find the `instanceIds` in the 
log4j output if they enable client logging?

Of course, because there are multiple clients inside Kafka Streams, the return 
type cannot be a single "String", but must be some complex data structure -- we 
could either add a new class, or return a Map<String,String> using a client key 
that maps to the `instanceId`.

For example we could use the following key:

   [Global]StreamThread[-<threadIndex>][-restore][consumer|producer]

(Of course, only the valid combinations.)

Or maybe even better, we might want to return a `Future` because collecting all 
the `instanceId`s might be a blocking call on each client? I already have a few 
ideas how it could be implemented but I don't think it must be discussed on the 
KIP, as it's an implementation detail.

Thoughts?


-Matthias

On 10/6/23 4:21 AM, Andrew Schofield wrote:

Hi Matthias,
Thanks for your comments. I agree that a follow-up KIP for Kafka Streams makes 
sense. This KIP currently has made a bit
of an effort to embrace KS, but it’s not enough by a long way.
I have removed `application.id`. This should be done 
properly in the follow-up KIP. I don’t believe there’s a downside to
removing it from this KIP.
I have reworded the statement about temporality. In practice, the 
implementation of this KIP that’s going on while the voting
progresses happens to use delta temporality, but that’s an implementation 
detail. Supporting clients must support both
temporalities.
I thought about exposing the client instance ID as a metric, but non-numeric 
metrics are not usual practice and tools
do not universally support them. I don’t think the KIP is improved by adding 
one now.
I have also added constants for the various Config classes for 
ENABLE_METRICS_PUSH_CONFIG, including to
StreamsConfig. It’s best to be explicit about this.
Thanks,
Andrew

On 2 Oct 2023, at 23:47, Matthias J. Sax <mj...@apache.org> wrote:

Hi,

I did not pay attention to this KIP in the past; seems it was on-hold for a 
while.

Overall it sounds very useful, and I think we should extend this with a follow 
up KIP for Kafka Streams. What is unclear to me at this point is the statement:

Kafka Streams applications have an application.id configured and this 
identifier should be included as the application_id metrics label.


The `application.id` is currently only used as the (main) consumer's `group.id` 
(and is part of an auto-generated `client.id` if the user does not set one).

This comment relates to:

The following labels should be added by the client as appropriate before 
metrics are pushed.


Given that Kafka Streams uses the consumer/producer/admin client as "black 
boxes", a client does at this point not know that it's part of a Kafka Streams 
application, and thus, it won't be able to attach any such label to the metrics it sends. 
(Also producer and admin don't even know the value of `application.id` -- only the (main) 
consumer, indirectly via `group.id`, but also restore and global consumer don't know it, 
because they don't have `group.id` set).

While I am totally in favor of the proposal, I am wondering how we intend to 
implement it in a clean way? Or would we be ok to have some internal client APIs 
that KS can use to "register" itself with the client?

While clients must support both temporalities, the broker will initially only 
send GetTelemetrySubscriptionsResponse.DeltaTemporality=True


Not sure if I can follow. How do we make the decision about DELTA or CUMULATIVE 
metrics? Should the broker side plugin not decide what metrics it wants to 
receive in which form? So what does "initially" mean -- the broker won't ship 
with a default plugin implementation?

The following method is added to the Producer, Consumer, and Admin client 
interfaces:


Should we add anything to Kafka Streams to expose the underlying clients' 
assigned client-instance-ids programmatically? I am also wondering if clients 
should report their assigned client-instance-ids as metrics themselves (for this 
case, Kafka Streams won't need to do anything, because we already expose all 
client metrics).

If we add anything programmatic, we need to make it simple, given that Kafka 
Streams has many clients per `StreamThread` and may have multiple threads.

enable.metrics.push

It might be worth adding this to `StreamsConfig`, too? If set via 
StreamsConfig, we would forward it to all clients automatically.
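
As an illustration of the forwarding idea, a minimal sketch using plain maps
follows. The class and method names are hypothetical, and letting an explicit
per-client setting take precedence over the Streams-level value is an assumed
design choice, not something the KIP states.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of forwarding enable.metrics.push to embedded clients.
final class MetricsPushForwardingSketch {
    static final String ENABLE_METRICS_PUSH_CONFIG = "enable.metrics.push";

    // Returns the config for one embedded client: the client's own overrides,
    // plus the Streams-level enable.metrics.push value if the client did not
    // set one explicitly (assumed precedence).
    static Map<String, Object> clientConfig(Map<String, Object> streamsConfig,
                                            Map<String, Object> clientOverrides) {
        Map<String, Object> result = new HashMap<>(clientOverrides);
        if (streamsConfig.containsKey(ENABLE_METRICS_PUSH_CONFIG)) {
            result.putIfAbsent(ENABLE_METRICS_PUSH_CONFIG,
                               streamsConfig.get(ENABLE_METRICS_PUSH_CONFIG));
        }
        return result;
    }
}
```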




-Matthias


On 9/29/23 5:45 PM, David Jacot wrote:

Hi Andrew,
Thanks for driving this one. I haven't read all the KIP yet but I already
have an initial question. In the Threading section, it is written
"KafkaConsumer: the "background" thread (based on the consumer threading
refactor which is underway)". If I understand this correctly, it means
that KIP-714 won't work if the "old consumer" is used. Am I correct?
Cheers,
David
On Fri, Sep 22, 2023 at 12:18 PM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:

Hi Philip,
No, I do not think it should actively search for a broker that supports the
new RPCs. In general, either all of the brokers or none of the brokers will
support it. In the window where the cluster is being upgraded or client
telemetry is being enabled, there might be a mixed situation. I wouldn’t put
too much effort into this mixed scenario. As the client finds brokers which
support the new RPCs, it can begin to follow the KIP-714 mechanism.

Thanks,
Andrew

On 22 Sep 2023, at 20:01, Philip Nee <philip...@gmail.com> wrote:

Hi Andrew -

Question on top of your answers: Do you think the client should actively
search for a broker that supports this RPC? As previously mentioned, the
broker uses the leastLoadedNode to find its first connection (am
I correct?), and what if that broker doesn't support the metric push?

P

On Fri, Sep 22, 2023 at 10:20 AM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:

Hi Kirk,
Thanks for your question. You are correct that the presence or absence of the
new RPCs in the ApiVersionsResponse tells the client whether to request the
telemetry subscriptions and push metrics.

This is of course tricky in practice. It would be conceivable, as a
cluster is upgraded to AK 3.7
or as a client metrics receiver plugin is deployed across the cluster,
that a client connects to some
brokers that support the new RPCs and some that do not.

Here’s my suggestion:
* If a client is not connected to any brokers that support the new RPCs, it
cannot push metrics.
* If a client is only connected to brokers that support the new RPCs, it will
use the new RPCs in accordance with the KIP.
* If a client is connected to some brokers that support the new RPCs and some
that do not, it will use the new RPCs with the supporting subset of brokers in
accordance with the KIP.
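
The three cases above reduce to "use whichever connected brokers advertise the
new RPCs". A minimal sketch, assuming the client already knows per broker
whether the RPCs appeared in its ApiVersionsResponse; the class name and the
map-based input are hypothetical:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: pick the connected brokers that can receive metrics.
final class TelemetryBrokerSelectionSketch {
    // Input: broker id -> whether that broker advertised the telemetry RPCs.
    // Output: sorted ids of brokers usable for the KIP-714 mechanism.
    // An empty result means the client cannot push metrics at the moment.
    static List<Integer> pushCandidates(Map<Integer, Boolean> supportsTelemetry) {
        List<Integer> candidates = new ArrayList<>();
        for (Map.Entry<Integer, Boolean> entry : supportsTelemetry.entrySet()) {
            if (Boolean.TRUE.equals(entry.getValue())) {
                candidates.add(entry.getKey());
            }
        }
        Collections.sort(candidates);
        return candidates;
    }
}
```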

Comments?

Thanks,
Andrew

On 22 Sep 2023, at 16:01, Kirk True <k...@kirktrue.pro> wrote:

Hi Andrew/Jun,

I want to make sure I understand question/comment #119… In the case where a
cluster without a metrics client receiver is later reconfigured and restarted
to include a metrics client receiver, do we want the client to thereafter
begin pushing metrics to the cluster? From Andrew’s response to question #119,
it sounds like we’re using the presence/absence of the relevant RPCs in
ApiVersionsResponse as the to-push-or-not-to-push indicator. Do I have that
correct?


Thanks,
Kirk

On Sep 21, 2023, at 7:42 AM, Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:


Hi Jun,
Thanks for your comments. I’ve updated the KIP to clarify where necessary.


110. Yes, agree. The motivation section mentions this.

111. The replacement of ‘-‘ with ‘.’ for metric names and the replacement of
‘-‘ with ‘_’ for attribute keys is following the OTLP guidelines. I think it’s
a bit of a debatable point. OTLP makes a distinction between a namespace and a
multi-word component. If it was “client.id” then “client” would be a namespace
with an attribute key “id”. But “client_id” is just a key. So, it was
intentional, but debatable.
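
A minimal sketch of the two replacements described in 111. It shows only the
character substitution; any prefixing or other renaming the KIP applies is not
covered, and the class name and sample inputs are hypothetical.

```java
// Hypothetical sketch of the '-' replacement rules discussed in item 111.
final class OtlpNamingSketch {
    // Metric names: '-' becomes '.', so each part reads as a namespace level.
    static String metricName(String kafkaMetricName) {
        return kafkaMetricName.replace('-', '.');
    }

    // Attribute keys: '-' becomes '_', so a multi-word key stays a single key.
    static String attributeKey(String kafkaTagName) {
        return kafkaTagName.replace('-', '_');
    }
}
```

With these rules `client-id` used as an attribute key becomes `client_id`, a
single key, rather than `client.id`, which would read as namespace `client`
with key `id`.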


112. Thanks. The link target moved. Fixed.

113. Thanks. Fixed.

114.1. If a standard metric makes sense for a client, it should use the exact
same name. If a standard metric doesn’t make sense for a client, then it can
omit that metric.

For a required metric, the situation is stronger. All clients must implement
these metrics with these names in order to implement the KIP. But the required
metrics are essentially the number of connections and the request latency,
which do not reference the underlying implementation of the client (which
producer.record.queue.time.max of course does).

I suppose someone might build a producer-only client that didn’t have consumer
metrics. In this case, the consumer metrics would conceptually have the value
0 and would not need to be sent to the broker.

114.2. If a client does not implement some metrics, they will not be available
for analysis and troubleshooting. It just makes the ability to combine metrics
from lots of different clients less complete.

115. I think it was probably a mistake to be so specific about threading in
this KIP. When the consumer threading refactor is complete, of course, it
would do the appropriate equivalent. I’ve added a clarification and massively
simplified this section.


116. I removed “client.terminating”.

117. Yes. Horrid. Fixed.

118. The Terminating flag just indicates that this is the final
PushTelemetryRequest from this client. Any subsequent request will be
rejected. I think this flag should remain.

119. Good catch. This was actually contradicting another part of the KIP. The
current behaviour is indeed preserved. If the broker doesn’t have a client
metrics receiver plugin, the new RPCs in this KIP are “turned off” and not
reported in ApiVersionsResponse. The client will not attempt to push metrics.

120. The error handling table lists the error codes for PushTelemetryResponse.
I’ve added one but it looked good to me. GetTelemetrySubscriptions doesn’t
have any error codes, since the situation in which the client telemetry is not
supported is handled by the RPCs not being offered by the broker.

121. Again, I think it’s probably a mistake to be specific about threading.
Removed.


122. Good catch. For DescribeConfigs, the ACL operation should be
“DESCRIBE_CONFIGS”. For AlterConfigs, the ACL operation should be
“ALTER” (not “WRITE” as it said). The checks are made on the CLUSTER
resource.

Thanks for the detailed review.

Thanks,
Andrew


110. Another potential motivation is the multiple clients support. Some of the
places may not have good monitoring support for non-java clients.

111. OpenTelemetry Naming: We replace '-' with '.' for metric name and replace
'-' with '_' for attributes. Why is the inconsistency?

112. OTLP specification: Page is not found from the link.

113. "Defining standard and required metrics makes the monitoring and
troubleshooting of clients from various client types ": Incomplete sentence.

114. standard/required metrics
114.1 Do other clients need to implement those metrics with the exact same
names?
114.2 What happens if some of those metrics are missing from a client?


115. "KafkaConsumer: both the "heart beat" and application threads": We have
an ongoing effort to refactor the consumer threading model (
https://cwiki.apache.org/confluence/display/KAFKA/Consumer+threading+refactor+design
). Once this is done, RPC requests will only be made from the background
thread. Should this KIP follow the new model only?

116. 'The metrics should contain the reason for the client termination by
including the client.terminating metric with the label “reason” ...'. Hmm,
are we introducing a new metric client.terminating? If so, that needs to be
explicitly listed.

117. "As the metrics plugin may need to add additional metrics on top of this
the generic metrics receiver in the broker will not add these labels but rely
on the plugins to do so," The sentence doesn't read well.

118. "it is possible for the client to send at most one accepted
out-of-profile per connection before the rate-limiter kicks in": If we do
this, do we still need the Terminating flag in PushTelemetryRequestV0?

119. "If there is no client metrics receiver plugin configured on the broker,
it will respond to GetTelemetrySubscriptionsRequest with RequestedMetrics set
to Null and a -1 SubscriptionId. The client should send a new
GetTelemetrySubscriptionsRequest after the PushIntervalMs has expired. This
allows the metrics receiver to be enabled or disabled without having to
restart the broker or reset the client connection."
"no client metrics receiver plugin configured" is defined by no metric
reporter implementing the ClientTelemetry interface, right? In that case, it
would be useful to avoid the clients sending GetTelemetrySubscriptionsRequest
periodically to preserve the current behavior.

120. GetTelemetrySubscriptionsResponseV0 and PushTelemetryRequestV0: Could we
list error codes for each?

121. "ClientTelemetryReceiver.ClientTelemetryReceiver This method may be
called from the request handling thread": Where else can this method be
called?

122. DescribeConfigs/AlterConfigs already exist. Are we changing the ACL?


Thanks,

Jun

On Mon, Jul 31, 2023 at 4:33 AM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:

Hi Milind,
Thanks for your question.

On reflection, I agree that INVALID_RECORD is most likely to be caused by a
problem in the serialization in the client. I have changed the client action
in this case to “Log an error and stop pushing metrics”.

I have updated the KIP text accordingly.

Thanks,
Andrew

On 31 Jul 2023, at 12:09, Milind Luthra <milut...@confluent.io.INVALID> wrote:


Hi Andrew,
Thanks for the clarifications.

About 2b:
In case a client has a bug while serializing, it might be difficult for the
client to recover from that without code changes. In that case, it might be
good to just log the INVALID_RECORD as an error, and treat the error as fatal
for the client (only fatal in terms of sending the metrics, the client can
keep functioning otherwise). What do you think?

Thanks
Milind

On Mon, Jul 24, 2023 at 8:18 PM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:

Hi Milind,
Thanks for your questions about the KIP.

1) I did some archaeology and looked at historical versions of the KIP. I
think this is just a mistake. 5 minutes is the default metric push interval.
30 minutes is a mystery to me. I’ve updated the KIP.

2) I think there are two situations in which INVALID_RECORD might occur.
a) The client might perhaps be using a content-type that the broker does not
support. The KIP mentions content-type as a future extension, but there’s only
one supported to start with. Until we have multiple content-types, this seems
out of scope. I think a future KIP would add another error code for this.
b) The client might perhaps have a bug which means the metrics payload is
malformed. Logging a warning and attempting the next metrics push on the push
interval seems appropriate.

UNKNOWN_SUBSCRIPTION_ID would indeed be handled by making an immediate
GetTelemetrySubscriptionsRequest.

UNSUPPORTED_COMPRESSION_TYPE seems like either a client bug or perhaps a
situation in which a broker sends a compression type in a
GetTelemetrySubscriptionsResponse which is subsequently not supported when
it's used with a PushTelemetryRequest. We do want the client to have the
opportunity to get an up-to-date list of supported compression types. I think
an immediate GetTelemetrySubscriptionsRequest is appropriate.
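
The client actions described in 2) can be summarised as a small lookup. This
is a sketch of the behaviour as stated in this message only; a later reply in
the thread changes the INVALID_RECORD action to “Log an error and stop pushing
metrics”. The class, enum and constant names are hypothetical.

```java
// Hypothetical sketch mapping PushTelemetry error codes to client actions,
// as described in this message (INVALID_RECORD was revised later).
final class PushErrorHandlingSketch {
    enum PushError {
        INVALID_RECORD,
        UNKNOWN_SUBSCRIPTION_ID,
        UNSUPPORTED_COMPRESSION_TYPE
    }

    enum Action {
        // Log a warning and attempt the next push on the push interval.
        LOG_WARNING_AND_RETRY_ON_PUSH_INTERVAL,
        // Send a GetTelemetrySubscriptionsRequest immediately.
        RESUBSCRIBE_IMMEDIATELY
    }

    static Action actionFor(PushError error) {
        switch (error) {
            case INVALID_RECORD:
                return Action.LOG_WARNING_AND_RETRY_ON_PUSH_INTERVAL;
            case UNKNOWN_SUBSCRIPTION_ID:
            case UNSUPPORTED_COMPRESSION_TYPE:
                return Action.RESUBSCRIBE_IMMEDIATELY;
            default:
                throw new IllegalArgumentException("unhandled error: " + error);
        }
    }
}
```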

3) If a client attempts a subsequent handshake with a Null ClientInstanceId,
the receiving broker may not already know the client's existing
ClientInstanceId. If the receiving broker knows the existing ClientInstanceId,
it simply responds with the existing value. If it does not know the existing
ClientInstanceId, it will create a new client instance ID and respond with
that.
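
A minimal sketch of the decision in 3). How a broker would come to "know" an
existing id is not specified in this thread, so it is modelled here as an
`Optional` passed in by the caller; generating the new id with
`java.util.UUID` is likewise an assumption, and all names are hypothetical.

```java
import java.util.Optional;
import java.util.UUID;

// Hypothetical sketch of choosing the ClientInstanceId for a response.
final class ClientInstanceIdHandshakeSketch {
    // requestedId: the id sent by the client, or null on a "Null" handshake.
    // knownToBroker: an id this broker already associates with the client,
    // if any (how it is tracked is outside the scope of this sketch).
    static String resolve(String requestedId, Optional<String> knownToBroker) {
        if (requestedId != null) {
            return requestedId;
        }
        // Null id: return the known id, otherwise create a new one.
        return knownToBroker.orElseGet(() -> UUID.randomUUID().toString());
    }
}
```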

I will update the KIP with these clarifications.

Thanks,
Andrew

On 17 Jul 2023, at 14:21, Milind Luthra <milut...@confluent.io.INVALID> wrote:


Hi Andrew, thanks for this KIP.

I had a few questions regarding the "Error handling" section.

1. It mentions that "The 5 and 30 minute retries are to eventually trigger a
retry and avoid having to restart clients if the cluster metrics configuration
is disabled temporarily, e.g., by operator error, rolling upgrades, etc."
But this 30 min interval isn't mentioned anywhere else. What is it referring
to?

2. For the actual errors:
INVALID_RECORD : The action required is to "Log a warning to the application
and schedule the next GetTelemetrySubscriptionsRequest to 5 minutes". Why is
this 5 minutes, and not something like PushIntervalMs? And also, why are we
scheduling a GetTelemetrySubscriptionsRequest in this case, if the
serialization is broken?
UNKNOWN_SUBSCRIPTION_ID , UNSUPPORTED_COMPRESSION_TYPE : just to confirm, the
GetTelemetrySubscriptionsRequest needs to be scheduled immediately after the
PushTelemetry response, correct?

3. For "Subsequent GetTelemetrySubscriptionsRequests must include the
ClientInstanceId returned in the first response, regardless of broker": Will a
broker error be returned in case some implementation of this KIP violates this
accidentally and sends a request with ClientInstanceId = Null even when it's
been obtained already? Or will a new ClientInstanceId be returned without an
error?

Thanks!

On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:

Hi,
I would like to start a new discussion thread on KIP-714: Client metrics and
observability.

https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability


I have edited the proposal significantly to reduce the scope. The overall
mechanism for client metric subscriptions is unchanged, but the KIP is now
based on the existing client metrics, rather than introducing new metrics. The
purpose remains helping cluster operators investigate performance problems
experienced by clients without requiring changes to the client application
code or configuration.

Thanks,
Andrew
