Hi Andrew - Question on top of your answers: Do you think the client should actively search for a broker that supports this RPC? As previously mentioned, the broker uses the leastLoadedNode to find its first connection (am I correct?), and what if that broker doesn't support the metric push?
P On Fri, Sep 22, 2023 at 10:20 AM Andrew Schofield < andrew_schofield_j...@outlook.com> wrote: > Hi Kirk, > Thanks for your question. You are correct that the presence or absence of > the new RPCs in the > ApiVersionsResponse tells the client whether to request the telemetry > subscriptions and push > metrics. > > This is of course tricky in practice. It would be conceivable, as a > cluster is upgraded to AK 3.7 > or as a client metrics receiver plugin is deployed across the cluster, > that a client connects to some > brokers that support the new RPCs and some that do not. > > Here’s my suggestion: > * If a client is not connected to any brokers that support in the new > RPCs, it cannot push metrics. > * If a client is only connected to brokers that support the new RPCs, it > will use the new RPCs in > accordance with the KIP. > * If a client is connected to some brokers that support the new RPCs and > some that do not, it will > use the new RPCs with the supporting subset of brokers in accordance with > the KIP. > > Comments? > > Thanks, > Andrew > > > On 22 Sep 2023, at 16:01, Kirk True <k...@kirktrue.pro> wrote: > > > > Hi Andrew/Jun, > > > > I want to make sure I understand question/comment #119… In the case > where a cluster without a metrics client receiver is later reconfigured and > restarted to include a metrics client receiver, do we want the client to > thereafter begin pushing metrics to the cluster? From Andrew’s response to > question #119, it sounds like we’re using the presence/absence of the > relevant RPCs in ApiVersionsResponse as the to-push-or-not-to-push > indicator. Do I have that correct? > > > > Thanks, > > Kirk > > > >> On Sep 21, 2023, at 7:42 AM, Andrew Schofield < > andrew_schofield_j...@outlook.com> wrote: > >> > >> Hi Jun, > >> Thanks for your comments. I’ve updated the KIP to clarify where > necessary. > >> > >> 110. Yes, agree. The motivation section mentions this. > >> > >> 111. The replacement of ‘-‘ with ‘.’ for metric names and the > replacement of > >> ‘-‘ with ‘_’ for attribute keys is following the OTLP guidelines. I > think it’s a bit > >> of a debatable point. OTLP makes a distinction between a namespace and a > >> multi-word component. If it was “client.id” then “client” would be a > namespace with > >> an attribute key “id”. But “client_id” is just a key. So, it was > intentional, but debatable. > >> > >> 112. Thanks. The link target moved. Fixed. > >> > >> 113. Thanks. Fixed. > >> > >> 114.1. If a standard metric makes sense for a client, it should use the > exact same > >> name. If a standard metric doesn’t make sense for a client, then it can > omit that metric. > >> > >> For a required metric, the situation is stronger. All clients must > implement these > >> metrics with these names in order to implement the KIP. But the > required metrics > >> are essentially the number of connections and the request latency, > which do not > >> reference the underlying implementation of the client (which > producer.record.queue.time.max > >> of course does). > >> > >> I suppose someone might build a producer-only client that didn’t have > consumer metrics. > >> In this case, the consumer metrics would conceptually have the value 0 > and would not > >> need to be sent to the broker. > >> > >> 114.2. If a client does not implement some metrics, they will not be > available for > >> analysis and troubleshooting. It just makes the ability to combine > metrics from lots > >> different clients less complete. > >> > >> 115. I think it was probably a mistake to be so specific about > threading in this KIP. > >> When the consumer threading refactor is complete, of course, it would > do the appropriate > >> equivalent. I’ve added a clarification and massively simplified this > section. > >> > >> 116. I removed “client.terminating”. > >> > >> 117. Yes. Horrid. Fixed. > >> > >> 118. The Terminating flag just indicates that this is the final > PushTelemetryRequest > >> from this client. Any subsequent request will be rejected. I think this > flag should remain. > >> > >> 119. Good catch. This was actually contradicting another part of the > KIP. The current behaviour > >> is indeed preserved. If the broker doesn’t have a client metrics > receiver plugin, the new RPCs > >> in this KIP are “turned off” and not reported in ApiVersionsResponse. > The client will not > >> attempt to push metrics. > >> > >> 120. The error handling table lists the error codes for > PushTelemetryResponse. I’ve added one > >> but it looked good to me. GetTelemetrySubscriptions doesn’t have any > error codes, since the > >> situation in which the client telemetry is not supported is handled by > the RPCs not being offered > >> by the broker. > >> > >> 121. Again, I think it’s probably a mistake to be specific about > threading. Removed. > >> > >> 122. Good catch. For DescribeConfigs, the ACL operation should be > >> “DESCRIBE_CONFIGS”. For AlterConfigs, the ACL operation should be > >> “ALTER” (not “WRITE” as it said). The checks are made on the CLUSTER > >> resource. > >> > >> Thanks for the detailed review. > >> > >> Thanks, > >> Andrew > >> > >>> > >>> 110. Another potential motivation is the multiple clients support. > Some of > >>> the places may not have good monitoring support for non-java clients. > >>> > >>> 111. OpenTelemetry Naming: We replace '-' with '.' for metric name and > >>> replace '-' with '_' for attributes. Why is the inconsistency? > >>> > >>> 112. OTLP specification: Page is not found from the link. > >>> > >>> 113. "Defining standard and required metrics makes the monitoring and > >>> troubleshooting of clients from various client types ": Incomplete > sentence. > >>> > >>> 114. standard/required metrics > >>> 114.1 Do other clients need to implement those metrics with the exact > same > >>> names? > >>> 114.2 What happens if some of those metrics are missing from a client? > >>> > >>> 115. "KafkaConsumer: both the "heart beat" and application threads": We > >>> have an ongoing effort to refactor the consumer threading model ( > >>> > https://cwiki.apache.org/confluence/display/KAFKA/Consumer+threading+refactor+design > ). > >>> Once this is done, PRC requests will only be made from the background > >>> thread. Should this KIP follow the new model only? > >>> > >>> 116. 'The metrics should contain the reason for the client termination > by > >>> including the client.terminating metric with the label “reason” ...'. > Hmm, > >>> are we introducing a new metric client.terminating? If so, that needs > to be > >>> explicitly listed. > >>> > >>> 117. "As the metrics plugin may need to add additional metrics on top > of > >>> this the generic metrics receiver in the broker will not add these > labels > >>> but rely on the plugins to do so," The sentence doesn't read well. > >>> > >>> 118. "it is possible for the client to send at most one accepted > >>> out-of-profile per connection before the rate-limiter kicks in": If we > do > >>> this, do we still need the Terminating flag in PushTelemetryRequestV0? > >>> > >>> 119. "If there is no client metrics receiver plugin configured on the > >>> broker, it will respond to GetTelemetrySubscriptionsRequest with > >>> RequestedMetrics set to Null and a -1 SubscriptionId. The client should > >>> send a new GetTelemetrySubscriptionsRequest after the PushIntervalMs > has > >>> expired. This allows the metrics receiver to be enabled or disabled > without > >>> having to restart the broker or reset the client connection." > >>> "no client metrics receiver plugin configured" is defined by no metric > >>> reporter implementing the ClientTelemetry interface, right? In that > case, > >>> it would be useful to avoid the clients sending > >>> GetTelemetrySubscriptionsRequest periodically to preserve the current > >>> behavior. > >>> > >>> 120. GetTelemetrySubscriptionsResponseV0 and PushTelemetryRequestV0: > Could > >>> we list error codes for each? > >>> > >>> 121. "ClientTelemetryReceiver.ClientTelemetryReceiver This method may > be > >>> called from the request handling thread": Where else can this method be > >>> called? > >>> > >>> 122. DescribeConfigs/AlterConfigs already exist. Are we changing the > ACL? > >>> > >>> Thanks, > >>> > >>> Jun > >>> > >>> On Mon, Jul 31, 2023 at 4:33 AM Andrew Schofield < > >>> andrew_schofield_j...@outlook.com> wrote: > >>> > >>>> Hi Milind, > >>>> Thanks for your question. > >>>> > >>>> On reflection, I agree that INVALID_RECORD is most likely to be > caused by a > >>>> problem in the serialization in the client. I have changed the client > >>>> action in this case > >>>> to “Log an error and stop pushing metrics”. > >>>> > >>>> I have updated the KIP text accordingly. > >>>> > >>>> Thanks, > >>>> Andrew > >>>> > >>>>> On 31 Jul 2023, at 12:09, Milind Luthra > <milut...@confluent.io.INVALID> > >>>> wrote: > >>>>> > >>>>> Hi Andrew, > >>>>> Thanks for the clarifications. > >>>>> > >>>>> About 2b: > >>>>> In case a client has a bug while serializing, it might be difficult > for > >>>> the > >>>>> client to recover from that without code changes. In that, it might > be > >>>> good > >>>>> to just log the INVALID_RECORD as an error, and treat the error as > fatal > >>>>> for the client (only fatal in terms of sending the metrics, the > client > >>>> can > >>>>> keep functioning otherwise). What do you think? > >>>>> > >>>>> Thanks > >>>>> Milind > >>>>> > >>>>> On Mon, Jul 24, 2023 at 8:18 PM Andrew Schofield < > >>>>> andrew_schofield_j...@outlook.com> wrote: > >>>>> > >>>>>> Hi Milind, > >>>>>> Thanks for your questions about the KIP. > >>>>>> > >>>>>> 1) I did some archaeology and looked at historical versions of the > KIP. > >>>> I > >>>>>> think this is > >>>>>> just a mistake. 5 minutes is the default metric push interval. 30 > >>>> minutes > >>>>>> is a mystery > >>>>>> to me. I’ve updated the KIP. > >>>>>> > >>>>>> 2) I think there are two situations in which INVALID_RECORD might > occur. > >>>>>> a) The client might perhaps be using a content-type that the broker > does > >>>>>> not support. > >>>>>> The KIP mentions content-type as a future extension, but there’s > only > >>>> one > >>>>>> supported > >>>>>> to start with. Until we have multiple content-types, this seems out > of > >>>>>> scope. I think a > >>>>>> future KIP would add another error code for this. > >>>>>> b) The client might perhaps have a bug which means the metrics > payload > >>>> is > >>>>>> malformed. > >>>>>> Logging a warning and attempting the next metrics push on the push > >>>>>> interval seems > >>>>>> appropriate. > >>>>>> > >>>>>> UNKNOWN_SUBSCRIPTION_ID would indeed be handled by making an > immediate > >>>>>> GetTelemetrySubscriptionsRequest. > >>>>>> > >>>>>> UNSUPPORTED_COMPRESSION_TYPE seems like either a client bug or > perhaps > >>>>>> a situation in which a broker sends a compression type in a > >>>>>> GetTelemetrySubscriptionsResponse > >>>>>> which is subsequently not supported when its used with a > >>>>>> PushTelemetryRequest. > >>>>>> We do want the client to have the opportunity to get an up-to-date > list > >>>> of > >>>>>> supported > >>>>>> compression types. I think an immediate > GetTelemetrySubscriptionsRequest > >>>>>> is appropriate. > >>>>>> > >>>>>> 3) If a client attempts a subsequent handshake with a Null > >>>>>> ClientInstanceId, the > >>>>>> receiving broker may not already know the client's existing > >>>>>> ClientInstanceId. If the > >>>>>> receiving broker knows the existing ClientInstanceId, it simply > responds > >>>>>> the existing > >>>>>> value back to the client. If it does not know the existing > >>>>>> ClientInstanceId, it will create > >>>>>> a new client instance ID and respond with that. > >>>>>> > >>>>>> I will update the KIP with these clarifications. > >>>>>> > >>>>>> Thanks, > >>>>>> Andrew > >>>>>> > >>>>>>> On 17 Jul 2023, at 14:21, Milind Luthra > <milut...@confluent.io.INVALID > >>>>> > >>>>>> wrote: > >>>>>>> > >>>>>>> Hi Andrew, thanks for this KIP. > >>>>>>> > >>>>>>> I had a few questions regarding the "Error handling" section. > >>>>>>> > >>>>>>> 1. It mentions that "The 5 and 30 minute retries are to eventually > >>>>>> trigger > >>>>>>> a retry and avoid having to restart clients if the cluster metrics > >>>>>>> configuration is disabled temporarily, e.g., by operator error, > rolling > >>>>>>> upgrades, etc." > >>>>>>> But this 30 min interval isn't mentioned anywhere else. What is it > >>>>>>> referring to? > >>>>>>> > >>>>>>> 2. For the actual errors: > >>>>>>> INVALID_RECORD : The action required is to "Log a warning to the > >>>>>>> application and schedule the next GetTelemetrySubscriptionsRequest > to 5 > >>>>>>> minutes". Why is this 5 minutes, and not something like > PushIntervalMs? > >>>>>> And > >>>>>>> also, why are we scheduling a GetTelemetrySubscriptionsRequest in > this > >>>>>>> case, if the serialization is broken? > >>>>>>> UNKNOWN_SUBSCRIPTION_ID , UNSUPPORTED_COMPRESSION_TYPE : just to > >>>> confirm, > >>>>>>> the GetTelemetrySubscriptionsRequest needs to be scheduled > immediately > >>>>>>> after the PushTelemetry response, correct? > >>>>>>> > >>>>>>> 3. For "Subsequent GetTelemetrySubscriptionsRequests must include > the > >>>>>>> ClientInstanceId returned in the first response, regardless of > broker": > >>>>>>> Will a broker error be returned in case some implementation of > this KIP > >>>>>>> violates this accidentally and sends a request with > ClientInstanceId = > >>>>>> Null > >>>>>>> even when it's been obtained already? Or will a new > ClientInstanceId be > >>>>>>> returned without an error? > >>>>>>> > >>>>>>> Thanks! > >>>>>>> > >>>>>>> On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield < > >>>>>>> andrew_schofield_j...@outlook.com> wrote: > >>>>>>> > >>>>>>>> Hi, > >>>>>>>> I would like to start a new discussion thread on KIP-714: Client > >>>> metrics > >>>>>>>> and observability. > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>> > >>>> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability > >>>>>>>> > >>>>>>>> I have edited the proposal significantly to reduce the scope. The > >>>>>> overall > >>>>>>>> mechanism for client metric subscriptions is unchanged, but the > >>>>>>>> KIP is now based on the existing client metrics, rather than > >>>> introducing > >>>>>>>> new metrics. The purpose remains helping cluster operators > >>>>>>>> investigate performance problems experienced by clients without > >>>>>> requiring > >>>>>>>> changes to the client application code or configuration. > >>>>>>>> > >>>>>>>> Thanks, > >>>>>>>> Andrew > > >