Hi Andrew, Thanks for driving this one. I haven't read all the KIP yet but I already have an initial question. In the Threading section, it is written "KafkaConsumer: the "background" thread (based on the consumer threading refactor which is underway)". If I understand this correctly, it means that KIP-714 won't work if the "old consumer" is used. Am I correct?
Cheers, David On Fri, Sep 22, 2023 at 12:18 PM Andrew Schofield < andrew_schofield_j...@outlook.com> wrote: > Hi Philip, > No, I do not think it should actively search for a broker that supports > the new > RPCs. In general, either all of the brokers or none of the brokers will > support it. > In the window, where the cluster is being upgraded or client telemetry is > being > enabled, there might be a mixed situation. I wouldn’t put too much effort > into > this mixed scenario. As the client finds brokers which support the new > RPCs, > it can begin to follow the KIP-714 mechanism. > > Thanks, > Andrew > > > On 22 Sep 2023, at 20:01, Philip Nee <philip...@gmail.com> wrote: > > > > Hi Andrew - > > > > Question on top of your answers: Do you think the client should actively > > search for a broker that supports this RPC? As previously mentioned, the > > broker uses the leastLoadedNode to find its first connection (am > > I correct?), and what if that broker doesn't support the metric push? > > > > P > > > > On Fri, Sep 22, 2023 at 10:20 AM Andrew Schofield < > > andrew_schofield_j...@outlook.com> wrote: > > > >> Hi Kirk, > >> Thanks for your question. You are correct that the presence or absence > of > >> the new RPCs in the > >> ApiVersionsResponse tells the client whether to request the telemetry > >> subscriptions and push > >> metrics. > >> > >> This is of course tricky in practice. It would be conceivable, as a > >> cluster is upgraded to AK 3.7 > >> or as a client metrics receiver plugin is deployed across the cluster, > >> that a client connects to some > >> brokers that support the new RPCs and some that do not. > >> > >> Here’s my suggestion: > >> * If a client is not connected to any brokers that support in the new > >> RPCs, it cannot push metrics. > >> * If a client is only connected to brokers that support the new RPCs, it > >> will use the new RPCs in > >> accordance with the KIP. > >> * If a client is connected to some brokers that support the new RPCs and > >> some that do not, it will > >> use the new RPCs with the supporting subset of brokers in accordance > with > >> the KIP. > >> > >> Comments? > >> > >> Thanks, > >> Andrew > >> > >>> On 22 Sep 2023, at 16:01, Kirk True <k...@kirktrue.pro> wrote: > >>> > >>> Hi Andrew/Jun, > >>> > >>> I want to make sure I understand question/comment #119… In the case > >> where a cluster without a metrics client receiver is later reconfigured > and > >> restarted to include a metrics client receiver, do we want the client to > >> thereafter begin pushing metrics to the cluster? From Andrew’s response > to > >> question #119, it sounds like we’re using the presence/absence of the > >> relevant RPCs in ApiVersionsResponse as the to-push-or-not-to-push > >> indicator. Do I have that correct? > >>> > >>> Thanks, > >>> Kirk > >>> > >>>> On Sep 21, 2023, at 7:42 AM, Andrew Schofield < > >> andrew_schofield_j...@outlook.com> wrote: > >>>> > >>>> Hi Jun, > >>>> Thanks for your comments. I’ve updated the KIP to clarify where > >> necessary. > >>>> > >>>> 110. Yes, agree. The motivation section mentions this. > >>>> > >>>> 111. The replacement of ‘-‘ with ‘.’ for metric names and the > >> replacement of > >>>> ‘-‘ with ‘_’ for attribute keys is following the OTLP guidelines. I > >> think it’s a bit > >>>> of a debatable point. OTLP makes a distinction between a namespace > and a > >>>> multi-word component. If it was “client.id” then “client” would be a > >> namespace with > >>>> an attribute key “id”. But “client_id” is just a key. So, it was > >> intentional, but debatable. > >>>> > >>>> 112. Thanks. The link target moved. Fixed. > >>>> > >>>> 113. Thanks. Fixed. > >>>> > >>>> 114.1. If a standard metric makes sense for a client, it should use > the > >> exact same > >>>> name. If a standard metric doesn’t make sense for a client, then it > can > >> omit that metric. > >>>> > >>>> For a required metric, the situation is stronger. All clients must > >> implement these > >>>> metrics with these names in order to implement the KIP. But the > >> required metrics > >>>> are essentially the number of connections and the request latency, > >> which do not > >>>> reference the underlying implementation of the client (which > >> producer.record.queue.time.max > >>>> of course does). > >>>> > >>>> I suppose someone might build a producer-only client that didn’t have > >> consumer metrics. > >>>> In this case, the consumer metrics would conceptually have the value 0 > >> and would not > >>>> need to be sent to the broker. > >>>> > >>>> 114.2. If a client does not implement some metrics, they will not be > >> available for > >>>> analysis and troubleshooting. It just makes the ability to combine > >> metrics from lots > >>>> different clients less complete. > >>>> > >>>> 115. I think it was probably a mistake to be so specific about > >> threading in this KIP. > >>>> When the consumer threading refactor is complete, of course, it would > >> do the appropriate > >>>> equivalent. I’ve added a clarification and massively simplified this > >> section. > >>>> > >>>> 116. I removed “client.terminating”. > >>>> > >>>> 117. Yes. Horrid. Fixed. > >>>> > >>>> 118. The Terminating flag just indicates that this is the final > >> PushTelemetryRequest > >>>> from this client. Any subsequent request will be rejected. I think > this > >> flag should remain. > >>>> > >>>> 119. Good catch. This was actually contradicting another part of the > >> KIP. The current behaviour > >>>> is indeed preserved. If the broker doesn’t have a client metrics > >> receiver plugin, the new RPCs > >>>> in this KIP are “turned off” and not reported in ApiVersionsResponse. > >> The client will not > >>>> attempt to push metrics. > >>>> > >>>> 120. The error handling table lists the error codes for > >> PushTelemetryResponse. I’ve added one > >>>> but it looked good to me. GetTelemetrySubscriptions doesn’t have any > >> error codes, since the > >>>> situation in which the client telemetry is not supported is handled by > >> the RPCs not being offered > >>>> by the broker. > >>>> > >>>> 121. Again, I think it’s probably a mistake to be specific about > >> threading. Removed. > >>>> > >>>> 122. Good catch. For DescribeConfigs, the ACL operation should be > >>>> “DESCRIBE_CONFIGS”. For AlterConfigs, the ACL operation should be > >>>> “ALTER” (not “WRITE” as it said). The checks are made on the CLUSTER > >>>> resource. > >>>> > >>>> Thanks for the detailed review. > >>>> > >>>> Thanks, > >>>> Andrew > >>>> > >>>>> > >>>>> 110. Another potential motivation is the multiple clients support. > >> Some of > >>>>> the places may not have good monitoring support for non-java clients. > >>>>> > >>>>> 111. OpenTelemetry Naming: We replace '-' with '.' for metric name > and > >>>>> replace '-' with '_' for attributes. Why is the inconsistency? > >>>>> > >>>>> 112. OTLP specification: Page is not found from the link. > >>>>> > >>>>> 113. "Defining standard and required metrics makes the monitoring and > >>>>> troubleshooting of clients from various client types ": Incomplete > >> sentence. > >>>>> > >>>>> 114. standard/required metrics > >>>>> 114.1 Do other clients need to implement those metrics with the exact > >> same > >>>>> names? > >>>>> 114.2 What happens if some of those metrics are missing from a > client? > >>>>> > >>>>> 115. "KafkaConsumer: both the "heart beat" and application threads": > We > >>>>> have an ongoing effort to refactor the consumer threading model ( > >>>>> > >> > https://cwiki.apache.org/confluence/display/KAFKA/Consumer+threading+refactor+design > >> ). > >>>>> Once this is done, PRC requests will only be made from the background > >>>>> thread. Should this KIP follow the new model only? > >>>>> > >>>>> 116. 'The metrics should contain the reason for the client > termination > >> by > >>>>> including the client.terminating metric with the label “reason” ...'. > >> Hmm, > >>>>> are we introducing a new metric client.terminating? If so, that needs > >> to be > >>>>> explicitly listed. > >>>>> > >>>>> 117. "As the metrics plugin may need to add additional metrics on top > >> of > >>>>> this the generic metrics receiver in the broker will not add these > >> labels > >>>>> but rely on the plugins to do so," The sentence doesn't read well. > >>>>> > >>>>> 118. "it is possible for the client to send at most one accepted > >>>>> out-of-profile per connection before the rate-limiter kicks in": If > we > >> do > >>>>> this, do we still need the Terminating flag in > PushTelemetryRequestV0? > >>>>> > >>>>> 119. "If there is no client metrics receiver plugin configured on the > >>>>> broker, it will respond to GetTelemetrySubscriptionsRequest with > >>>>> RequestedMetrics set to Null and a -1 SubscriptionId. The client > should > >>>>> send a new GetTelemetrySubscriptionsRequest after the PushIntervalMs > >> has > >>>>> expired. This allows the metrics receiver to be enabled or disabled > >> without > >>>>> having to restart the broker or reset the client connection." > >>>>> "no client metrics receiver plugin configured" is defined by no > metric > >>>>> reporter implementing the ClientTelemetry interface, right? In that > >> case, > >>>>> it would be useful to avoid the clients sending > >>>>> GetTelemetrySubscriptionsRequest periodically to preserve the current > >>>>> behavior. > >>>>> > >>>>> 120. GetTelemetrySubscriptionsResponseV0 and PushTelemetryRequestV0: > >> Could > >>>>> we list error codes for each? > >>>>> > >>>>> 121. "ClientTelemetryReceiver.ClientTelemetryReceiver This method may > >> be > >>>>> called from the request handling thread": Where else can this method > be > >>>>> called? > >>>>> > >>>>> 122. DescribeConfigs/AlterConfigs already exist. Are we changing the > >> ACL? > >>>>> > >>>>> Thanks, > >>>>> > >>>>> Jun > >>>>> > >>>>> On Mon, Jul 31, 2023 at 4:33 AM Andrew Schofield < > >>>>> andrew_schofield_j...@outlook.com> wrote: > >>>>> > >>>>>> Hi Milind, > >>>>>> Thanks for your question. > >>>>>> > >>>>>> On reflection, I agree that INVALID_RECORD is most likely to be > >> caused by a > >>>>>> problem in the serialization in the client. I have changed the > client > >>>>>> action in this case > >>>>>> to “Log an error and stop pushing metrics”. > >>>>>> > >>>>>> I have updated the KIP text accordingly. > >>>>>> > >>>>>> Thanks, > >>>>>> Andrew > >>>>>> > >>>>>>> On 31 Jul 2023, at 12:09, Milind Luthra > >> <milut...@confluent.io.INVALID> > >>>>>> wrote: > >>>>>>> > >>>>>>> Hi Andrew, > >>>>>>> Thanks for the clarifications. > >>>>>>> > >>>>>>> About 2b: > >>>>>>> In case a client has a bug while serializing, it might be difficult > >> for > >>>>>> the > >>>>>>> client to recover from that without code changes. In that, it might > >> be > >>>>>> good > >>>>>>> to just log the INVALID_RECORD as an error, and treat the error as > >> fatal > >>>>>>> for the client (only fatal in terms of sending the metrics, the > >> client > >>>>>> can > >>>>>>> keep functioning otherwise). What do you think? > >>>>>>> > >>>>>>> Thanks > >>>>>>> Milind > >>>>>>> > >>>>>>> On Mon, Jul 24, 2023 at 8:18 PM Andrew Schofield < > >>>>>>> andrew_schofield_j...@outlook.com> wrote: > >>>>>>> > >>>>>>>> Hi Milind, > >>>>>>>> Thanks for your questions about the KIP. > >>>>>>>> > >>>>>>>> 1) I did some archaeology and looked at historical versions of the > >> KIP. > >>>>>> I > >>>>>>>> think this is > >>>>>>>> just a mistake. 5 minutes is the default metric push interval. 30 > >>>>>> minutes > >>>>>>>> is a mystery > >>>>>>>> to me. I’ve updated the KIP. > >>>>>>>> > >>>>>>>> 2) I think there are two situations in which INVALID_RECORD might > >> occur. > >>>>>>>> a) The client might perhaps be using a content-type that the > broker > >> does > >>>>>>>> not support. > >>>>>>>> The KIP mentions content-type as a future extension, but there’s > >> only > >>>>>> one > >>>>>>>> supported > >>>>>>>> to start with. Until we have multiple content-types, this seems > out > >> of > >>>>>>>> scope. I think a > >>>>>>>> future KIP would add another error code for this. > >>>>>>>> b) The client might perhaps have a bug which means the metrics > >> payload > >>>>>> is > >>>>>>>> malformed. > >>>>>>>> Logging a warning and attempting the next metrics push on the push > >>>>>>>> interval seems > >>>>>>>> appropriate. > >>>>>>>> > >>>>>>>> UNKNOWN_SUBSCRIPTION_ID would indeed be handled by making an > >> immediate > >>>>>>>> GetTelemetrySubscriptionsRequest. > >>>>>>>> > >>>>>>>> UNSUPPORTED_COMPRESSION_TYPE seems like either a client bug or > >> perhaps > >>>>>>>> a situation in which a broker sends a compression type in a > >>>>>>>> GetTelemetrySubscriptionsResponse > >>>>>>>> which is subsequently not supported when its used with a > >>>>>>>> PushTelemetryRequest. > >>>>>>>> We do want the client to have the opportunity to get an up-to-date > >> list > >>>>>> of > >>>>>>>> supported > >>>>>>>> compression types. I think an immediate > >> GetTelemetrySubscriptionsRequest > >>>>>>>> is appropriate. > >>>>>>>> > >>>>>>>> 3) If a client attempts a subsequent handshake with a Null > >>>>>>>> ClientInstanceId, the > >>>>>>>> receiving broker may not already know the client's existing > >>>>>>>> ClientInstanceId. If the > >>>>>>>> receiving broker knows the existing ClientInstanceId, it simply > >> responds > >>>>>>>> the existing > >>>>>>>> value back to the client. If it does not know the existing > >>>>>>>> ClientInstanceId, it will create > >>>>>>>> a new client instance ID and respond with that. > >>>>>>>> > >>>>>>>> I will update the KIP with these clarifications. > >>>>>>>> > >>>>>>>> Thanks, > >>>>>>>> Andrew > >>>>>>>> > >>>>>>>>> On 17 Jul 2023, at 14:21, Milind Luthra > >> <milut...@confluent.io.INVALID > >>>>>>> > >>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>> Hi Andrew, thanks for this KIP. > >>>>>>>>> > >>>>>>>>> I had a few questions regarding the "Error handling" section. > >>>>>>>>> > >>>>>>>>> 1. It mentions that "The 5 and 30 minute retries are to > eventually > >>>>>>>> trigger > >>>>>>>>> a retry and avoid having to restart clients if the cluster > metrics > >>>>>>>>> configuration is disabled temporarily, e.g., by operator error, > >> rolling > >>>>>>>>> upgrades, etc." > >>>>>>>>> But this 30 min interval isn't mentioned anywhere else. What is > it > >>>>>>>>> referring to? > >>>>>>>>> > >>>>>>>>> 2. For the actual errors: > >>>>>>>>> INVALID_RECORD : The action required is to "Log a warning to the > >>>>>>>>> application and schedule the next > GetTelemetrySubscriptionsRequest > >> to 5 > >>>>>>>>> minutes". Why is this 5 minutes, and not something like > >> PushIntervalMs? > >>>>>>>> And > >>>>>>>>> also, why are we scheduling a GetTelemetrySubscriptionsRequest in > >> this > >>>>>>>>> case, if the serialization is broken? > >>>>>>>>> UNKNOWN_SUBSCRIPTION_ID , UNSUPPORTED_COMPRESSION_TYPE : just to > >>>>>> confirm, > >>>>>>>>> the GetTelemetrySubscriptionsRequest needs to be scheduled > >> immediately > >>>>>>>>> after the PushTelemetry response, correct? > >>>>>>>>> > >>>>>>>>> 3. For "Subsequent GetTelemetrySubscriptionsRequests must include > >> the > >>>>>>>>> ClientInstanceId returned in the first response, regardless of > >> broker": > >>>>>>>>> Will a broker error be returned in case some implementation of > >> this KIP > >>>>>>>>> violates this accidentally and sends a request with > >> ClientInstanceId = > >>>>>>>> Null > >>>>>>>>> even when it's been obtained already? Or will a new > >> ClientInstanceId be > >>>>>>>>> returned without an error? > >>>>>>>>> > >>>>>>>>> Thanks! > >>>>>>>>> > >>>>>>>>> On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield < > >>>>>>>>> andrew_schofield_j...@outlook.com> wrote: > >>>>>>>>> > >>>>>>>>>> Hi, > >>>>>>>>>> I would like to start a new discussion thread on KIP-714: Client > >>>>>> metrics > >>>>>>>>>> and observability. > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>> > >>>>>> > >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability > >>>>>>>>>> > >>>>>>>>>> I have edited the proposal significantly to reduce the scope. > The > >>>>>>>> overall > >>>>>>>>>> mechanism for client metric subscriptions is unchanged, but the > >>>>>>>>>> KIP is now based on the existing client metrics, rather than > >>>>>> introducing > >>>>>>>>>> new metrics. The purpose remains helping cluster operators > >>>>>>>>>> investigate performance problems experienced by clients without > >>>>>>>> requiring > >>>>>>>>>> changes to the client application code or configuration. > >>>>>>>>>> > >>>>>>>>>> Thanks, > >>>>>>>>>> Andrew > > >