Hi Andrew/Jun, I want to make sure I understand question/comment #119… In the case where a cluster without a metrics client receiver is later reconfigured and restarted to include a metrics client receiver, do we want the client to thereafter begin pushing metrics to the cluster? From Andrew’s response to question #119, it sounds like we’re using the presence/absence of the relevant RPCs in ApiVersionsResponse as the to-push-or-not-to-push indicator. Do I have that correct?
Thanks, Kirk > On Sep 21, 2023, at 7:42 AM, Andrew Schofield > <andrew_schofield_j...@outlook.com> wrote: > > Hi Jun, > Thanks for your comments. I’ve updated the KIP to clarify where necessary. > > 110. Yes, agree. The motivation section mentions this. > > 111. The replacement of ‘-‘ with ‘.’ for metric names and the replacement of > ‘-‘ with ‘_’ for attribute keys is following the OTLP guidelines. I think > it’s a bit > of a debatable point. OTLP makes a distinction between a namespace and a > multi-word component. If it was “client.id” then “client” would be a > namespace with > an attribute key “id”. But “client_id” is just a key. So, it was intentional, > but debatable. > > 112. Thanks. The link target moved. Fixed. > > 113. Thanks. Fixed. > > 114.1. If a standard metric makes sense for a client, it should use the exact > same > name. If a standard metric doesn’t make sense for a client, then it can omit > that metric. > > For a required metric, the situation is stronger. All clients must implement > these > metrics with these names in order to implement the KIP. But the required > metrics > are essentially the number of connections and the request latency, which do > not > reference the underlying implementation of the client (which > producer.record.queue.time.max > of course does). > > I suppose someone might build a producer-only client that didn’t have > consumer metrics. > In this case, the consumer metrics would conceptually have the value 0 and > would not > need to be sent to the broker. > > 114.2. If a client does not implement some metrics, they will not be > available for > analysis and troubleshooting. It just makes the ability to combine metrics > from lots > different clients less complete. > > 115. I think it was probably a mistake to be so specific about threading in > this KIP. > When the consumer threading refactor is complete, of course, it would do the > appropriate > equivalent. I’ve added a clarification and massively simplified this section. > > 116. I removed “client.terminating”. > > 117. Yes. Horrid. Fixed. > > 118. The Terminating flag just indicates that this is the final > PushTelemetryRequest > from this client. Any subsequent request will be rejected. I think this flag > should remain. > > 119. Good catch. This was actually contradicting another part of the KIP. The > current behaviour > is indeed preserved. If the broker doesn’t have a client metrics receiver > plugin, the new RPCs > in this KIP are “turned off” and not reported in ApiVersionsResponse. The > client will not > attempt to push metrics. > > 120. The error handling table lists the error codes for > PushTelemetryResponse. I’ve added one > but it looked good to me. GetTelemetrySubscriptions doesn’t have any error > codes, since the > situation in which the client telemetry is not supported is handled by the > RPCs not being offered > by the broker. > > 121. Again, I think it’s probably a mistake to be specific about threading. > Removed. > > 122. Good catch. For DescribeConfigs, the ACL operation should be > “DESCRIBE_CONFIGS”. For AlterConfigs, the ACL operation should be > “ALTER” (not “WRITE” as it said). The checks are made on the CLUSTER > resource. > > Thanks for the detailed review. > > Thanks, > Andrew > >> >> 110. Another potential motivation is the multiple clients support. Some of >> the places may not have good monitoring support for non-java clients. >> >> 111. OpenTelemetry Naming: We replace '-' with '.' for metric name and >> replace '-' with '_' for attributes. Why is the inconsistency? >> >> 112. OTLP specification: Page is not found from the link. >> >> 113. "Defining standard and required metrics makes the monitoring and >> troubleshooting of clients from various client types ": Incomplete sentence. >> >> 114. standard/required metrics >> 114.1 Do other clients need to implement those metrics with the exact same >> names? >> 114.2 What happens if some of those metrics are missing from a client? >> >> 115. "KafkaConsumer: both the "heart beat" and application threads": We >> have an ongoing effort to refactor the consumer threading model ( >> https://cwiki.apache.org/confluence/display/KAFKA/Consumer+threading+refactor+design). >> Once this is done, PRC requests will only be made from the background >> thread. Should this KIP follow the new model only? >> >> 116. 'The metrics should contain the reason for the client termination by >> including the client.terminating metric with the label “reason” ...'. Hmm, >> are we introducing a new metric client.terminating? If so, that needs to be >> explicitly listed. >> >> 117. "As the metrics plugin may need to add additional metrics on top of >> this the generic metrics receiver in the broker will not add these labels >> but rely on the plugins to do so," The sentence doesn't read well. >> >> 118. "it is possible for the client to send at most one accepted >> out-of-profile per connection before the rate-limiter kicks in": If we do >> this, do we still need the Terminating flag in PushTelemetryRequestV0? >> >> 119. "If there is no client metrics receiver plugin configured on the >> broker, it will respond to GetTelemetrySubscriptionsRequest with >> RequestedMetrics set to Null and a -1 SubscriptionId. The client should >> send a new GetTelemetrySubscriptionsRequest after the PushIntervalMs has >> expired. This allows the metrics receiver to be enabled or disabled without >> having to restart the broker or reset the client connection." >> "no client metrics receiver plugin configured" is defined by no metric >> reporter implementing the ClientTelemetry interface, right? In that case, >> it would be useful to avoid the clients sending >> GetTelemetrySubscriptionsRequest periodically to preserve the current >> behavior. >> >> 120. GetTelemetrySubscriptionsResponseV0 and PushTelemetryRequestV0: Could >> we list error codes for each? >> >> 121. "ClientTelemetryReceiver.ClientTelemetryReceiver This method may be >> called from the request handling thread": Where else can this method be >> called? >> >> 122. DescribeConfigs/AlterConfigs already exist. Are we changing the ACL? >> >> Thanks, >> >> Jun >> >> On Mon, Jul 31, 2023 at 4:33 AM Andrew Schofield < >> andrew_schofield_j...@outlook.com> wrote: >> >>> Hi Milind, >>> Thanks for your question. >>> >>> On reflection, I agree that INVALID_RECORD is most likely to be caused by a >>> problem in the serialization in the client. I have changed the client >>> action in this case >>> to “Log an error and stop pushing metrics”. >>> >>> I have updated the KIP text accordingly. >>> >>> Thanks, >>> Andrew >>> >>>> On 31 Jul 2023, at 12:09, Milind Luthra <milut...@confluent.io.INVALID> >>> wrote: >>>> >>>> Hi Andrew, >>>> Thanks for the clarifications. >>>> >>>> About 2b: >>>> In case a client has a bug while serializing, it might be difficult for >>> the >>>> client to recover from that without code changes. In that, it might be >>> good >>>> to just log the INVALID_RECORD as an error, and treat the error as fatal >>>> for the client (only fatal in terms of sending the metrics, the client >>> can >>>> keep functioning otherwise). What do you think? >>>> >>>> Thanks >>>> Milind >>>> >>>> On Mon, Jul 24, 2023 at 8:18 PM Andrew Schofield < >>>> andrew_schofield_j...@outlook.com> wrote: >>>> >>>>> Hi Milind, >>>>> Thanks for your questions about the KIP. >>>>> >>>>> 1) I did some archaeology and looked at historical versions of the KIP. >>> I >>>>> think this is >>>>> just a mistake. 5 minutes is the default metric push interval. 30 >>> minutes >>>>> is a mystery >>>>> to me. I’ve updated the KIP. >>>>> >>>>> 2) I think there are two situations in which INVALID_RECORD might occur. >>>>> a) The client might perhaps be using a content-type that the broker does >>>>> not support. >>>>> The KIP mentions content-type as a future extension, but there’s only >>> one >>>>> supported >>>>> to start with. Until we have multiple content-types, this seems out of >>>>> scope. I think a >>>>> future KIP would add another error code for this. >>>>> b) The client might perhaps have a bug which means the metrics payload >>> is >>>>> malformed. >>>>> Logging a warning and attempting the next metrics push on the push >>>>> interval seems >>>>> appropriate. >>>>> >>>>> UNKNOWN_SUBSCRIPTION_ID would indeed be handled by making an immediate >>>>> GetTelemetrySubscriptionsRequest. >>>>> >>>>> UNSUPPORTED_COMPRESSION_TYPE seems like either a client bug or perhaps >>>>> a situation in which a broker sends a compression type in a >>>>> GetTelemetrySubscriptionsResponse >>>>> which is subsequently not supported when its used with a >>>>> PushTelemetryRequest. >>>>> We do want the client to have the opportunity to get an up-to-date list >>> of >>>>> supported >>>>> compression types. I think an immediate GetTelemetrySubscriptionsRequest >>>>> is appropriate. >>>>> >>>>> 3) If a client attempts a subsequent handshake with a Null >>>>> ClientInstanceId, the >>>>> receiving broker may not already know the client's existing >>>>> ClientInstanceId. If the >>>>> receiving broker knows the existing ClientInstanceId, it simply responds >>>>> the existing >>>>> value back to the client. If it does not know the existing >>>>> ClientInstanceId, it will create >>>>> a new client instance ID and respond with that. >>>>> >>>>> I will update the KIP with these clarifications. >>>>> >>>>> Thanks, >>>>> Andrew >>>>> >>>>>> On 17 Jul 2023, at 14:21, Milind Luthra <milut...@confluent.io.INVALID >>>> >>>>> wrote: >>>>>> >>>>>> Hi Andrew, thanks for this KIP. >>>>>> >>>>>> I had a few questions regarding the "Error handling" section. >>>>>> >>>>>> 1. It mentions that "The 5 and 30 minute retries are to eventually >>>>> trigger >>>>>> a retry and avoid having to restart clients if the cluster metrics >>>>>> configuration is disabled temporarily, e.g., by operator error, rolling >>>>>> upgrades, etc." >>>>>> But this 30 min interval isn't mentioned anywhere else. What is it >>>>>> referring to? >>>>>> >>>>>> 2. For the actual errors: >>>>>> INVALID_RECORD : The action required is to "Log a warning to the >>>>>> application and schedule the next GetTelemetrySubscriptionsRequest to 5 >>>>>> minutes". Why is this 5 minutes, and not something like PushIntervalMs? >>>>> And >>>>>> also, why are we scheduling a GetTelemetrySubscriptionsRequest in this >>>>>> case, if the serialization is broken? >>>>>> UNKNOWN_SUBSCRIPTION_ID , UNSUPPORTED_COMPRESSION_TYPE : just to >>> confirm, >>>>>> the GetTelemetrySubscriptionsRequest needs to be scheduled immediately >>>>>> after the PushTelemetry response, correct? >>>>>> >>>>>> 3. For "Subsequent GetTelemetrySubscriptionsRequests must include the >>>>>> ClientInstanceId returned in the first response, regardless of broker": >>>>>> Will a broker error be returned in case some implementation of this KIP >>>>>> violates this accidentally and sends a request with ClientInstanceId = >>>>> Null >>>>>> even when it's been obtained already? Or will a new ClientInstanceId be >>>>>> returned without an error? >>>>>> >>>>>> Thanks! >>>>>> >>>>>> On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield < >>>>>> andrew_schofield_j...@outlook.com> wrote: >>>>>> >>>>>>> Hi, >>>>>>> I would like to start a new discussion thread on KIP-714: Client >>> metrics >>>>>>> and observability. >>>>>>> >>>>>>> >>>>>>> >>>>> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability >>>>>>> >>>>>>> I have edited the proposal significantly to reduce the scope. The >>>>> overall >>>>>>> mechanism for client metric subscriptions is unchanged, but the >>>>>>> KIP is now based on the existing client metrics, rather than >>> introducing >>>>>>> new metrics. The purpose remains helping cluster operators >>>>>>> investigate performance problems experienced by clients without >>>>> requiring >>>>>>> changes to the client application code or configuration. >>>>>>> >>>>>>> Thanks, >>>>>>> Andrew > >