Hi Jun, Thanks for your comments. 130. As Matthias described, and I am adding to the KIP, the `KafkaStreams#clientInstanceIds` method is only permitted when the state is RUNNING or REBALANCING. Also, clients can be added dynamically so the maps might change over time. If it’s in a permitted state, the method is prepared to wait up to the supplied timeout to get the client instance ids. It does not return a partial result - it returns a result or fails.
131. I’ve refactored the `ClientsInstanceIds` object and the global consumer is now part of the map of consumers. There is no need for the Optional any longer. I’ve also renamed it `ClientInstanceIds`. 132. My reading of `(kafka.network:type=RequestMetrics,name=RequestsPerSec,request=*)` is that It does not support every request type - it supports Produce, FetchConsumer and FetchFollower. Consequently, I think the ClientMetricsSubscriptionRequestCount is not instantly obsolete. If I’ve misunderstood, please let me know. Thanks, Andrew > On 12 Oct 2023, at 01:07, Jun Rao <j...@confluent.io.INVALID> wrote: > > Hi, Andrew, > > Thanks for the updated KIP. Just a few more minor comments. > > 130. KafkaStreams.clientsInstanceId(Duration timeout): Does it wait for all > consumer/producer/adminClient instances to be initialized? Are all those > instances created during KafkaStreams initialization? > > 131. Why does globalConsumerInstanceId() return Optional<String> while > other consumer instances don't return Optional? > > 132. ClientMetricsSubscriptionRequestCount: Do we need this since we have a > set of generic metrics > (kafka.network:type=RequestMetrics,name=RequestsPerSec,request=*) that > report Request rate for every request type? > > Thanks, > > Jun > > On Wed, Oct 11, 2023 at 1:47 PM Matthias J. Sax <mj...@apache.org> wrote: > >> Thanks! >> >> On 10/10/23 11:31 PM, Andrew Schofield wrote: >>> Matthias, >>> Yes, I think that’s a sensible way forward and the interface you propose >> looks good. I’ll update the KIP accordingly. >>> >>> Thanks, >>> Andrew >>> >>>> On 10 Oct 2023, at 23:01, Matthias J. Sax <mj...@apache.org> wrote: >>>> >>>> Andrew, >>>> >>>> yes I would like to get this change into KIP-714 right way. Seems to be >> important, as we don't know if/when a follow-up KIP for Kafka Streams would >> land. >>>> >>>> I was also thinking (and discussed with a few others) how to expose it, >> and we would propose the following: >>>> >>>> We add a new method to `KafkaStreams` class: >>>> >>>> public ClientsInstanceIds clientsInstanceIds(Duration timeout); >>>> >>>> The returned object is like below: >>>> >>>> public class ClientsInstanceIds { >>>> // we only have a single admin client per KS instance >>>> String adminInstanceId(); >>>> >>>> // we only have a single global consumer per KS instance (if any) >>>> // Optional<> because we might not have global-thread >>>> Optional<String> globalConsumerInstanceId(); >>>> >>>> // return a <threadKey -> ClientInstanceId> mapping >>>> // for the underlying (restore-)consumers/producers >>>> Map<String, String> mainConsumerInstanceIds(); >>>> Map<String, String> restoreConsumerInstanceIds(); >>>> Map<String, String> producerInstanceIds(); >>>> } >>>> >>>> For the `threadKey`, we would use some pattern like this: >>>> >>>> [Stream|StateUpdater]Thread-<threadIdx> >>>> >>>> >>>> Would this work from your POV? >>>> >>>> >>>> >>>> -Matthias >>>> >>>> >>>> On 10/9/23 2:15 AM, Andrew Schofield wrote: >>>>> Hi Matthias, >>>>> Good point. Makes sense to me. >>>>> Is this something that can also be included in the proposed Kafka >> Streams follow-on KIP, or would you prefer that I add it to KIP-714? >>>>> I have a slight preference for the former to put all of the KS >> enhancements into a separate KIP. >>>>> Thanks, >>>>> Andrew >>>>>> On 7 Oct 2023, at 02:12, Matthias J. Sax <mj...@apache.org> wrote: >>>>>> >>>>>> Thanks Andrew. SGTM. >>>>>> >>>>>> One point you did not address is the idea to add a method to >> `KafkaStreams` similar to the proposed `clientInstanceId()` that will be >> added to consumer/producer/admin clients. >>>>>> >>>>>> Without addressing this, Kafka Streams users won't have a way to get >> the assigned `instanceId` of the internally created clients, and thus it >> would be very difficult for them to know which metrics that the broker >> receives belong to a Kafka Streams app. It seems they would only find the >> `instanceIds` in the log4j output if they enable client logging? >>>>>> >>>>>> Of course, because there is multiple clients inside Kafka Streams, >> the return type cannot be an single "String", but must be some some complex >> data structure -- we could either add a new class, or return a >> Map<String,String> using a client key that maps to the `instanceId`. >>>>>> >>>>>> For example we could use the following key: >>>>>> >>>>>> [Global]StreamThread[-<threadIndex>][-restore][consumer|producer] >>>>>> >>>>>> (Of course, only the valid combination.) >>>>>> >>>>>> Or maybe even better, we might want to return a `Future` because >> collection all the `instanceId` might be a blocking all on each client? I >> have already a few idea how it could be implemented but I don't think it >> must be discussed on the KIP, as it's an implementation detail. >>>>>> >>>>>> Thoughts? >>>>>> >>>>>> >>>>>> -Matthias >>>>>> >>>>>> On 10/6/23 4:21 AM, Andrew Schofield wrote: >>>>>>> Hi Matthias, >>>>>>> Thanks for your comments. I agree that a follow-up KIP for Kafka >> Streams makes sense. This KIP currently has made a bit >>>>>>> of an effort to embrace KS, but it’s not enough by a long way. >>>>>>> I have removed `application.id <http://application.id/>`. This >> should be done properly in the follow-up KIP. I don’t believe there’s a >> downside to >>>>>>> removing it from this KIP. >>>>>>> I have reworded the statement about temporarily. In practice, the >> implementation of this KIP that’s going on while the voting >>>>>>> progresses happens to use delta temporality, but that’s an >> implementation detail. Supporting clients must support both >>>>>>> temporalities. >>>>>>> I thought about exposing the client instance ID as a metric, but >> non-numeric metrics are not usual practice and tools >>>>>>> do not universally support them. I don’t think the KIP is improved >> by adding one now. >>>>>>> I have also added constants for the various Config classes for >> ENABLE_METRICS_PUSH_CONFIG, including to >>>>>>> StreamsConfig. It’s best to be explicit about this. >>>>>>> Thanks, >>>>>>> Andrew >>>>>>>> On 2 Oct 2023, at 23:47, Matthias J. Sax <mj...@apache.org> wrote: >>>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> I did not pay attention to this KIP in the past; seems it was >> on-hold for a while. >>>>>>>> >>>>>>>> Overall it sounds very useful, and I think we should extend this >> with a follow up KIP for Kafka Streams. What is unclear to me at this point >> is the statement: >>>>>>>> >>>>>>>>> Kafka Streams applications have an application.id configured and >> this identifier should be included as the application_id metrics label. >>>>>>>> >>>>>>>> The `application.id` is currently only used as the (main) >> consumer's `group.id` (and is part of an auto-generated `client.id` if >> the user does not set one). >>>>>>>> >>>>>>>> This comment related to: >>>>>>>> >>>>>>>>> The following labels should be added by the client as appropriate >> before metrics are pushed. >>>>>>>> >>>>>>>> Given that Kafka Streams uses the consumer/producer/admin client as >> "black boxes", a client does at this point not know that it's part of a >> Kafka Streams application, and thus, it won't be able to attach any such >> label to the metrics it sends. (Also producer and admin don't even know the >> value of `application.id` -- only the (main) consumer, indirectly via ` >> group.id`, but also restore and global consumer don't know it, because >> they don't have `group.id` set). >>>>>>>> >>>>>>>> While I am totally in favor of the proposal, I am wondering how we >> intent to implement it in clean way? Or would we do ok to have some >> internal client APIs that KS can use to "register" itself with the client? >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> While clients must support both temporalities, the broker will >> initially only send GetTelemetrySubscriptionsResponse.DeltaTemporality=True >>>>>>>> >>>>>>>> Not sure if I can follow. How make the decision about DELTA or >> CUMULATIVE metrics? Should the broker side plugin not decide what metrics >> it what to receive in which form? So what does "initially" mean -- the >> broker won't ship with a default plugin implementation? >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> The following method is added to the Producer, Consumer, and Admin >> client interfaces: >>>>>>>> >>>>>>>> Should we add anything to Kafka Streams to expose the underlying >> clients' assigned client-instance-ids programmatically? I am also wondering >> if clients should report their assigned client-instance-ids as metrics >> itself (for this case, Kafka Streams won't need to do anything, because we >> already expose all client metrics). >>>>>>>> >>>>>>>> If we add anything programmatic, we need to make it simple, given >> that Kafka Streams has many clients per `StreamThread` and may have >> multiple threads. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> enable.metrics.push >>>>>>>> It might be worth to add this to `StreamsConfig`, too? It set via >> StreamsConfig, we would forward it to all clients automatically. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -Matthias >>>>>>>> >>>>>>>> >>>>>>>> On 9/29/23 5:45 PM, David Jacot wrote: >>>>>>>>> Hi Andrew, >>>>>>>>> Thanks for driving this one. I haven't read all the KIP yet but I >> already >>>>>>>>> have an initial question. In the Threading section, it is written >>>>>>>>> "KafkaConsumer: the "background" thread (based on the consumer >> threading >>>>>>>>> refactor which is underway)". If I understand this correctly, it >> means >>>>>>>>> that KIP-714 won't work if the "old consumer" is used. Am I >> correct? >>>>>>>>> Cheers, >>>>>>>>> David >>>>>>>>> On Fri, Sep 22, 2023 at 12:18 PM Andrew Schofield < >>>>>>>>> andrew_schofield_j...@outlook.com> wrote: >>>>>>>>>> Hi Philip, >>>>>>>>>> No, I do not think it should actively search for a broker that >> supports >>>>>>>>>> the new >>>>>>>>>> RPCs. In general, either all of the brokers or none of the >> brokers will >>>>>>>>>> support it. >>>>>>>>>> In the window, where the cluster is being upgraded or client >> telemetry is >>>>>>>>>> being >>>>>>>>>> enabled, there might be a mixed situation. I wouldn’t put too >> much effort >>>>>>>>>> into >>>>>>>>>> this mixed scenario. As the client finds brokers which support >> the new >>>>>>>>>> RPCs, >>>>>>>>>> it can begin to follow the KIP-714 mechanism. >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Andrew >>>>>>>>>> >>>>>>>>>>> On 22 Sep 2023, at 20:01, Philip Nee <philip...@gmail.com> >> wrote: >>>>>>>>>>> >>>>>>>>>>> Hi Andrew - >>>>>>>>>>> >>>>>>>>>>> Question on top of your answers: Do you think the client should >> actively >>>>>>>>>>> search for a broker that supports this RPC? As previously >> mentioned, the >>>>>>>>>>> broker uses the leastLoadedNode to find its first connection (am >>>>>>>>>>> I correct?), and what if that broker doesn't support the metric >> push? >>>>>>>>>>> >>>>>>>>>>> P >>>>>>>>>>> >>>>>>>>>>> On Fri, Sep 22, 2023 at 10:20 AM Andrew Schofield < >>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi Kirk, >>>>>>>>>>>> Thanks for your question. You are correct that the presence or >> absence >>>>>>>>>> of >>>>>>>>>>>> the new RPCs in the >>>>>>>>>>>> ApiVersionsResponse tells the client whether to request the >> telemetry >>>>>>>>>>>> subscriptions and push >>>>>>>>>>>> metrics. >>>>>>>>>>>> >>>>>>>>>>>> This is of course tricky in practice. It would be conceivable, >> as a >>>>>>>>>>>> cluster is upgraded to AK 3.7 >>>>>>>>>>>> or as a client metrics receiver plugin is deployed across the >> cluster, >>>>>>>>>>>> that a client connects to some >>>>>>>>>>>> brokers that support the new RPCs and some that do not. >>>>>>>>>>>> >>>>>>>>>>>> Here’s my suggestion: >>>>>>>>>>>> * If a client is not connected to any brokers that support in >> the new >>>>>>>>>>>> RPCs, it cannot push metrics. >>>>>>>>>>>> * If a client is only connected to brokers that support the new >> RPCs, it >>>>>>>>>>>> will use the new RPCs in >>>>>>>>>>>> accordance with the KIP. >>>>>>>>>>>> * If a client is connected to some brokers that support the new >> RPCs and >>>>>>>>>>>> some that do not, it will >>>>>>>>>>>> use the new RPCs with the supporting subset of brokers in >> accordance >>>>>>>>>> with >>>>>>>>>>>> the KIP. >>>>>>>>>>>> >>>>>>>>>>>> Comments? >>>>>>>>>>>> >>>>>>>>>>>> Thanks, >>>>>>>>>>>> Andrew >>>>>>>>>>>> >>>>>>>>>>>>> On 22 Sep 2023, at 16:01, Kirk True <k...@kirktrue.pro> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> Hi Andrew/Jun, >>>>>>>>>>>>> >>>>>>>>>>>>> I want to make sure I understand question/comment #119… In the >> case >>>>>>>>>>>> where a cluster without a metrics client receiver is later >> reconfigured >>>>>>>>>> and >>>>>>>>>>>> restarted to include a metrics client receiver, do we want the >> client to >>>>>>>>>>>> thereafter begin pushing metrics to the cluster? From Andrew’s >> response >>>>>>>>>> to >>>>>>>>>>>> question #119, it sounds like we’re using the presence/absence >> of the >>>>>>>>>>>> relevant RPCs in ApiVersionsResponse as the >> to-push-or-not-to-push >>>>>>>>>>>> indicator. Do I have that correct? >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks, >>>>>>>>>>>>> Kirk >>>>>>>>>>>>> >>>>>>>>>>>>>> On Sep 21, 2023, at 7:42 AM, Andrew Schofield < >>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Jun, >>>>>>>>>>>>>> Thanks for your comments. I’ve updated the KIP to clarify >> where >>>>>>>>>>>> necessary. >>>>>>>>>>>>>> >>>>>>>>>>>>>> 110. Yes, agree. The motivation section mentions this. >>>>>>>>>>>>>> >>>>>>>>>>>>>> 111. The replacement of ‘-‘ with ‘.’ for metric names and the >>>>>>>>>>>> replacement of >>>>>>>>>>>>>> ‘-‘ with ‘_’ for attribute keys is following the OTLP >> guidelines. I >>>>>>>>>>>> think it’s a bit >>>>>>>>>>>>>> of a debatable point. OTLP makes a distinction between a >> namespace >>>>>>>>>> and a >>>>>>>>>>>>>> multi-word component. If it was “client.id” then “client” >> would be a >>>>>>>>>>>> namespace with >>>>>>>>>>>>>> an attribute key “id”. But “client_id” is just a key. So, it >> was >>>>>>>>>>>> intentional, but debatable. >>>>>>>>>>>>>> >>>>>>>>>>>>>> 112. Thanks. The link target moved. Fixed. >>>>>>>>>>>>>> >>>>>>>>>>>>>> 113. Thanks. Fixed. >>>>>>>>>>>>>> >>>>>>>>>>>>>> 114.1. If a standard metric makes sense for a client, it >> should use >>>>>>>>>> the >>>>>>>>>>>> exact same >>>>>>>>>>>>>> name. If a standard metric doesn’t make sense for a client, >> then it >>>>>>>>>> can >>>>>>>>>>>> omit that metric. >>>>>>>>>>>>>> >>>>>>>>>>>>>> For a required metric, the situation is stronger. All clients >> must >>>>>>>>>>>> implement these >>>>>>>>>>>>>> metrics with these names in order to implement the KIP. But >> the >>>>>>>>>>>> required metrics >>>>>>>>>>>>>> are essentially the number of connections and the request >> latency, >>>>>>>>>>>> which do not >>>>>>>>>>>>>> reference the underlying implementation of the client (which >>>>>>>>>>>> producer.record.queue.time.max >>>>>>>>>>>>>> of course does). >>>>>>>>>>>>>> >>>>>>>>>>>>>> I suppose someone might build a producer-only client that >> didn’t have >>>>>>>>>>>> consumer metrics. >>>>>>>>>>>>>> In this case, the consumer metrics would conceptually have >> the value 0 >>>>>>>>>>>> and would not >>>>>>>>>>>>>> need to be sent to the broker. >>>>>>>>>>>>>> >>>>>>>>>>>>>> 114.2. If a client does not implement some metrics, they will >> not be >>>>>>>>>>>> available for >>>>>>>>>>>>>> analysis and troubleshooting. It just makes the ability to >> combine >>>>>>>>>>>> metrics from lots >>>>>>>>>>>>>> different clients less complete. >>>>>>>>>>>>>> >>>>>>>>>>>>>> 115. I think it was probably a mistake to be so specific about >>>>>>>>>>>> threading in this KIP. >>>>>>>>>>>>>> When the consumer threading refactor is complete, of course, >> it would >>>>>>>>>>>> do the appropriate >>>>>>>>>>>>>> equivalent. I’ve added a clarification and massively >> simplified this >>>>>>>>>>>> section. >>>>>>>>>>>>>> >>>>>>>>>>>>>> 116. I removed “client.terminating”. >>>>>>>>>>>>>> >>>>>>>>>>>>>> 117. Yes. Horrid. Fixed. >>>>>>>>>>>>>> >>>>>>>>>>>>>> 118. The Terminating flag just indicates that this is the >> final >>>>>>>>>>>> PushTelemetryRequest >>>>>>>>>>>>>> from this client. Any subsequent request will be rejected. I >> think >>>>>>>>>> this >>>>>>>>>>>> flag should remain. >>>>>>>>>>>>>> >>>>>>>>>>>>>> 119. Good catch. This was actually contradicting another part >> of the >>>>>>>>>>>> KIP. The current behaviour >>>>>>>>>>>>>> is indeed preserved. If the broker doesn’t have a client >> metrics >>>>>>>>>>>> receiver plugin, the new RPCs >>>>>>>>>>>>>> in this KIP are “turned off” and not reported in >> ApiVersionsResponse. >>>>>>>>>>>> The client will not >>>>>>>>>>>>>> attempt to push metrics. >>>>>>>>>>>>>> >>>>>>>>>>>>>> 120. The error handling table lists the error codes for >>>>>>>>>>>> PushTelemetryResponse. I’ve added one >>>>>>>>>>>>>> but it looked good to me. GetTelemetrySubscriptions doesn’t >> have any >>>>>>>>>>>> error codes, since the >>>>>>>>>>>>>> situation in which the client telemetry is not supported is >> handled by >>>>>>>>>>>> the RPCs not being offered >>>>>>>>>>>>>> by the broker. >>>>>>>>>>>>>> >>>>>>>>>>>>>> 121. Again, I think it’s probably a mistake to be specific >> about >>>>>>>>>>>> threading. Removed. >>>>>>>>>>>>>> >>>>>>>>>>>>>> 122. Good catch. For DescribeConfigs, the ACL operation >> should be >>>>>>>>>>>>>> “DESCRIBE_CONFIGS”. For AlterConfigs, the ACL operation >> should be >>>>>>>>>>>>>> “ALTER” (not “WRITE” as it said). The checks are made on the >> CLUSTER >>>>>>>>>>>>>> resource. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks for the detailed review. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>> Andrew >>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 110. Another potential motivation is the multiple clients >> support. >>>>>>>>>>>> Some of >>>>>>>>>>>>>>> the places may not have good monitoring support for non-java >> clients. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 111. OpenTelemetry Naming: We replace '-' with '.' for >> metric name >>>>>>>>>> and >>>>>>>>>>>>>>> replace '-' with '_' for attributes. Why is the >> inconsistency? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 112. OTLP specification: Page is not found from the link. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 113. "Defining standard and required metrics makes the >> monitoring and >>>>>>>>>>>>>>> troubleshooting of clients from various client types ": >> Incomplete >>>>>>>>>>>> sentence. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 114. standard/required metrics >>>>>>>>>>>>>>> 114.1 Do other clients need to implement those metrics with >> the exact >>>>>>>>>>>> same >>>>>>>>>>>>>>> names? >>>>>>>>>>>>>>> 114.2 What happens if some of those metrics are missing from >> a >>>>>>>>>> client? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 115. "KafkaConsumer: both the "heart beat" and application >> threads": >>>>>>>>>> We >>>>>>>>>>>>>>> have an ongoing effort to refactor the consumer threading >> model ( >>>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>> >> https://cwiki.apache.org/confluence/display/KAFKA/Consumer+threading+refactor+design >>>>>>>>>>>> ). >>>>>>>>>>>>>>> Once this is done, PRC requests will only be made from the >> background >>>>>>>>>>>>>>> thread. Should this KIP follow the new model only? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 116. 'The metrics should contain the reason for the client >>>>>>>>>> termination >>>>>>>>>>>> by >>>>>>>>>>>>>>> including the client.terminating metric with the label >> “reason” ...'. >>>>>>>>>>>> Hmm, >>>>>>>>>>>>>>> are we introducing a new metric client.terminating? If so, >> that needs >>>>>>>>>>>> to be >>>>>>>>>>>>>>> explicitly listed. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 117. "As the metrics plugin may need to add additional >> metrics on top >>>>>>>>>>>> of >>>>>>>>>>>>>>> this the generic metrics receiver in the broker will not add >> these >>>>>>>>>>>> labels >>>>>>>>>>>>>>> but rely on the plugins to do so," The sentence doesn't read >> well. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 118. "it is possible for the client to send at most one >> accepted >>>>>>>>>>>>>>> out-of-profile per connection before the rate-limiter kicks >> in": If >>>>>>>>>> we >>>>>>>>>>>> do >>>>>>>>>>>>>>> this, do we still need the Terminating flag in >>>>>>>>>> PushTelemetryRequestV0? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 119. "If there is no client metrics receiver plugin >> configured on the >>>>>>>>>>>>>>> broker, it will respond to GetTelemetrySubscriptionsRequest >> with >>>>>>>>>>>>>>> RequestedMetrics set to Null and a -1 SubscriptionId. The >> client >>>>>>>>>> should >>>>>>>>>>>>>>> send a new GetTelemetrySubscriptionsRequest after the >> PushIntervalMs >>>>>>>>>>>> has >>>>>>>>>>>>>>> expired. This allows the metrics receiver to be enabled or >> disabled >>>>>>>>>>>> without >>>>>>>>>>>>>>> having to restart the broker or reset the client connection." >>>>>>>>>>>>>>> "no client metrics receiver plugin configured" is defined by >> no >>>>>>>>>> metric >>>>>>>>>>>>>>> reporter implementing the ClientTelemetry interface, right? >> In that >>>>>>>>>>>> case, >>>>>>>>>>>>>>> it would be useful to avoid the clients sending >>>>>>>>>>>>>>> GetTelemetrySubscriptionsRequest periodically to preserve >> the current >>>>>>>>>>>>>>> behavior. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 120. GetTelemetrySubscriptionsResponseV0 and >> PushTelemetryRequestV0: >>>>>>>>>>>> Could >>>>>>>>>>>>>>> we list error codes for each? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 121. "ClientTelemetryReceiver.ClientTelemetryReceiver This >> method may >>>>>>>>>>>> be >>>>>>>>>>>>>>> called from the request handling thread": Where else can >> this method >>>>>>>>>> be >>>>>>>>>>>>>>> called? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 122. DescribeConfigs/AlterConfigs already exist. Are we >> changing the >>>>>>>>>>>> ACL? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Jun >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Mon, Jul 31, 2023 at 4:33 AM Andrew Schofield < >>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi Milind, >>>>>>>>>>>>>>>> Thanks for your question. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On reflection, I agree that INVALID_RECORD is most likely >> to be >>>>>>>>>>>> caused by a >>>>>>>>>>>>>>>> problem in the serialization in the client. I have changed >> the >>>>>>>>>> client >>>>>>>>>>>>>>>> action in this case >>>>>>>>>>>>>>>> to “Log an error and stop pushing metrics”. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I have updated the KIP text accordingly. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>> Andrew >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On 31 Jul 2023, at 12:09, Milind Luthra >>>>>>>>>>>> <milut...@confluent.io.INVALID> >>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi Andrew, >>>>>>>>>>>>>>>>> Thanks for the clarifications. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> About 2b: >>>>>>>>>>>>>>>>> In case a client has a bug while serializing, it might be >> difficult >>>>>>>>>>>> for >>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>> client to recover from that without code changes. In that, >> it might >>>>>>>>>>>> be >>>>>>>>>>>>>>>> good >>>>>>>>>>>>>>>>> to just log the INVALID_RECORD as an error, and treat the >> error as >>>>>>>>>>>> fatal >>>>>>>>>>>>>>>>> for the client (only fatal in terms of sending the >> metrics, the >>>>>>>>>>>> client >>>>>>>>>>>>>>>> can >>>>>>>>>>>>>>>>> keep functioning otherwise). What do you think? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>> Milind >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Mon, Jul 24, 2023 at 8:18 PM Andrew Schofield < >>>>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Hi Milind, >>>>>>>>>>>>>>>>>> Thanks for your questions about the KIP. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> 1) I did some archaeology and looked at historical >> versions of the >>>>>>>>>>>> KIP. >>>>>>>>>>>>>>>> I >>>>>>>>>>>>>>>>>> think this is >>>>>>>>>>>>>>>>>> just a mistake. 5 minutes is the default metric push >> interval. 30 >>>>>>>>>>>>>>>> minutes >>>>>>>>>>>>>>>>>> is a mystery >>>>>>>>>>>>>>>>>> to me. I’ve updated the KIP. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> 2) I think there are two situations in which >> INVALID_RECORD might >>>>>>>>>>>> occur. >>>>>>>>>>>>>>>>>> a) The client might perhaps be using a content-type that >> the >>>>>>>>>> broker >>>>>>>>>>>> does >>>>>>>>>>>>>>>>>> not support. >>>>>>>>>>>>>>>>>> The KIP mentions content-type as a future extension, but >> there’s >>>>>>>>>>>> only >>>>>>>>>>>>>>>> one >>>>>>>>>>>>>>>>>> supported >>>>>>>>>>>>>>>>>> to start with. Until we have multiple content-types, this >> seems >>>>>>>>>> out >>>>>>>>>>>> of >>>>>>>>>>>>>>>>>> scope. I think a >>>>>>>>>>>>>>>>>> future KIP would add another error code for this. >>>>>>>>>>>>>>>>>> b) The client might perhaps have a bug which means the >> metrics >>>>>>>>>>>> payload >>>>>>>>>>>>>>>> is >>>>>>>>>>>>>>>>>> malformed. >>>>>>>>>>>>>>>>>> Logging a warning and attempting the next metrics push on >> the push >>>>>>>>>>>>>>>>>> interval seems >>>>>>>>>>>>>>>>>> appropriate. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> UNKNOWN_SUBSCRIPTION_ID would indeed be handled by making >> an >>>>>>>>>>>> immediate >>>>>>>>>>>>>>>>>> GetTelemetrySubscriptionsRequest. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> UNSUPPORTED_COMPRESSION_TYPE seems like either a client >> bug or >>>>>>>>>>>> perhaps >>>>>>>>>>>>>>>>>> a situation in which a broker sends a compression type in >> a >>>>>>>>>>>>>>>>>> GetTelemetrySubscriptionsResponse >>>>>>>>>>>>>>>>>> which is subsequently not supported when its used with a >>>>>>>>>>>>>>>>>> PushTelemetryRequest. >>>>>>>>>>>>>>>>>> We do want the client to have the opportunity to get an >> up-to-date >>>>>>>>>>>> list >>>>>>>>>>>>>>>> of >>>>>>>>>>>>>>>>>> supported >>>>>>>>>>>>>>>>>> compression types. I think an immediate >>>>>>>>>>>> GetTelemetrySubscriptionsRequest >>>>>>>>>>>>>>>>>> is appropriate. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> 3) If a client attempts a subsequent handshake with a Null >>>>>>>>>>>>>>>>>> ClientInstanceId, the >>>>>>>>>>>>>>>>>> receiving broker may not already know the client's >> existing >>>>>>>>>>>>>>>>>> ClientInstanceId. If the >>>>>>>>>>>>>>>>>> receiving broker knows the existing ClientInstanceId, it >> simply >>>>>>>>>>>> responds >>>>>>>>>>>>>>>>>> the existing >>>>>>>>>>>>>>>>>> value back to the client. If it does not know the existing >>>>>>>>>>>>>>>>>> ClientInstanceId, it will create >>>>>>>>>>>>>>>>>> a new client instance ID and respond with that. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I will update the KIP with these clarifications. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>> Andrew >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On 17 Jul 2023, at 14:21, Milind Luthra >>>>>>>>>>>> <milut...@confluent.io.INVALID >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Hi Andrew, thanks for this KIP. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I had a few questions regarding the "Error handling" >> section. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> 1. It mentions that "The 5 and 30 minute retries are to >>>>>>>>>> eventually >>>>>>>>>>>>>>>>>> trigger >>>>>>>>>>>>>>>>>>> a retry and avoid having to restart clients if the >> cluster >>>>>>>>>> metrics >>>>>>>>>>>>>>>>>>> configuration is disabled temporarily, e.g., by operator >> error, >>>>>>>>>>>> rolling >>>>>>>>>>>>>>>>>>> upgrades, etc." >>>>>>>>>>>>>>>>>>> But this 30 min interval isn't mentioned anywhere else. >> What is >>>>>>>>>> it >>>>>>>>>>>>>>>>>>> referring to? >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> 2. For the actual errors: >>>>>>>>>>>>>>>>>>> INVALID_RECORD : The action required is to "Log a >> warning to the >>>>>>>>>>>>>>>>>>> application and schedule the next >>>>>>>>>> GetTelemetrySubscriptionsRequest >>>>>>>>>>>> to 5 >>>>>>>>>>>>>>>>>>> minutes". Why is this 5 minutes, and not something like >>>>>>>>>>>> PushIntervalMs? >>>>>>>>>>>>>>>>>> And >>>>>>>>>>>>>>>>>>> also, why are we scheduling a >> GetTelemetrySubscriptionsRequest in >>>>>>>>>>>> this >>>>>>>>>>>>>>>>>>> case, if the serialization is broken? >>>>>>>>>>>>>>>>>>> UNKNOWN_SUBSCRIPTION_ID , UNSUPPORTED_COMPRESSION_TYPE : >> just to >>>>>>>>>>>>>>>> confirm, >>>>>>>>>>>>>>>>>>> the GetTelemetrySubscriptionsRequest needs to be >> scheduled >>>>>>>>>>>> immediately >>>>>>>>>>>>>>>>>>> after the PushTelemetry response, correct? >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> 3. For "Subsequent GetTelemetrySubscriptionsRequests >> must include >>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>> ClientInstanceId returned in the first response, >> regardless of >>>>>>>>>>>> broker": >>>>>>>>>>>>>>>>>>> Will a broker error be returned in case some >> implementation of >>>>>>>>>>>> this KIP >>>>>>>>>>>>>>>>>>> violates this accidentally and sends a request with >>>>>>>>>>>> ClientInstanceId = >>>>>>>>>>>>>>>>>> Null >>>>>>>>>>>>>>>>>>> even when it's been obtained already? Or will a new >>>>>>>>>>>> ClientInstanceId be >>>>>>>>>>>>>>>>>>> returned without an error? >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Thanks! >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield < >>>>>>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>>>>>> I would like to start a new discussion thread on >> KIP-714: Client >>>>>>>>>>>>>>>> metrics >>>>>>>>>>>>>>>>>>>> and observability. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> I have edited the proposal significantly to reduce the >> scope. >>>>>>>>>> The >>>>>>>>>>>>>>>>>> overall >>>>>>>>>>>>>>>>>>>> mechanism for client metric subscriptions is unchanged, >> but the >>>>>>>>>>>>>>>>>>>> KIP is now based on the existing client metrics, rather >> than >>>>>>>>>>>>>>>> introducing >>>>>>>>>>>>>>>>>>>> new metrics. The purpose remains helping cluster >> operators >>>>>>>>>>>>>>>>>>>> investigate performance problems experienced by clients >> without >>>>>>>>>>>>>>>>>> requiring >>>>>>>>>>>>>>>>>>>> changes to the client application code or configuration. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>> Andrew >>> >>> >>