Matthias, Yes, I think that’s a sensible way forward and the interface you propose looks good. I’ll update the KIP accordingly.
Thanks, Andrew > On 10 Oct 2023, at 23:01, Matthias J. Sax <mj...@apache.org> wrote: > > Andrew, > > yes I would like to get this change into KIP-714 right way. Seems to be > important, as we don't know if/when a follow-up KIP for Kafka Streams would > land. > > I was also thinking (and discussed with a few others) how to expose it, and > we would propose the following: > > We add a new method to `KafkaStreams` class: > > public ClientsInstanceIds clientsInstanceIds(Duration timeout); > > The returned object is like below: > > public class ClientsInstanceIds { > // we only have a single admin client per KS instance > String adminInstanceId(); > > // we only have a single global consumer per KS instance (if any) > // Optional<> because we might not have global-thread > Optional<String> globalConsumerInstanceId(); > > // return a <threadKey -> ClientInstanceId> mapping > // for the underlying (restore-)consumers/producers > Map<String, String> mainConsumerInstanceIds(); > Map<String, String> restoreConsumerInstanceIds(); > Map<String, String> producerInstanceIds(); > } > > For the `threadKey`, we would use some pattern like this: > > [Stream|StateUpdater]Thread-<threadIdx> > > > Would this work from your POV? > > > > -Matthias > > > On 10/9/23 2:15 AM, Andrew Schofield wrote: >> Hi Matthias, >> Good point. Makes sense to me. >> Is this something that can also be included in the proposed Kafka Streams >> follow-on KIP, or would you prefer that I add it to KIP-714? >> I have a slight preference for the former to put all of the KS enhancements >> into a separate KIP. >> Thanks, >> Andrew >>> On 7 Oct 2023, at 02:12, Matthias J. Sax <mj...@apache.org> wrote: >>> >>> Thanks Andrew. SGTM. >>> >>> One point you did not address is the idea to add a method to `KafkaStreams` >>> similar to the proposed `clientInstanceId()` that will be added to >>> consumer/producer/admin clients. >>> >>> Without addressing this, Kafka Streams users won't have a way to get the >>> assigned `instanceId` of the internally created clients, and thus it would >>> be very difficult for them to know which metrics that the broker receives >>> belong to a Kafka Streams app. It seems they would only find the >>> `instanceIds` in the log4j output if they enable client logging? >>> >>> Of course, because there is multiple clients inside Kafka Streams, the >>> return type cannot be an single "String", but must be some some complex >>> data structure -- we could either add a new class, or return a >>> Map<String,String> using a client key that maps to the `instanceId`. >>> >>> For example we could use the following key: >>> >>> [Global]StreamThread[-<threadIndex>][-restore][consumer|producer] >>> >>> (Of course, only the valid combination.) >>> >>> Or maybe even better, we might want to return a `Future` because collection >>> all the `instanceId` might be a blocking all on each client? I have already >>> a few idea how it could be implemented but I don't think it must be >>> discussed on the KIP, as it's an implementation detail. >>> >>> Thoughts? >>> >>> >>> -Matthias >>> >>> On 10/6/23 4:21 AM, Andrew Schofield wrote: >>>> Hi Matthias, >>>> Thanks for your comments. I agree that a follow-up KIP for Kafka Streams >>>> makes sense. This KIP currently has made a bit >>>> of an effort to embrace KS, but it’s not enough by a long way. >>>> I have removed `application.id <http://application.id/>`. This should be >>>> done properly in the follow-up KIP. I don’t believe there’s a downside to >>>> removing it from this KIP. >>>> I have reworded the statement about temporarily. In practice, the >>>> implementation of this KIP that’s going on while the voting >>>> progresses happens to use delta temporality, but that’s an implementation >>>> detail. Supporting clients must support both >>>> temporalities. >>>> I thought about exposing the client instance ID as a metric, but >>>> non-numeric metrics are not usual practice and tools >>>> do not universally support them. I don’t think the KIP is improved by >>>> adding one now. >>>> I have also added constants for the various Config classes for >>>> ENABLE_METRICS_PUSH_CONFIG, including to >>>> StreamsConfig. It’s best to be explicit about this. >>>> Thanks, >>>> Andrew >>>>> On 2 Oct 2023, at 23:47, Matthias J. Sax <mj...@apache.org> wrote: >>>>> >>>>> Hi, >>>>> >>>>> I did not pay attention to this KIP in the past; seems it was on-hold for >>>>> a while. >>>>> >>>>> Overall it sounds very useful, and I think we should extend this with a >>>>> follow up KIP for Kafka Streams. What is unclear to me at this point is >>>>> the statement: >>>>> >>>>>> Kafka Streams applications have an application.id configured and this >>>>>> identifier should be included as the application_id metrics label. >>>>> >>>>> The `application.id` is currently only used as the (main) consumer's >>>>> `group.id` (and is part of an auto-generated `client.id` if the user does >>>>> not set one). >>>>> >>>>> This comment related to: >>>>> >>>>>> The following labels should be added by the client as appropriate before >>>>>> metrics are pushed. >>>>> >>>>> Given that Kafka Streams uses the consumer/producer/admin client as >>>>> "black boxes", a client does at this point not know that it's part of a >>>>> Kafka Streams application, and thus, it won't be able to attach any such >>>>> label to the metrics it sends. (Also producer and admin don't even know >>>>> the value of `application.id` -- only the (main) consumer, indirectly via >>>>> `group.id`, but also restore and global consumer don't know it, because >>>>> they don't have `group.id` set). >>>>> >>>>> While I am totally in favor of the proposal, I am wondering how we intent >>>>> to implement it in clean way? Or would we do ok to have some internal >>>>> client APIs that KS can use to "register" itself with the client? >>>>> >>>>> >>>>> >>>>>> While clients must support both temporalities, the broker will initially >>>>>> only send GetTelemetrySubscriptionsResponse.DeltaTemporality=True >>>>> >>>>> Not sure if I can follow. How make the decision about DELTA or CUMULATIVE >>>>> metrics? Should the broker side plugin not decide what metrics it what to >>>>> receive in which form? So what does "initially" mean -- the broker won't >>>>> ship with a default plugin implementation? >>>>> >>>>> >>>>> >>>>>> The following method is added to the Producer, Consumer, and Admin >>>>>> client interfaces: >>>>> >>>>> Should we add anything to Kafka Streams to expose the underlying clients' >>>>> assigned client-instance-ids programmatically? I am also wondering if >>>>> clients should report their assigned client-instance-ids as metrics >>>>> itself (for this case, Kafka Streams won't need to do anything, because >>>>> we already expose all client metrics). >>>>> >>>>> If we add anything programmatic, we need to make it simple, given that >>>>> Kafka Streams has many clients per `StreamThread` and may have multiple >>>>> threads. >>>>> >>>>> >>>>> >>>>>> enable.metrics.push >>>>> It might be worth to add this to `StreamsConfig`, too? It set via >>>>> StreamsConfig, we would forward it to all clients automatically. >>>>> >>>>> >>>>> >>>>> >>>>> -Matthias >>>>> >>>>> >>>>> On 9/29/23 5:45 PM, David Jacot wrote: >>>>>> Hi Andrew, >>>>>> Thanks for driving this one. I haven't read all the KIP yet but I already >>>>>> have an initial question. In the Threading section, it is written >>>>>> "KafkaConsumer: the "background" thread (based on the consumer threading >>>>>> refactor which is underway)". If I understand this correctly, it means >>>>>> that KIP-714 won't work if the "old consumer" is used. Am I correct? >>>>>> Cheers, >>>>>> David >>>>>> On Fri, Sep 22, 2023 at 12:18 PM Andrew Schofield < >>>>>> andrew_schofield_j...@outlook.com> wrote: >>>>>>> Hi Philip, >>>>>>> No, I do not think it should actively search for a broker that supports >>>>>>> the new >>>>>>> RPCs. In general, either all of the brokers or none of the brokers will >>>>>>> support it. >>>>>>> In the window, where the cluster is being upgraded or client telemetry >>>>>>> is >>>>>>> being >>>>>>> enabled, there might be a mixed situation. I wouldn’t put too much >>>>>>> effort >>>>>>> into >>>>>>> this mixed scenario. As the client finds brokers which support the new >>>>>>> RPCs, >>>>>>> it can begin to follow the KIP-714 mechanism. >>>>>>> >>>>>>> Thanks, >>>>>>> Andrew >>>>>>> >>>>>>>> On 22 Sep 2023, at 20:01, Philip Nee <philip...@gmail.com> wrote: >>>>>>>> >>>>>>>> Hi Andrew - >>>>>>>> >>>>>>>> Question on top of your answers: Do you think the client should >>>>>>>> actively >>>>>>>> search for a broker that supports this RPC? As previously mentioned, >>>>>>>> the >>>>>>>> broker uses the leastLoadedNode to find its first connection (am >>>>>>>> I correct?), and what if that broker doesn't support the metric push? >>>>>>>> >>>>>>>> P >>>>>>>> >>>>>>>> On Fri, Sep 22, 2023 at 10:20 AM Andrew Schofield < >>>>>>>> andrew_schofield_j...@outlook.com> wrote: >>>>>>>> >>>>>>>>> Hi Kirk, >>>>>>>>> Thanks for your question. You are correct that the presence or absence >>>>>>> of >>>>>>>>> the new RPCs in the >>>>>>>>> ApiVersionsResponse tells the client whether to request the telemetry >>>>>>>>> subscriptions and push >>>>>>>>> metrics. >>>>>>>>> >>>>>>>>> This is of course tricky in practice. It would be conceivable, as a >>>>>>>>> cluster is upgraded to AK 3.7 >>>>>>>>> or as a client metrics receiver plugin is deployed across the cluster, >>>>>>>>> that a client connects to some >>>>>>>>> brokers that support the new RPCs and some that do not. >>>>>>>>> >>>>>>>>> Here’s my suggestion: >>>>>>>>> * If a client is not connected to any brokers that support in the new >>>>>>>>> RPCs, it cannot push metrics. >>>>>>>>> * If a client is only connected to brokers that support the new RPCs, >>>>>>>>> it >>>>>>>>> will use the new RPCs in >>>>>>>>> accordance with the KIP. >>>>>>>>> * If a client is connected to some brokers that support the new RPCs >>>>>>>>> and >>>>>>>>> some that do not, it will >>>>>>>>> use the new RPCs with the supporting subset of brokers in accordance >>>>>>> with >>>>>>>>> the KIP. >>>>>>>>> >>>>>>>>> Comments? >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Andrew >>>>>>>>> >>>>>>>>>> On 22 Sep 2023, at 16:01, Kirk True <k...@kirktrue.pro> wrote: >>>>>>>>>> >>>>>>>>>> Hi Andrew/Jun, >>>>>>>>>> >>>>>>>>>> I want to make sure I understand question/comment #119… In the case >>>>>>>>> where a cluster without a metrics client receiver is later >>>>>>>>> reconfigured >>>>>>> and >>>>>>>>> restarted to include a metrics client receiver, do we want the client >>>>>>>>> to >>>>>>>>> thereafter begin pushing metrics to the cluster? From Andrew’s >>>>>>>>> response >>>>>>> to >>>>>>>>> question #119, it sounds like we’re using the presence/absence of the >>>>>>>>> relevant RPCs in ApiVersionsResponse as the to-push-or-not-to-push >>>>>>>>> indicator. Do I have that correct? >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Kirk >>>>>>>>>> >>>>>>>>>>> On Sep 21, 2023, at 7:42 AM, Andrew Schofield < >>>>>>>>> andrew_schofield_j...@outlook.com> wrote: >>>>>>>>>>> >>>>>>>>>>> Hi Jun, >>>>>>>>>>> Thanks for your comments. I’ve updated the KIP to clarify where >>>>>>>>> necessary. >>>>>>>>>>> >>>>>>>>>>> 110. Yes, agree. The motivation section mentions this. >>>>>>>>>>> >>>>>>>>>>> 111. The replacement of ‘-‘ with ‘.’ for metric names and the >>>>>>>>> replacement of >>>>>>>>>>> ‘-‘ with ‘_’ for attribute keys is following the OTLP guidelines. I >>>>>>>>> think it’s a bit >>>>>>>>>>> of a debatable point. OTLP makes a distinction between a namespace >>>>>>> and a >>>>>>>>>>> multi-word component. If it was “client.id” then “client” would be a >>>>>>>>> namespace with >>>>>>>>>>> an attribute key “id”. But “client_id” is just a key. So, it was >>>>>>>>> intentional, but debatable. >>>>>>>>>>> >>>>>>>>>>> 112. Thanks. The link target moved. Fixed. >>>>>>>>>>> >>>>>>>>>>> 113. Thanks. Fixed. >>>>>>>>>>> >>>>>>>>>>> 114.1. If a standard metric makes sense for a client, it should use >>>>>>> the >>>>>>>>> exact same >>>>>>>>>>> name. If a standard metric doesn’t make sense for a client, then it >>>>>>> can >>>>>>>>> omit that metric. >>>>>>>>>>> >>>>>>>>>>> For a required metric, the situation is stronger. All clients must >>>>>>>>> implement these >>>>>>>>>>> metrics with these names in order to implement the KIP. But the >>>>>>>>> required metrics >>>>>>>>>>> are essentially the number of connections and the request latency, >>>>>>>>> which do not >>>>>>>>>>> reference the underlying implementation of the client (which >>>>>>>>> producer.record.queue.time.max >>>>>>>>>>> of course does). >>>>>>>>>>> >>>>>>>>>>> I suppose someone might build a producer-only client that didn’t >>>>>>>>>>> have >>>>>>>>> consumer metrics. >>>>>>>>>>> In this case, the consumer metrics would conceptually have the >>>>>>>>>>> value 0 >>>>>>>>> and would not >>>>>>>>>>> need to be sent to the broker. >>>>>>>>>>> >>>>>>>>>>> 114.2. If a client does not implement some metrics, they will not be >>>>>>>>> available for >>>>>>>>>>> analysis and troubleshooting. It just makes the ability to combine >>>>>>>>> metrics from lots >>>>>>>>>>> different clients less complete. >>>>>>>>>>> >>>>>>>>>>> 115. I think it was probably a mistake to be so specific about >>>>>>>>> threading in this KIP. >>>>>>>>>>> When the consumer threading refactor is complete, of course, it >>>>>>>>>>> would >>>>>>>>> do the appropriate >>>>>>>>>>> equivalent. I’ve added a clarification and massively simplified this >>>>>>>>> section. >>>>>>>>>>> >>>>>>>>>>> 116. I removed “client.terminating”. >>>>>>>>>>> >>>>>>>>>>> 117. Yes. Horrid. Fixed. >>>>>>>>>>> >>>>>>>>>>> 118. The Terminating flag just indicates that this is the final >>>>>>>>> PushTelemetryRequest >>>>>>>>>>> from this client. Any subsequent request will be rejected. I think >>>>>>> this >>>>>>>>> flag should remain. >>>>>>>>>>> >>>>>>>>>>> 119. Good catch. This was actually contradicting another part of the >>>>>>>>> KIP. The current behaviour >>>>>>>>>>> is indeed preserved. If the broker doesn’t have a client metrics >>>>>>>>> receiver plugin, the new RPCs >>>>>>>>>>> in this KIP are “turned off” and not reported in >>>>>>>>>>> ApiVersionsResponse. >>>>>>>>> The client will not >>>>>>>>>>> attempt to push metrics. >>>>>>>>>>> >>>>>>>>>>> 120. The error handling table lists the error codes for >>>>>>>>> PushTelemetryResponse. I’ve added one >>>>>>>>>>> but it looked good to me. GetTelemetrySubscriptions doesn’t have any >>>>>>>>> error codes, since the >>>>>>>>>>> situation in which the client telemetry is not supported is handled >>>>>>>>>>> by >>>>>>>>> the RPCs not being offered >>>>>>>>>>> by the broker. >>>>>>>>>>> >>>>>>>>>>> 121. Again, I think it’s probably a mistake to be specific about >>>>>>>>> threading. Removed. >>>>>>>>>>> >>>>>>>>>>> 122. Good catch. For DescribeConfigs, the ACL operation should be >>>>>>>>>>> “DESCRIBE_CONFIGS”. For AlterConfigs, the ACL operation should be >>>>>>>>>>> “ALTER” (not “WRITE” as it said). The checks are made on the CLUSTER >>>>>>>>>>> resource. >>>>>>>>>>> >>>>>>>>>>> Thanks for the detailed review. >>>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>> Andrew >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> 110. Another potential motivation is the multiple clients support. >>>>>>>>> Some of >>>>>>>>>>>> the places may not have good monitoring support for non-java >>>>>>>>>>>> clients. >>>>>>>>>>>> >>>>>>>>>>>> 111. OpenTelemetry Naming: We replace '-' with '.' for metric name >>>>>>> and >>>>>>>>>>>> replace '-' with '_' for attributes. Why is the inconsistency? >>>>>>>>>>>> >>>>>>>>>>>> 112. OTLP specification: Page is not found from the link. >>>>>>>>>>>> >>>>>>>>>>>> 113. "Defining standard and required metrics makes the monitoring >>>>>>>>>>>> and >>>>>>>>>>>> troubleshooting of clients from various client types ": Incomplete >>>>>>>>> sentence. >>>>>>>>>>>> >>>>>>>>>>>> 114. standard/required metrics >>>>>>>>>>>> 114.1 Do other clients need to implement those metrics with the >>>>>>>>>>>> exact >>>>>>>>> same >>>>>>>>>>>> names? >>>>>>>>>>>> 114.2 What happens if some of those metrics are missing from a >>>>>>> client? >>>>>>>>>>>> >>>>>>>>>>>> 115. "KafkaConsumer: both the "heart beat" and application >>>>>>>>>>>> threads": >>>>>>> We >>>>>>>>>>>> have an ongoing effort to refactor the consumer threading model ( >>>>>>>>>>>> >>>>>>>>> >>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/Consumer+threading+refactor+design >>>>>>>>> ). >>>>>>>>>>>> Once this is done, PRC requests will only be made from the >>>>>>>>>>>> background >>>>>>>>>>>> thread. Should this KIP follow the new model only? >>>>>>>>>>>> >>>>>>>>>>>> 116. 'The metrics should contain the reason for the client >>>>>>> termination >>>>>>>>> by >>>>>>>>>>>> including the client.terminating metric with the label “reason” >>>>>>>>>>>> ...'. >>>>>>>>> Hmm, >>>>>>>>>>>> are we introducing a new metric client.terminating? If so, that >>>>>>>>>>>> needs >>>>>>>>> to be >>>>>>>>>>>> explicitly listed. >>>>>>>>>>>> >>>>>>>>>>>> 117. "As the metrics plugin may need to add additional metrics on >>>>>>>>>>>> top >>>>>>>>> of >>>>>>>>>>>> this the generic metrics receiver in the broker will not add these >>>>>>>>> labels >>>>>>>>>>>> but rely on the plugins to do so," The sentence doesn't read well. >>>>>>>>>>>> >>>>>>>>>>>> 118. "it is possible for the client to send at most one accepted >>>>>>>>>>>> out-of-profile per connection before the rate-limiter kicks in": If >>>>>>> we >>>>>>>>> do >>>>>>>>>>>> this, do we still need the Terminating flag in >>>>>>> PushTelemetryRequestV0? >>>>>>>>>>>> >>>>>>>>>>>> 119. "If there is no client metrics receiver plugin configured on >>>>>>>>>>>> the >>>>>>>>>>>> broker, it will respond to GetTelemetrySubscriptionsRequest with >>>>>>>>>>>> RequestedMetrics set to Null and a -1 SubscriptionId. The client >>>>>>> should >>>>>>>>>>>> send a new GetTelemetrySubscriptionsRequest after the >>>>>>>>>>>> PushIntervalMs >>>>>>>>> has >>>>>>>>>>>> expired. This allows the metrics receiver to be enabled or disabled >>>>>>>>> without >>>>>>>>>>>> having to restart the broker or reset the client connection." >>>>>>>>>>>> "no client metrics receiver plugin configured" is defined by no >>>>>>> metric >>>>>>>>>>>> reporter implementing the ClientTelemetry interface, right? In that >>>>>>>>> case, >>>>>>>>>>>> it would be useful to avoid the clients sending >>>>>>>>>>>> GetTelemetrySubscriptionsRequest periodically to preserve the >>>>>>>>>>>> current >>>>>>>>>>>> behavior. >>>>>>>>>>>> >>>>>>>>>>>> 120. GetTelemetrySubscriptionsResponseV0 and >>>>>>>>>>>> PushTelemetryRequestV0: >>>>>>>>> Could >>>>>>>>>>>> we list error codes for each? >>>>>>>>>>>> >>>>>>>>>>>> 121. "ClientTelemetryReceiver.ClientTelemetryReceiver This method >>>>>>>>>>>> may >>>>>>>>> be >>>>>>>>>>>> called from the request handling thread": Where else can this >>>>>>>>>>>> method >>>>>>> be >>>>>>>>>>>> called? >>>>>>>>>>>> >>>>>>>>>>>> 122. DescribeConfigs/AlterConfigs already exist. Are we changing >>>>>>>>>>>> the >>>>>>>>> ACL? >>>>>>>>>>>> >>>>>>>>>>>> Thanks, >>>>>>>>>>>> >>>>>>>>>>>> Jun >>>>>>>>>>>> >>>>>>>>>>>> On Mon, Jul 31, 2023 at 4:33 AM Andrew Schofield < >>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi Milind, >>>>>>>>>>>>> Thanks for your question. >>>>>>>>>>>>> >>>>>>>>>>>>> On reflection, I agree that INVALID_RECORD is most likely to be >>>>>>>>> caused by a >>>>>>>>>>>>> problem in the serialization in the client. I have changed the >>>>>>> client >>>>>>>>>>>>> action in this case >>>>>>>>>>>>> to “Log an error and stop pushing metrics”. >>>>>>>>>>>>> >>>>>>>>>>>>> I have updated the KIP text accordingly. >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks, >>>>>>>>>>>>> Andrew >>>>>>>>>>>>> >>>>>>>>>>>>>> On 31 Jul 2023, at 12:09, Milind Luthra >>>>>>>>> <milut...@confluent.io.INVALID> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Andrew, >>>>>>>>>>>>>> Thanks for the clarifications. >>>>>>>>>>>>>> >>>>>>>>>>>>>> About 2b: >>>>>>>>>>>>>> In case a client has a bug while serializing, it might be >>>>>>>>>>>>>> difficult >>>>>>>>> for >>>>>>>>>>>>> the >>>>>>>>>>>>>> client to recover from that without code changes. In that, it >>>>>>>>>>>>>> might >>>>>>>>> be >>>>>>>>>>>>> good >>>>>>>>>>>>>> to just log the INVALID_RECORD as an error, and treat the error >>>>>>>>>>>>>> as >>>>>>>>> fatal >>>>>>>>>>>>>> for the client (only fatal in terms of sending the metrics, the >>>>>>>>> client >>>>>>>>>>>>> can >>>>>>>>>>>>>> keep functioning otherwise). What do you think? >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>> Milind >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Mon, Jul 24, 2023 at 8:18 PM Andrew Schofield < >>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Milind, >>>>>>>>>>>>>>> Thanks for your questions about the KIP. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 1) I did some archaeology and looked at historical versions of >>>>>>>>>>>>>>> the >>>>>>>>> KIP. >>>>>>>>>>>>> I >>>>>>>>>>>>>>> think this is >>>>>>>>>>>>>>> just a mistake. 5 minutes is the default metric push interval. >>>>>>>>>>>>>>> 30 >>>>>>>>>>>>> minutes >>>>>>>>>>>>>>> is a mystery >>>>>>>>>>>>>>> to me. I’ve updated the KIP. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2) I think there are two situations in which INVALID_RECORD >>>>>>>>>>>>>>> might >>>>>>>>> occur. >>>>>>>>>>>>>>> a) The client might perhaps be using a content-type that the >>>>>>> broker >>>>>>>>> does >>>>>>>>>>>>>>> not support. >>>>>>>>>>>>>>> The KIP mentions content-type as a future extension, but there’s >>>>>>>>> only >>>>>>>>>>>>> one >>>>>>>>>>>>>>> supported >>>>>>>>>>>>>>> to start with. Until we have multiple content-types, this seems >>>>>>> out >>>>>>>>> of >>>>>>>>>>>>>>> scope. I think a >>>>>>>>>>>>>>> future KIP would add another error code for this. >>>>>>>>>>>>>>> b) The client might perhaps have a bug which means the metrics >>>>>>>>> payload >>>>>>>>>>>>> is >>>>>>>>>>>>>>> malformed. >>>>>>>>>>>>>>> Logging a warning and attempting the next metrics push on the >>>>>>>>>>>>>>> push >>>>>>>>>>>>>>> interval seems >>>>>>>>>>>>>>> appropriate. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> UNKNOWN_SUBSCRIPTION_ID would indeed be handled by making an >>>>>>>>> immediate >>>>>>>>>>>>>>> GetTelemetrySubscriptionsRequest. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> UNSUPPORTED_COMPRESSION_TYPE seems like either a client bug or >>>>>>>>> perhaps >>>>>>>>>>>>>>> a situation in which a broker sends a compression type in a >>>>>>>>>>>>>>> GetTelemetrySubscriptionsResponse >>>>>>>>>>>>>>> which is subsequently not supported when its used with a >>>>>>>>>>>>>>> PushTelemetryRequest. >>>>>>>>>>>>>>> We do want the client to have the opportunity to get an >>>>>>>>>>>>>>> up-to-date >>>>>>>>> list >>>>>>>>>>>>> of >>>>>>>>>>>>>>> supported >>>>>>>>>>>>>>> compression types. I think an immediate >>>>>>>>> GetTelemetrySubscriptionsRequest >>>>>>>>>>>>>>> is appropriate. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 3) If a client attempts a subsequent handshake with a Null >>>>>>>>>>>>>>> ClientInstanceId, the >>>>>>>>>>>>>>> receiving broker may not already know the client's existing >>>>>>>>>>>>>>> ClientInstanceId. If the >>>>>>>>>>>>>>> receiving broker knows the existing ClientInstanceId, it simply >>>>>>>>> responds >>>>>>>>>>>>>>> the existing >>>>>>>>>>>>>>> value back to the client. If it does not know the existing >>>>>>>>>>>>>>> ClientInstanceId, it will create >>>>>>>>>>>>>>> a new client instance ID and respond with that. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I will update the KIP with these clarifications. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>> Andrew >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On 17 Jul 2023, at 14:21, Milind Luthra >>>>>>>>> <milut...@confluent.io.INVALID >>>>>>>>>>>>>> >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi Andrew, thanks for this KIP. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I had a few questions regarding the "Error handling" section. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 1. It mentions that "The 5 and 30 minute retries are to >>>>>>> eventually >>>>>>>>>>>>>>> trigger >>>>>>>>>>>>>>>> a retry and avoid having to restart clients if the cluster >>>>>>> metrics >>>>>>>>>>>>>>>> configuration is disabled temporarily, e.g., by operator error, >>>>>>>>> rolling >>>>>>>>>>>>>>>> upgrades, etc." >>>>>>>>>>>>>>>> But this 30 min interval isn't mentioned anywhere else. What is >>>>>>> it >>>>>>>>>>>>>>>> referring to? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 2. For the actual errors: >>>>>>>>>>>>>>>> INVALID_RECORD : The action required is to "Log a warning to >>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>> application and schedule the next >>>>>>> GetTelemetrySubscriptionsRequest >>>>>>>>> to 5 >>>>>>>>>>>>>>>> minutes". Why is this 5 minutes, and not something like >>>>>>>>> PushIntervalMs? >>>>>>>>>>>>>>> And >>>>>>>>>>>>>>>> also, why are we scheduling a GetTelemetrySubscriptionsRequest >>>>>>>>>>>>>>>> in >>>>>>>>> this >>>>>>>>>>>>>>>> case, if the serialization is broken? >>>>>>>>>>>>>>>> UNKNOWN_SUBSCRIPTION_ID , UNSUPPORTED_COMPRESSION_TYPE : just >>>>>>>>>>>>>>>> to >>>>>>>>>>>>> confirm, >>>>>>>>>>>>>>>> the GetTelemetrySubscriptionsRequest needs to be scheduled >>>>>>>>> immediately >>>>>>>>>>>>>>>> after the PushTelemetry response, correct? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 3. For "Subsequent GetTelemetrySubscriptionsRequests must >>>>>>>>>>>>>>>> include >>>>>>>>> the >>>>>>>>>>>>>>>> ClientInstanceId returned in the first response, regardless of >>>>>>>>> broker": >>>>>>>>>>>>>>>> Will a broker error be returned in case some implementation of >>>>>>>>> this KIP >>>>>>>>>>>>>>>> violates this accidentally and sends a request with >>>>>>>>> ClientInstanceId = >>>>>>>>>>>>>>> Null >>>>>>>>>>>>>>>> even when it's been obtained already? Or will a new >>>>>>>>> ClientInstanceId be >>>>>>>>>>>>>>>> returned without an error? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks! >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield < >>>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>>> I would like to start a new discussion thread on KIP-714: >>>>>>>>>>>>>>>>> Client >>>>>>>>>>>>> metrics >>>>>>>>>>>>>>>>> and observability. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>> >>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I have edited the proposal significantly to reduce the scope. >>>>>>> The >>>>>>>>>>>>>>> overall >>>>>>>>>>>>>>>>> mechanism for client metric subscriptions is unchanged, but >>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>> KIP is now based on the existing client metrics, rather than >>>>>>>>>>>>> introducing >>>>>>>>>>>>>>>>> new metrics. The purpose remains helping cluster operators >>>>>>>>>>>>>>>>> investigate performance problems experienced by clients >>>>>>>>>>>>>>>>> without >>>>>>>>>>>>>>> requiring >>>>>>>>>>>>>>>>> changes to the client application code or configuration. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>> Andrew