Hi, Andrew, Thanks for the updated KIP. Just a few more minor comments.
130. KafkaStreams.clientsInstanceId(Duration timeout): Does it wait for all consumer/producer/adminClient instances to be initialized? Are all those instances created during KafkaStreams initialization? 131. Why does globalConsumerInstanceId() return Optional<String> while other consumer instances don't return Optional? 132. ClientMetricsSubscriptionRequestCount: Do we need this since we have a set of generic metrics (kafka.network:type=RequestMetrics,name=RequestsPerSec,request=*) that report Request rate for every request type? Thanks, Jun On Wed, Oct 11, 2023 at 1:47 PM Matthias J. Sax <mj...@apache.org> wrote: > Thanks! > > On 10/10/23 11:31 PM, Andrew Schofield wrote: > > Matthias, > > Yes, I think that’s a sensible way forward and the interface you propose > looks good. I’ll update the KIP accordingly. > > > > Thanks, > > Andrew > > > >> On 10 Oct 2023, at 23:01, Matthias J. Sax <mj...@apache.org> wrote: > >> > >> Andrew, > >> > >> yes I would like to get this change into KIP-714 right way. Seems to be > important, as we don't know if/when a follow-up KIP for Kafka Streams would > land. > >> > >> I was also thinking (and discussed with a few others) how to expose it, > and we would propose the following: > >> > >> We add a new method to `KafkaStreams` class: > >> > >> public ClientsInstanceIds clientsInstanceIds(Duration timeout); > >> > >> The returned object is like below: > >> > >> public class ClientsInstanceIds { > >> // we only have a single admin client per KS instance > >> String adminInstanceId(); > >> > >> // we only have a single global consumer per KS instance (if any) > >> // Optional<> because we might not have global-thread > >> Optional<String> globalConsumerInstanceId(); > >> > >> // return a <threadKey -> ClientInstanceId> mapping > >> // for the underlying (restore-)consumers/producers > >> Map<String, String> mainConsumerInstanceIds(); > >> Map<String, String> restoreConsumerInstanceIds(); > >> Map<String, String> producerInstanceIds(); > >> } > >> > >> For the `threadKey`, we would use some pattern like this: > >> > >> [Stream|StateUpdater]Thread-<threadIdx> > >> > >> > >> Would this work from your POV? > >> > >> > >> > >> -Matthias > >> > >> > >> On 10/9/23 2:15 AM, Andrew Schofield wrote: > >>> Hi Matthias, > >>> Good point. Makes sense to me. > >>> Is this something that can also be included in the proposed Kafka > Streams follow-on KIP, or would you prefer that I add it to KIP-714? > >>> I have a slight preference for the former to put all of the KS > enhancements into a separate KIP. > >>> Thanks, > >>> Andrew > >>>> On 7 Oct 2023, at 02:12, Matthias J. Sax <mj...@apache.org> wrote: > >>>> > >>>> Thanks Andrew. SGTM. > >>>> > >>>> One point you did not address is the idea to add a method to > `KafkaStreams` similar to the proposed `clientInstanceId()` that will be > added to consumer/producer/admin clients. > >>>> > >>>> Without addressing this, Kafka Streams users won't have a way to get > the assigned `instanceId` of the internally created clients, and thus it > would be very difficult for them to know which metrics that the broker > receives belong to a Kafka Streams app. It seems they would only find the > `instanceIds` in the log4j output if they enable client logging? > >>>> > >>>> Of course, because there is multiple clients inside Kafka Streams, > the return type cannot be an single "String", but must be some some complex > data structure -- we could either add a new class, or return a > Map<String,String> using a client key that maps to the `instanceId`. > >>>> > >>>> For example we could use the following key: > >>>> > >>>> [Global]StreamThread[-<threadIndex>][-restore][consumer|producer] > >>>> > >>>> (Of course, only the valid combination.) > >>>> > >>>> Or maybe even better, we might want to return a `Future` because > collection all the `instanceId` might be a blocking all on each client? I > have already a few idea how it could be implemented but I don't think it > must be discussed on the KIP, as it's an implementation detail. > >>>> > >>>> Thoughts? > >>>> > >>>> > >>>> -Matthias > >>>> > >>>> On 10/6/23 4:21 AM, Andrew Schofield wrote: > >>>>> Hi Matthias, > >>>>> Thanks for your comments. I agree that a follow-up KIP for Kafka > Streams makes sense. This KIP currently has made a bit > >>>>> of an effort to embrace KS, but it’s not enough by a long way. > >>>>> I have removed `application.id <http://application.id/>`. This > should be done properly in the follow-up KIP. I don’t believe there’s a > downside to > >>>>> removing it from this KIP. > >>>>> I have reworded the statement about temporarily. In practice, the > implementation of this KIP that’s going on while the voting > >>>>> progresses happens to use delta temporality, but that’s an > implementation detail. Supporting clients must support both > >>>>> temporalities. > >>>>> I thought about exposing the client instance ID as a metric, but > non-numeric metrics are not usual practice and tools > >>>>> do not universally support them. I don’t think the KIP is improved > by adding one now. > >>>>> I have also added constants for the various Config classes for > ENABLE_METRICS_PUSH_CONFIG, including to > >>>>> StreamsConfig. It’s best to be explicit about this. > >>>>> Thanks, > >>>>> Andrew > >>>>>> On 2 Oct 2023, at 23:47, Matthias J. Sax <mj...@apache.org> wrote: > >>>>>> > >>>>>> Hi, > >>>>>> > >>>>>> I did not pay attention to this KIP in the past; seems it was > on-hold for a while. > >>>>>> > >>>>>> Overall it sounds very useful, and I think we should extend this > with a follow up KIP for Kafka Streams. What is unclear to me at this point > is the statement: > >>>>>> > >>>>>>> Kafka Streams applications have an application.id configured and > this identifier should be included as the application_id metrics label. > >>>>>> > >>>>>> The `application.id` is currently only used as the (main) > consumer's `group.id` (and is part of an auto-generated `client.id` if > the user does not set one). > >>>>>> > >>>>>> This comment related to: > >>>>>> > >>>>>>> The following labels should be added by the client as appropriate > before metrics are pushed. > >>>>>> > >>>>>> Given that Kafka Streams uses the consumer/producer/admin client as > "black boxes", a client does at this point not know that it's part of a > Kafka Streams application, and thus, it won't be able to attach any such > label to the metrics it sends. (Also producer and admin don't even know the > value of `application.id` -- only the (main) consumer, indirectly via ` > group.id`, but also restore and global consumer don't know it, because > they don't have `group.id` set). > >>>>>> > >>>>>> While I am totally in favor of the proposal, I am wondering how we > intent to implement it in clean way? Or would we do ok to have some > internal client APIs that KS can use to "register" itself with the client? > >>>>>> > >>>>>> > >>>>>> > >>>>>>> While clients must support both temporalities, the broker will > initially only send GetTelemetrySubscriptionsResponse.DeltaTemporality=True > >>>>>> > >>>>>> Not sure if I can follow. How make the decision about DELTA or > CUMULATIVE metrics? Should the broker side plugin not decide what metrics > it what to receive in which form? So what does "initially" mean -- the > broker won't ship with a default plugin implementation? > >>>>>> > >>>>>> > >>>>>> > >>>>>>> The following method is added to the Producer, Consumer, and Admin > client interfaces: > >>>>>> > >>>>>> Should we add anything to Kafka Streams to expose the underlying > clients' assigned client-instance-ids programmatically? I am also wondering > if clients should report their assigned client-instance-ids as metrics > itself (for this case, Kafka Streams won't need to do anything, because we > already expose all client metrics). > >>>>>> > >>>>>> If we add anything programmatic, we need to make it simple, given > that Kafka Streams has many clients per `StreamThread` and may have > multiple threads. > >>>>>> > >>>>>> > >>>>>> > >>>>>>> enable.metrics.push > >>>>>> It might be worth to add this to `StreamsConfig`, too? It set via > StreamsConfig, we would forward it to all clients automatically. > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> -Matthias > >>>>>> > >>>>>> > >>>>>> On 9/29/23 5:45 PM, David Jacot wrote: > >>>>>>> Hi Andrew, > >>>>>>> Thanks for driving this one. I haven't read all the KIP yet but I > already > >>>>>>> have an initial question. In the Threading section, it is written > >>>>>>> "KafkaConsumer: the "background" thread (based on the consumer > threading > >>>>>>> refactor which is underway)". If I understand this correctly, it > means > >>>>>>> that KIP-714 won't work if the "old consumer" is used. Am I > correct? > >>>>>>> Cheers, > >>>>>>> David > >>>>>>> On Fri, Sep 22, 2023 at 12:18 PM Andrew Schofield < > >>>>>>> andrew_schofield_j...@outlook.com> wrote: > >>>>>>>> Hi Philip, > >>>>>>>> No, I do not think it should actively search for a broker that > supports > >>>>>>>> the new > >>>>>>>> RPCs. In general, either all of the brokers or none of the > brokers will > >>>>>>>> support it. > >>>>>>>> In the window, where the cluster is being upgraded or client > telemetry is > >>>>>>>> being > >>>>>>>> enabled, there might be a mixed situation. I wouldn’t put too > much effort > >>>>>>>> into > >>>>>>>> this mixed scenario. As the client finds brokers which support > the new > >>>>>>>> RPCs, > >>>>>>>> it can begin to follow the KIP-714 mechanism. > >>>>>>>> > >>>>>>>> Thanks, > >>>>>>>> Andrew > >>>>>>>> > >>>>>>>>> On 22 Sep 2023, at 20:01, Philip Nee <philip...@gmail.com> > wrote: > >>>>>>>>> > >>>>>>>>> Hi Andrew - > >>>>>>>>> > >>>>>>>>> Question on top of your answers: Do you think the client should > actively > >>>>>>>>> search for a broker that supports this RPC? As previously > mentioned, the > >>>>>>>>> broker uses the leastLoadedNode to find its first connection (am > >>>>>>>>> I correct?), and what if that broker doesn't support the metric > push? > >>>>>>>>> > >>>>>>>>> P > >>>>>>>>> > >>>>>>>>> On Fri, Sep 22, 2023 at 10:20 AM Andrew Schofield < > >>>>>>>>> andrew_schofield_j...@outlook.com> wrote: > >>>>>>>>> > >>>>>>>>>> Hi Kirk, > >>>>>>>>>> Thanks for your question. You are correct that the presence or > absence > >>>>>>>> of > >>>>>>>>>> the new RPCs in the > >>>>>>>>>> ApiVersionsResponse tells the client whether to request the > telemetry > >>>>>>>>>> subscriptions and push > >>>>>>>>>> metrics. > >>>>>>>>>> > >>>>>>>>>> This is of course tricky in practice. It would be conceivable, > as a > >>>>>>>>>> cluster is upgraded to AK 3.7 > >>>>>>>>>> or as a client metrics receiver plugin is deployed across the > cluster, > >>>>>>>>>> that a client connects to some > >>>>>>>>>> brokers that support the new RPCs and some that do not. > >>>>>>>>>> > >>>>>>>>>> Here’s my suggestion: > >>>>>>>>>> * If a client is not connected to any brokers that support in > the new > >>>>>>>>>> RPCs, it cannot push metrics. > >>>>>>>>>> * If a client is only connected to brokers that support the new > RPCs, it > >>>>>>>>>> will use the new RPCs in > >>>>>>>>>> accordance with the KIP. > >>>>>>>>>> * If a client is connected to some brokers that support the new > RPCs and > >>>>>>>>>> some that do not, it will > >>>>>>>>>> use the new RPCs with the supporting subset of brokers in > accordance > >>>>>>>> with > >>>>>>>>>> the KIP. > >>>>>>>>>> > >>>>>>>>>> Comments? > >>>>>>>>>> > >>>>>>>>>> Thanks, > >>>>>>>>>> Andrew > >>>>>>>>>> > >>>>>>>>>>> On 22 Sep 2023, at 16:01, Kirk True <k...@kirktrue.pro> wrote: > >>>>>>>>>>> > >>>>>>>>>>> Hi Andrew/Jun, > >>>>>>>>>>> > >>>>>>>>>>> I want to make sure I understand question/comment #119… In the > case > >>>>>>>>>> where a cluster without a metrics client receiver is later > reconfigured > >>>>>>>> and > >>>>>>>>>> restarted to include a metrics client receiver, do we want the > client to > >>>>>>>>>> thereafter begin pushing metrics to the cluster? From Andrew’s > response > >>>>>>>> to > >>>>>>>>>> question #119, it sounds like we’re using the presence/absence > of the > >>>>>>>>>> relevant RPCs in ApiVersionsResponse as the > to-push-or-not-to-push > >>>>>>>>>> indicator. Do I have that correct? > >>>>>>>>>>> > >>>>>>>>>>> Thanks, > >>>>>>>>>>> Kirk > >>>>>>>>>>> > >>>>>>>>>>>> On Sep 21, 2023, at 7:42 AM, Andrew Schofield < > >>>>>>>>>> andrew_schofield_j...@outlook.com> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>> Hi Jun, > >>>>>>>>>>>> Thanks for your comments. I’ve updated the KIP to clarify > where > >>>>>>>>>> necessary. > >>>>>>>>>>>> > >>>>>>>>>>>> 110. Yes, agree. The motivation section mentions this. > >>>>>>>>>>>> > >>>>>>>>>>>> 111. The replacement of ‘-‘ with ‘.’ for metric names and the > >>>>>>>>>> replacement of > >>>>>>>>>>>> ‘-‘ with ‘_’ for attribute keys is following the OTLP > guidelines. I > >>>>>>>>>> think it’s a bit > >>>>>>>>>>>> of a debatable point. OTLP makes a distinction between a > namespace > >>>>>>>> and a > >>>>>>>>>>>> multi-word component. If it was “client.id” then “client” > would be a > >>>>>>>>>> namespace with > >>>>>>>>>>>> an attribute key “id”. But “client_id” is just a key. So, it > was > >>>>>>>>>> intentional, but debatable. > >>>>>>>>>>>> > >>>>>>>>>>>> 112. Thanks. The link target moved. Fixed. > >>>>>>>>>>>> > >>>>>>>>>>>> 113. Thanks. Fixed. > >>>>>>>>>>>> > >>>>>>>>>>>> 114.1. If a standard metric makes sense for a client, it > should use > >>>>>>>> the > >>>>>>>>>> exact same > >>>>>>>>>>>> name. If a standard metric doesn’t make sense for a client, > then it > >>>>>>>> can > >>>>>>>>>> omit that metric. > >>>>>>>>>>>> > >>>>>>>>>>>> For a required metric, the situation is stronger. All clients > must > >>>>>>>>>> implement these > >>>>>>>>>>>> metrics with these names in order to implement the KIP. But > the > >>>>>>>>>> required metrics > >>>>>>>>>>>> are essentially the number of connections and the request > latency, > >>>>>>>>>> which do not > >>>>>>>>>>>> reference the underlying implementation of the client (which > >>>>>>>>>> producer.record.queue.time.max > >>>>>>>>>>>> of course does). > >>>>>>>>>>>> > >>>>>>>>>>>> I suppose someone might build a producer-only client that > didn’t have > >>>>>>>>>> consumer metrics. > >>>>>>>>>>>> In this case, the consumer metrics would conceptually have > the value 0 > >>>>>>>>>> and would not > >>>>>>>>>>>> need to be sent to the broker. > >>>>>>>>>>>> > >>>>>>>>>>>> 114.2. If a client does not implement some metrics, they will > not be > >>>>>>>>>> available for > >>>>>>>>>>>> analysis and troubleshooting. It just makes the ability to > combine > >>>>>>>>>> metrics from lots > >>>>>>>>>>>> different clients less complete. > >>>>>>>>>>>> > >>>>>>>>>>>> 115. I think it was probably a mistake to be so specific about > >>>>>>>>>> threading in this KIP. > >>>>>>>>>>>> When the consumer threading refactor is complete, of course, > it would > >>>>>>>>>> do the appropriate > >>>>>>>>>>>> equivalent. I’ve added a clarification and massively > simplified this > >>>>>>>>>> section. > >>>>>>>>>>>> > >>>>>>>>>>>> 116. I removed “client.terminating”. > >>>>>>>>>>>> > >>>>>>>>>>>> 117. Yes. Horrid. Fixed. > >>>>>>>>>>>> > >>>>>>>>>>>> 118. The Terminating flag just indicates that this is the > final > >>>>>>>>>> PushTelemetryRequest > >>>>>>>>>>>> from this client. Any subsequent request will be rejected. I > think > >>>>>>>> this > >>>>>>>>>> flag should remain. > >>>>>>>>>>>> > >>>>>>>>>>>> 119. Good catch. This was actually contradicting another part > of the > >>>>>>>>>> KIP. The current behaviour > >>>>>>>>>>>> is indeed preserved. If the broker doesn’t have a client > metrics > >>>>>>>>>> receiver plugin, the new RPCs > >>>>>>>>>>>> in this KIP are “turned off” and not reported in > ApiVersionsResponse. > >>>>>>>>>> The client will not > >>>>>>>>>>>> attempt to push metrics. > >>>>>>>>>>>> > >>>>>>>>>>>> 120. The error handling table lists the error codes for > >>>>>>>>>> PushTelemetryResponse. I’ve added one > >>>>>>>>>>>> but it looked good to me. GetTelemetrySubscriptions doesn’t > have any > >>>>>>>>>> error codes, since the > >>>>>>>>>>>> situation in which the client telemetry is not supported is > handled by > >>>>>>>>>> the RPCs not being offered > >>>>>>>>>>>> by the broker. > >>>>>>>>>>>> > >>>>>>>>>>>> 121. Again, I think it’s probably a mistake to be specific > about > >>>>>>>>>> threading. Removed. > >>>>>>>>>>>> > >>>>>>>>>>>> 122. Good catch. For DescribeConfigs, the ACL operation > should be > >>>>>>>>>>>> “DESCRIBE_CONFIGS”. For AlterConfigs, the ACL operation > should be > >>>>>>>>>>>> “ALTER” (not “WRITE” as it said). The checks are made on the > CLUSTER > >>>>>>>>>>>> resource. > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks for the detailed review. > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks, > >>>>>>>>>>>> Andrew > >>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> 110. Another potential motivation is the multiple clients > support. > >>>>>>>>>> Some of > >>>>>>>>>>>>> the places may not have good monitoring support for non-java > clients. > >>>>>>>>>>>>> > >>>>>>>>>>>>> 111. OpenTelemetry Naming: We replace '-' with '.' for > metric name > >>>>>>>> and > >>>>>>>>>>>>> replace '-' with '_' for attributes. Why is the > inconsistency? > >>>>>>>>>>>>> > >>>>>>>>>>>>> 112. OTLP specification: Page is not found from the link. > >>>>>>>>>>>>> > >>>>>>>>>>>>> 113. "Defining standard and required metrics makes the > monitoring and > >>>>>>>>>>>>> troubleshooting of clients from various client types ": > Incomplete > >>>>>>>>>> sentence. > >>>>>>>>>>>>> > >>>>>>>>>>>>> 114. standard/required metrics > >>>>>>>>>>>>> 114.1 Do other clients need to implement those metrics with > the exact > >>>>>>>>>> same > >>>>>>>>>>>>> names? > >>>>>>>>>>>>> 114.2 What happens if some of those metrics are missing from > a > >>>>>>>> client? > >>>>>>>>>>>>> > >>>>>>>>>>>>> 115. "KafkaConsumer: both the "heart beat" and application > threads": > >>>>>>>> We > >>>>>>>>>>>>> have an ongoing effort to refactor the consumer threading > model ( > >>>>>>>>>>>>> > >>>>>>>>>> > >>>>>>>> > https://cwiki.apache.org/confluence/display/KAFKA/Consumer+threading+refactor+design > >>>>>>>>>> ). > >>>>>>>>>>>>> Once this is done, PRC requests will only be made from the > background > >>>>>>>>>>>>> thread. Should this KIP follow the new model only? > >>>>>>>>>>>>> > >>>>>>>>>>>>> 116. 'The metrics should contain the reason for the client > >>>>>>>> termination > >>>>>>>>>> by > >>>>>>>>>>>>> including the client.terminating metric with the label > “reason” ...'. > >>>>>>>>>> Hmm, > >>>>>>>>>>>>> are we introducing a new metric client.terminating? If so, > that needs > >>>>>>>>>> to be > >>>>>>>>>>>>> explicitly listed. > >>>>>>>>>>>>> > >>>>>>>>>>>>> 117. "As the metrics plugin may need to add additional > metrics on top > >>>>>>>>>> of > >>>>>>>>>>>>> this the generic metrics receiver in the broker will not add > these > >>>>>>>>>> labels > >>>>>>>>>>>>> but rely on the plugins to do so," The sentence doesn't read > well. > >>>>>>>>>>>>> > >>>>>>>>>>>>> 118. "it is possible for the client to send at most one > accepted > >>>>>>>>>>>>> out-of-profile per connection before the rate-limiter kicks > in": If > >>>>>>>> we > >>>>>>>>>> do > >>>>>>>>>>>>> this, do we still need the Terminating flag in > >>>>>>>> PushTelemetryRequestV0? > >>>>>>>>>>>>> > >>>>>>>>>>>>> 119. "If there is no client metrics receiver plugin > configured on the > >>>>>>>>>>>>> broker, it will respond to GetTelemetrySubscriptionsRequest > with > >>>>>>>>>>>>> RequestedMetrics set to Null and a -1 SubscriptionId. The > client > >>>>>>>> should > >>>>>>>>>>>>> send a new GetTelemetrySubscriptionsRequest after the > PushIntervalMs > >>>>>>>>>> has > >>>>>>>>>>>>> expired. This allows the metrics receiver to be enabled or > disabled > >>>>>>>>>> without > >>>>>>>>>>>>> having to restart the broker or reset the client connection." > >>>>>>>>>>>>> "no client metrics receiver plugin configured" is defined by > no > >>>>>>>> metric > >>>>>>>>>>>>> reporter implementing the ClientTelemetry interface, right? > In that > >>>>>>>>>> case, > >>>>>>>>>>>>> it would be useful to avoid the clients sending > >>>>>>>>>>>>> GetTelemetrySubscriptionsRequest periodically to preserve > the current > >>>>>>>>>>>>> behavior. > >>>>>>>>>>>>> > >>>>>>>>>>>>> 120. GetTelemetrySubscriptionsResponseV0 and > PushTelemetryRequestV0: > >>>>>>>>>> Could > >>>>>>>>>>>>> we list error codes for each? > >>>>>>>>>>>>> > >>>>>>>>>>>>> 121. "ClientTelemetryReceiver.ClientTelemetryReceiver This > method may > >>>>>>>>>> be > >>>>>>>>>>>>> called from the request handling thread": Where else can > this method > >>>>>>>> be > >>>>>>>>>>>>> called? > >>>>>>>>>>>>> > >>>>>>>>>>>>> 122. DescribeConfigs/AlterConfigs already exist. Are we > changing the > >>>>>>>>>> ACL? > >>>>>>>>>>>>> > >>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>> > >>>>>>>>>>>>> Jun > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Mon, Jul 31, 2023 at 4:33 AM Andrew Schofield < > >>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>>> Hi Milind, > >>>>>>>>>>>>>> Thanks for your question. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On reflection, I agree that INVALID_RECORD is most likely > to be > >>>>>>>>>> caused by a > >>>>>>>>>>>>>> problem in the serialization in the client. I have changed > the > >>>>>>>> client > >>>>>>>>>>>>>> action in this case > >>>>>>>>>>>>>> to “Log an error and stop pushing metrics”. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I have updated the KIP text accordingly. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>> Andrew > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> On 31 Jul 2023, at 12:09, Milind Luthra > >>>>>>>>>> <milut...@confluent.io.INVALID> > >>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Hi Andrew, > >>>>>>>>>>>>>>> Thanks for the clarifications. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> About 2b: > >>>>>>>>>>>>>>> In case a client has a bug while serializing, it might be > difficult > >>>>>>>>>> for > >>>>>>>>>>>>>> the > >>>>>>>>>>>>>>> client to recover from that without code changes. In that, > it might > >>>>>>>>>> be > >>>>>>>>>>>>>> good > >>>>>>>>>>>>>>> to just log the INVALID_RECORD as an error, and treat the > error as > >>>>>>>>>> fatal > >>>>>>>>>>>>>>> for the client (only fatal in terms of sending the > metrics, the > >>>>>>>>>> client > >>>>>>>>>>>>>> can > >>>>>>>>>>>>>>> keep functioning otherwise). What do you think? > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Thanks > >>>>>>>>>>>>>>> Milind > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> On Mon, Jul 24, 2023 at 8:18 PM Andrew Schofield < > >>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Hi Milind, > >>>>>>>>>>>>>>>> Thanks for your questions about the KIP. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> 1) I did some archaeology and looked at historical > versions of the > >>>>>>>>>> KIP. > >>>>>>>>>>>>>> I > >>>>>>>>>>>>>>>> think this is > >>>>>>>>>>>>>>>> just a mistake. 5 minutes is the default metric push > interval. 30 > >>>>>>>>>>>>>> minutes > >>>>>>>>>>>>>>>> is a mystery > >>>>>>>>>>>>>>>> to me. I’ve updated the KIP. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> 2) I think there are two situations in which > INVALID_RECORD might > >>>>>>>>>> occur. > >>>>>>>>>>>>>>>> a) The client might perhaps be using a content-type that > the > >>>>>>>> broker > >>>>>>>>>> does > >>>>>>>>>>>>>>>> not support. > >>>>>>>>>>>>>>>> The KIP mentions content-type as a future extension, but > there’s > >>>>>>>>>> only > >>>>>>>>>>>>>> one > >>>>>>>>>>>>>>>> supported > >>>>>>>>>>>>>>>> to start with. Until we have multiple content-types, this > seems > >>>>>>>> out > >>>>>>>>>> of > >>>>>>>>>>>>>>>> scope. I think a > >>>>>>>>>>>>>>>> future KIP would add another error code for this. > >>>>>>>>>>>>>>>> b) The client might perhaps have a bug which means the > metrics > >>>>>>>>>> payload > >>>>>>>>>>>>>> is > >>>>>>>>>>>>>>>> malformed. > >>>>>>>>>>>>>>>> Logging a warning and attempting the next metrics push on > the push > >>>>>>>>>>>>>>>> interval seems > >>>>>>>>>>>>>>>> appropriate. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> UNKNOWN_SUBSCRIPTION_ID would indeed be handled by making > an > >>>>>>>>>> immediate > >>>>>>>>>>>>>>>> GetTelemetrySubscriptionsRequest. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> UNSUPPORTED_COMPRESSION_TYPE seems like either a client > bug or > >>>>>>>>>> perhaps > >>>>>>>>>>>>>>>> a situation in which a broker sends a compression type in > a > >>>>>>>>>>>>>>>> GetTelemetrySubscriptionsResponse > >>>>>>>>>>>>>>>> which is subsequently not supported when its used with a > >>>>>>>>>>>>>>>> PushTelemetryRequest. > >>>>>>>>>>>>>>>> We do want the client to have the opportunity to get an > up-to-date > >>>>>>>>>> list > >>>>>>>>>>>>>> of > >>>>>>>>>>>>>>>> supported > >>>>>>>>>>>>>>>> compression types. I think an immediate > >>>>>>>>>> GetTelemetrySubscriptionsRequest > >>>>>>>>>>>>>>>> is appropriate. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> 3) If a client attempts a subsequent handshake with a Null > >>>>>>>>>>>>>>>> ClientInstanceId, the > >>>>>>>>>>>>>>>> receiving broker may not already know the client's > existing > >>>>>>>>>>>>>>>> ClientInstanceId. If the > >>>>>>>>>>>>>>>> receiving broker knows the existing ClientInstanceId, it > simply > >>>>>>>>>> responds > >>>>>>>>>>>>>>>> the existing > >>>>>>>>>>>>>>>> value back to the client. If it does not know the existing > >>>>>>>>>>>>>>>> ClientInstanceId, it will create > >>>>>>>>>>>>>>>> a new client instance ID and respond with that. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> I will update the KIP with these clarifications. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>> Andrew > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> On 17 Jul 2023, at 14:21, Milind Luthra > >>>>>>>>>> <milut...@confluent.io.INVALID > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Hi Andrew, thanks for this KIP. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> I had a few questions regarding the "Error handling" > section. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> 1. It mentions that "The 5 and 30 minute retries are to > >>>>>>>> eventually > >>>>>>>>>>>>>>>> trigger > >>>>>>>>>>>>>>>>> a retry and avoid having to restart clients if the > cluster > >>>>>>>> metrics > >>>>>>>>>>>>>>>>> configuration is disabled temporarily, e.g., by operator > error, > >>>>>>>>>> rolling > >>>>>>>>>>>>>>>>> upgrades, etc." > >>>>>>>>>>>>>>>>> But this 30 min interval isn't mentioned anywhere else. > What is > >>>>>>>> it > >>>>>>>>>>>>>>>>> referring to? > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> 2. For the actual errors: > >>>>>>>>>>>>>>>>> INVALID_RECORD : The action required is to "Log a > warning to the > >>>>>>>>>>>>>>>>> application and schedule the next > >>>>>>>> GetTelemetrySubscriptionsRequest > >>>>>>>>>> to 5 > >>>>>>>>>>>>>>>>> minutes". Why is this 5 minutes, and not something like > >>>>>>>>>> PushIntervalMs? > >>>>>>>>>>>>>>>> And > >>>>>>>>>>>>>>>>> also, why are we scheduling a > GetTelemetrySubscriptionsRequest in > >>>>>>>>>> this > >>>>>>>>>>>>>>>>> case, if the serialization is broken? > >>>>>>>>>>>>>>>>> UNKNOWN_SUBSCRIPTION_ID , UNSUPPORTED_COMPRESSION_TYPE : > just to > >>>>>>>>>>>>>> confirm, > >>>>>>>>>>>>>>>>> the GetTelemetrySubscriptionsRequest needs to be > scheduled > >>>>>>>>>> immediately > >>>>>>>>>>>>>>>>> after the PushTelemetry response, correct? > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> 3. For "Subsequent GetTelemetrySubscriptionsRequests > must include > >>>>>>>>>> the > >>>>>>>>>>>>>>>>> ClientInstanceId returned in the first response, > regardless of > >>>>>>>>>> broker": > >>>>>>>>>>>>>>>>> Will a broker error be returned in case some > implementation of > >>>>>>>>>> this KIP > >>>>>>>>>>>>>>>>> violates this accidentally and sends a request with > >>>>>>>>>> ClientInstanceId = > >>>>>>>>>>>>>>>> Null > >>>>>>>>>>>>>>>>> even when it's been obtained already? Or will a new > >>>>>>>>>> ClientInstanceId be > >>>>>>>>>>>>>>>>> returned without an error? > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Thanks! > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield < > >>>>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Hi, > >>>>>>>>>>>>>>>>>> I would like to start a new discussion thread on > KIP-714: Client > >>>>>>>>>>>>>> metrics > >>>>>>>>>>>>>>>>>> and observability. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>> > >>>>>>>> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> I have edited the proposal significantly to reduce the > scope. > >>>>>>>> The > >>>>>>>>>>>>>>>> overall > >>>>>>>>>>>>>>>>>> mechanism for client metric subscriptions is unchanged, > but the > >>>>>>>>>>>>>>>>>> KIP is now based on the existing client metrics, rather > than > >>>>>>>>>>>>>> introducing > >>>>>>>>>>>>>>>>>> new metrics. The purpose remains helping cluster > operators > >>>>>>>>>>>>>>>>>> investigate performance problems experienced by clients > without > >>>>>>>>>>>>>>>> requiring > >>>>>>>>>>>>>>>>>> changes to the client application code or configuration. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>>>> Andrew > > > > >