Regarding the naming, I personally think `clientInstanceId` makes sense for the plain clients -- especially if we might later introduce the notion of an `applicationInstanceId`.
I'm not a huge fan of `clientsInstanceIds` for the Kafka Streams API, though, can we use `clientInstanceIds` instead? (The difference being the placement of the plural 's') I would similarly rename the class to just ClientInstanceIds we can also not have a timeout-less overload, because `KafkaStreams` does > not have a `default.api.timeout.ms` config either With respect to the timeout for the Kafka Streams API, I'm a bit confused by the doubletriple-negative of Matthias' comment here, but I was thinking about this earlier and this was my take: with the current proposal, we would allow users to pass in an absolute timeout as a parameter that would apply to the method as a whole. Meanwhile within the method we would issue separate calls to each of the clients using the default or user-configured value of their `default.api.timeout.ms` as the timeout parameter. So the API as proposed makes sense to me. On Wed, Oct 11, 2023 at 6:48 PM Matthias J. Sax <mj...@apache.org> wrote: > In can answer 130 and 131. > > 130) We cannot guarantee that all clients are already initialized due to > race conditions. We plan to not allow calling > `KafkaStreams#clientsInstanceIds()` when the state is not RUNNING (or > REBALANCING) though -- guess this slipped on the KIP and should be > added? But because StreamThreads can be added dynamically (and producer > might be created dynamically at runtime; cf below), we still cannot > guarantee that all clients are already initialized when the method is > called. Of course, we assume that all clients are most likely initialize > on the happy path, and blocking calls to `client.clientInstanceId()` > should be rare. > > To address the worst case, we won't do a naive implementation and just > loop over all clients, but fan-out the call to the different > StreamThreads (and GlobalStreamThread if it exists), and use Futures to > gather the results. > > Currently, `StreamThreads` has 3 clients (if ALOS or EOSv2 is used), so > we might do 3 blocking calls in the worst case (for EOSv1 we get a > producer per tasks, and we might end up doing more blocking calls if the > producers are not initialized yet). Note that EOSv1 is already > deprecated, and we are also working on thread refactoring that will > reduce the number of client on StreamThread to 2 -- and we have more > refactoring planned to reduce the number of clients even further. > > Inside `KafakStreams#clientsInstanceIds()` we might only do single > blocking call for the admin client (ie, `admin.clientInstanceId()`). > > I agree that we need to do some clever timeout management, but it seems > to be more of an implementation detail? > > Do you have any particular concerns, or does the proposed implementation > as sketched above address your question? > > > 130) If the Topology does not have a global-state-store, there won't be > a GlobalThread and thus not global consumer. Thus, we return an Optional. > > > > On three related question for Andrew. > > (1) Why is the method called `clientInstanceId()` and not just plain > `instanceId()`? > > (2) Why so we return a `String` while but not a UUID type? The added > protocol request/response classes use UUIDs. > > (3) Would it make sense to have an overloaded `clientInstanceId()` > method that does not take any parameter but uses `default.api.timeout` > config (this config does no exist on the producer though, so we could > only have it for consumer and admin at this point). We could of course > also add overloads like this later if user request them (and/or add > `default.api.timeout.ms` to the producer, too). > > Btw: For KafkaStreams, I think `clientsInstanceIds` still makes sense as > a method name though, as `KafkaStreams` itself does not have an > `instanceId` -- we can also not have a timeout-less overload, because > `KafkaStreams` does not have a `default.api.timeout.ms` config either > (and I don't think it make sense to add). > > > > -Matthias > > On 10/11/23 5:07 PM, Jun Rao wrote: > > Hi, Andrew, > > > > Thanks for the updated KIP. Just a few more minor comments. > > > > 130. KafkaStreams.clientsInstanceId(Duration timeout): Does it wait for > all > > consumer/producer/adminClient instances to be initialized? Are all those > > instances created during KafkaStreams initialization? > > > > 131. Why does globalConsumerInstanceId() return Optional<String> while > > other consumer instances don't return Optional? > > > > 132. ClientMetricsSubscriptionRequestCount: Do we need this since we > have a > > set of generic metrics > > (kafka.network:type=RequestMetrics,name=RequestsPerSec,request=*) that > > report Request rate for every request type? > > > > Thanks, > > > > Jun > > > > On Wed, Oct 11, 2023 at 1:47 PM Matthias J. Sax <mj...@apache.org> > wrote: > > > >> Thanks! > >> > >> On 10/10/23 11:31 PM, Andrew Schofield wrote: > >>> Matthias, > >>> Yes, I think that’s a sensible way forward and the interface you > propose > >> looks good. I’ll update the KIP accordingly. > >>> > >>> Thanks, > >>> Andrew > >>> > >>>> On 10 Oct 2023, at 23:01, Matthias J. Sax <mj...@apache.org> wrote: > >>>> > >>>> Andrew, > >>>> > >>>> yes I would like to get this change into KIP-714 right way. Seems to > be > >> important, as we don't know if/when a follow-up KIP for Kafka Streams > would > >> land. > >>>> > >>>> I was also thinking (and discussed with a few others) how to expose > it, > >> and we would propose the following: > >>>> > >>>> We add a new method to `KafkaStreams` class: > >>>> > >>>> public ClientsInstanceIds clientsInstanceIds(Duration timeout); > >>>> > >>>> The returned object is like below: > >>>> > >>>> public class ClientsInstanceIds { > >>>> // we only have a single admin client per KS instance > >>>> String adminInstanceId(); > >>>> > >>>> // we only have a single global consumer per KS instance (if any) > >>>> // Optional<> because we might not have global-thread > >>>> Optional<String> globalConsumerInstanceId(); > >>>> > >>>> // return a <threadKey -> ClientInstanceId> mapping > >>>> // for the underlying (restore-)consumers/producers > >>>> Map<String, String> mainConsumerInstanceIds(); > >>>> Map<String, String> restoreConsumerInstanceIds(); > >>>> Map<String, String> producerInstanceIds(); > >>>> } > >>>> > >>>> For the `threadKey`, we would use some pattern like this: > >>>> > >>>> [Stream|StateUpdater]Thread-<threadIdx> > >>>> > >>>> > >>>> Would this work from your POV? > >>>> > >>>> > >>>> > >>>> -Matthias > >>>> > >>>> > >>>> On 10/9/23 2:15 AM, Andrew Schofield wrote: > >>>>> Hi Matthias, > >>>>> Good point. Makes sense to me. > >>>>> Is this something that can also be included in the proposed Kafka > >> Streams follow-on KIP, or would you prefer that I add it to KIP-714? > >>>>> I have a slight preference for the former to put all of the KS > >> enhancements into a separate KIP. > >>>>> Thanks, > >>>>> Andrew > >>>>>> On 7 Oct 2023, at 02:12, Matthias J. Sax <mj...@apache.org> wrote: > >>>>>> > >>>>>> Thanks Andrew. SGTM. > >>>>>> > >>>>>> One point you did not address is the idea to add a method to > >> `KafkaStreams` similar to the proposed `clientInstanceId()` that will be > >> added to consumer/producer/admin clients. > >>>>>> > >>>>>> Without addressing this, Kafka Streams users won't have a way to get > >> the assigned `instanceId` of the internally created clients, and thus it > >> would be very difficult for them to know which metrics that the broker > >> receives belong to a Kafka Streams app. It seems they would only find > the > >> `instanceIds` in the log4j output if they enable client logging? > >>>>>> > >>>>>> Of course, because there is multiple clients inside Kafka Streams, > >> the return type cannot be an single "String", but must be some some > complex > >> data structure -- we could either add a new class, or return a > >> Map<String,String> using a client key that maps to the `instanceId`. > >>>>>> > >>>>>> For example we could use the following key: > >>>>>> > >>>>>> > [Global]StreamThread[-<threadIndex>][-restore][consumer|producer] > >>>>>> > >>>>>> (Of course, only the valid combination.) > >>>>>> > >>>>>> Or maybe even better, we might want to return a `Future` because > >> collection all the `instanceId` might be a blocking all on each client? > I > >> have already a few idea how it could be implemented but I don't think it > >> must be discussed on the KIP, as it's an implementation detail. > >>>>>> > >>>>>> Thoughts? > >>>>>> > >>>>>> > >>>>>> -Matthias > >>>>>> > >>>>>> On 10/6/23 4:21 AM, Andrew Schofield wrote: > >>>>>>> Hi Matthias, > >>>>>>> Thanks for your comments. I agree that a follow-up KIP for Kafka > >> Streams makes sense. This KIP currently has made a bit > >>>>>>> of an effort to embrace KS, but it’s not enough by a long way. > >>>>>>> I have removed `application.id <http://application.id/>`. This > >> should be done properly in the follow-up KIP. I don’t believe there’s a > >> downside to > >>>>>>> removing it from this KIP. > >>>>>>> I have reworded the statement about temporarily. In practice, the > >> implementation of this KIP that’s going on while the voting > >>>>>>> progresses happens to use delta temporality, but that’s an > >> implementation detail. Supporting clients must support both > >>>>>>> temporalities. > >>>>>>> I thought about exposing the client instance ID as a metric, but > >> non-numeric metrics are not usual practice and tools > >>>>>>> do not universally support them. I don’t think the KIP is improved > >> by adding one now. > >>>>>>> I have also added constants for the various Config classes for > >> ENABLE_METRICS_PUSH_CONFIG, including to > >>>>>>> StreamsConfig. It’s best to be explicit about this. > >>>>>>> Thanks, > >>>>>>> Andrew > >>>>>>>> On 2 Oct 2023, at 23:47, Matthias J. Sax <mj...@apache.org> > wrote: > >>>>>>>> > >>>>>>>> Hi, > >>>>>>>> > >>>>>>>> I did not pay attention to this KIP in the past; seems it was > >> on-hold for a while. > >>>>>>>> > >>>>>>>> Overall it sounds very useful, and I think we should extend this > >> with a follow up KIP for Kafka Streams. What is unclear to me at this > point > >> is the statement: > >>>>>>>> > >>>>>>>>> Kafka Streams applications have an application.id configured and > >> this identifier should be included as the application_id metrics label. > >>>>>>>> > >>>>>>>> The `application.id` is currently only used as the (main) > >> consumer's `group.id` (and is part of an auto-generated `client.id` if > >> the user does not set one). > >>>>>>>> > >>>>>>>> This comment related to: > >>>>>>>> > >>>>>>>>> The following labels should be added by the client as appropriate > >> before metrics are pushed. > >>>>>>>> > >>>>>>>> Given that Kafka Streams uses the consumer/producer/admin client > as > >> "black boxes", a client does at this point not know that it's part of a > >> Kafka Streams application, and thus, it won't be able to attach any such > >> label to the metrics it sends. (Also producer and admin don't even know > the > >> value of `application.id` -- only the (main) consumer, indirectly via ` > >> group.id`, but also restore and global consumer don't know it, because > >> they don't have `group.id` set). > >>>>>>>> > >>>>>>>> While I am totally in favor of the proposal, I am wondering how we > >> intent to implement it in clean way? Or would we do ok to have some > >> internal client APIs that KS can use to "register" itself with the > client? > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> While clients must support both temporalities, the broker will > >> initially only send > GetTelemetrySubscriptionsResponse.DeltaTemporality=True > >>>>>>>> > >>>>>>>> Not sure if I can follow. How make the decision about DELTA or > >> CUMULATIVE metrics? Should the broker side plugin not decide what > metrics > >> it what to receive in which form? So what does "initially" mean -- the > >> broker won't ship with a default plugin implementation? > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> The following method is added to the Producer, Consumer, and > Admin > >> client interfaces: > >>>>>>>> > >>>>>>>> Should we add anything to Kafka Streams to expose the underlying > >> clients' assigned client-instance-ids programmatically? I am also > wondering > >> if clients should report their assigned client-instance-ids as metrics > >> itself (for this case, Kafka Streams won't need to do anything, because > we > >> already expose all client metrics). > >>>>>>>> > >>>>>>>> If we add anything programmatic, we need to make it simple, given > >> that Kafka Streams has many clients per `StreamThread` and may have > >> multiple threads. > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> enable.metrics.push > >>>>>>>> It might be worth to add this to `StreamsConfig`, too? It set via > >> StreamsConfig, we would forward it to all clients automatically. > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> -Matthias > >>>>>>>> > >>>>>>>> > >>>>>>>> On 9/29/23 5:45 PM, David Jacot wrote: > >>>>>>>>> Hi Andrew, > >>>>>>>>> Thanks for driving this one. I haven't read all the KIP yet but I > >> already > >>>>>>>>> have an initial question. In the Threading section, it is written > >>>>>>>>> "KafkaConsumer: the "background" thread (based on the consumer > >> threading > >>>>>>>>> refactor which is underway)". If I understand this correctly, it > >> means > >>>>>>>>> that KIP-714 won't work if the "old consumer" is used. Am I > >> correct? > >>>>>>>>> Cheers, > >>>>>>>>> David > >>>>>>>>> On Fri, Sep 22, 2023 at 12:18 PM Andrew Schofield < > >>>>>>>>> andrew_schofield_j...@outlook.com> wrote: > >>>>>>>>>> Hi Philip, > >>>>>>>>>> No, I do not think it should actively search for a broker that > >> supports > >>>>>>>>>> the new > >>>>>>>>>> RPCs. In general, either all of the brokers or none of the > >> brokers will > >>>>>>>>>> support it. > >>>>>>>>>> In the window, where the cluster is being upgraded or client > >> telemetry is > >>>>>>>>>> being > >>>>>>>>>> enabled, there might be a mixed situation. I wouldn’t put too > >> much effort > >>>>>>>>>> into > >>>>>>>>>> this mixed scenario. As the client finds brokers which support > >> the new > >>>>>>>>>> RPCs, > >>>>>>>>>> it can begin to follow the KIP-714 mechanism. > >>>>>>>>>> > >>>>>>>>>> Thanks, > >>>>>>>>>> Andrew > >>>>>>>>>> > >>>>>>>>>>> On 22 Sep 2023, at 20:01, Philip Nee <philip...@gmail.com> > >> wrote: > >>>>>>>>>>> > >>>>>>>>>>> Hi Andrew - > >>>>>>>>>>> > >>>>>>>>>>> Question on top of your answers: Do you think the client should > >> actively > >>>>>>>>>>> search for a broker that supports this RPC? As previously > >> mentioned, the > >>>>>>>>>>> broker uses the leastLoadedNode to find its first connection > (am > >>>>>>>>>>> I correct?), and what if that broker doesn't support the metric > >> push? > >>>>>>>>>>> > >>>>>>>>>>> P > >>>>>>>>>>> > >>>>>>>>>>> On Fri, Sep 22, 2023 at 10:20 AM Andrew Schofield < > >>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote: > >>>>>>>>>>> > >>>>>>>>>>>> Hi Kirk, > >>>>>>>>>>>> Thanks for your question. You are correct that the presence or > >> absence > >>>>>>>>>> of > >>>>>>>>>>>> the new RPCs in the > >>>>>>>>>>>> ApiVersionsResponse tells the client whether to request the > >> telemetry > >>>>>>>>>>>> subscriptions and push > >>>>>>>>>>>> metrics. > >>>>>>>>>>>> > >>>>>>>>>>>> This is of course tricky in practice. It would be conceivable, > >> as a > >>>>>>>>>>>> cluster is upgraded to AK 3.7 > >>>>>>>>>>>> or as a client metrics receiver plugin is deployed across the > >> cluster, > >>>>>>>>>>>> that a client connects to some > >>>>>>>>>>>> brokers that support the new RPCs and some that do not. > >>>>>>>>>>>> > >>>>>>>>>>>> Here’s my suggestion: > >>>>>>>>>>>> * If a client is not connected to any brokers that support in > >> the new > >>>>>>>>>>>> RPCs, it cannot push metrics. > >>>>>>>>>>>> * If a client is only connected to brokers that support the > new > >> RPCs, it > >>>>>>>>>>>> will use the new RPCs in > >>>>>>>>>>>> accordance with the KIP. > >>>>>>>>>>>> * If a client is connected to some brokers that support the > new > >> RPCs and > >>>>>>>>>>>> some that do not, it will > >>>>>>>>>>>> use the new RPCs with the supporting subset of brokers in > >> accordance > >>>>>>>>>> with > >>>>>>>>>>>> the KIP. > >>>>>>>>>>>> > >>>>>>>>>>>> Comments? > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks, > >>>>>>>>>>>> Andrew > >>>>>>>>>>>> > >>>>>>>>>>>>> On 22 Sep 2023, at 16:01, Kirk True <k...@kirktrue.pro> > wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>> Hi Andrew/Jun, > >>>>>>>>>>>>> > >>>>>>>>>>>>> I want to make sure I understand question/comment #119… In > the > >> case > >>>>>>>>>>>> where a cluster without a metrics client receiver is later > >> reconfigured > >>>>>>>>>> and > >>>>>>>>>>>> restarted to include a metrics client receiver, do we want the > >> client to > >>>>>>>>>>>> thereafter begin pushing metrics to the cluster? From Andrew’s > >> response > >>>>>>>>>> to > >>>>>>>>>>>> question #119, it sounds like we’re using the presence/absence > >> of the > >>>>>>>>>>>> relevant RPCs in ApiVersionsResponse as the > >> to-push-or-not-to-push > >>>>>>>>>>>> indicator. Do I have that correct? > >>>>>>>>>>>>> > >>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>> Kirk > >>>>>>>>>>>>> > >>>>>>>>>>>>>> On Sep 21, 2023, at 7:42 AM, Andrew Schofield < > >>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Hi Jun, > >>>>>>>>>>>>>> Thanks for your comments. I’ve updated the KIP to clarify > >> where > >>>>>>>>>>>> necessary. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> 110. Yes, agree. The motivation section mentions this. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> 111. The replacement of ‘-‘ with ‘.’ for metric names and > the > >>>>>>>>>>>> replacement of > >>>>>>>>>>>>>> ‘-‘ with ‘_’ for attribute keys is following the OTLP > >> guidelines. I > >>>>>>>>>>>> think it’s a bit > >>>>>>>>>>>>>> of a debatable point. OTLP makes a distinction between a > >> namespace > >>>>>>>>>> and a > >>>>>>>>>>>>>> multi-word component. If it was “client.id” then “client” > >> would be a > >>>>>>>>>>>> namespace with > >>>>>>>>>>>>>> an attribute key “id”. But “client_id” is just a key. So, it > >> was > >>>>>>>>>>>> intentional, but debatable. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> 112. Thanks. The link target moved. Fixed. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> 113. Thanks. Fixed. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> 114.1. If a standard metric makes sense for a client, it > >> should use > >>>>>>>>>> the > >>>>>>>>>>>> exact same > >>>>>>>>>>>>>> name. If a standard metric doesn’t make sense for a client, > >> then it > >>>>>>>>>> can > >>>>>>>>>>>> omit that metric. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> For a required metric, the situation is stronger. All > clients > >> must > >>>>>>>>>>>> implement these > >>>>>>>>>>>>>> metrics with these names in order to implement the KIP. But > >> the > >>>>>>>>>>>> required metrics > >>>>>>>>>>>>>> are essentially the number of connections and the request > >> latency, > >>>>>>>>>>>> which do not > >>>>>>>>>>>>>> reference the underlying implementation of the client (which > >>>>>>>>>>>> producer.record.queue.time.max > >>>>>>>>>>>>>> of course does). > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I suppose someone might build a producer-only client that > >> didn’t have > >>>>>>>>>>>> consumer metrics. > >>>>>>>>>>>>>> In this case, the consumer metrics would conceptually have > >> the value 0 > >>>>>>>>>>>> and would not > >>>>>>>>>>>>>> need to be sent to the broker. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> 114.2. If a client does not implement some metrics, they > will > >> not be > >>>>>>>>>>>> available for > >>>>>>>>>>>>>> analysis and troubleshooting. It just makes the ability to > >> combine > >>>>>>>>>>>> metrics from lots > >>>>>>>>>>>>>> different clients less complete. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> 115. I think it was probably a mistake to be so specific > about > >>>>>>>>>>>> threading in this KIP. > >>>>>>>>>>>>>> When the consumer threading refactor is complete, of course, > >> it would > >>>>>>>>>>>> do the appropriate > >>>>>>>>>>>>>> equivalent. I’ve added a clarification and massively > >> simplified this > >>>>>>>>>>>> section. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> 116. I removed “client.terminating”. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> 117. Yes. Horrid. Fixed. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> 118. The Terminating flag just indicates that this is the > >> final > >>>>>>>>>>>> PushTelemetryRequest > >>>>>>>>>>>>>> from this client. Any subsequent request will be rejected. I > >> think > >>>>>>>>>> this > >>>>>>>>>>>> flag should remain. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> 119. Good catch. This was actually contradicting another > part > >> of the > >>>>>>>>>>>> KIP. The current behaviour > >>>>>>>>>>>>>> is indeed preserved. If the broker doesn’t have a client > >> metrics > >>>>>>>>>>>> receiver plugin, the new RPCs > >>>>>>>>>>>>>> in this KIP are “turned off” and not reported in > >> ApiVersionsResponse. > >>>>>>>>>>>> The client will not > >>>>>>>>>>>>>> attempt to push metrics. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> 120. The error handling table lists the error codes for > >>>>>>>>>>>> PushTelemetryResponse. I’ve added one > >>>>>>>>>>>>>> but it looked good to me. GetTelemetrySubscriptions doesn’t > >> have any > >>>>>>>>>>>> error codes, since the > >>>>>>>>>>>>>> situation in which the client telemetry is not supported is > >> handled by > >>>>>>>>>>>> the RPCs not being offered > >>>>>>>>>>>>>> by the broker. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> 121. Again, I think it’s probably a mistake to be specific > >> about > >>>>>>>>>>>> threading. Removed. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> 122. Good catch. For DescribeConfigs, the ACL operation > >> should be > >>>>>>>>>>>>>> “DESCRIBE_CONFIGS”. For AlterConfigs, the ACL operation > >> should be > >>>>>>>>>>>>>> “ALTER” (not “WRITE” as it said). The checks are made on the > >> CLUSTER > >>>>>>>>>>>>>> resource. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Thanks for the detailed review. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>> Andrew > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 110. Another potential motivation is the multiple clients > >> support. > >>>>>>>>>>>> Some of > >>>>>>>>>>>>>>> the places may not have good monitoring support for > non-java > >> clients. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 111. OpenTelemetry Naming: We replace '-' with '.' for > >> metric name > >>>>>>>>>> and > >>>>>>>>>>>>>>> replace '-' with '_' for attributes. Why is the > >> inconsistency? > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 112. OTLP specification: Page is not found from the link. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 113. "Defining standard and required metrics makes the > >> monitoring and > >>>>>>>>>>>>>>> troubleshooting of clients from various client types ": > >> Incomplete > >>>>>>>>>>>> sentence. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 114. standard/required metrics > >>>>>>>>>>>>>>> 114.1 Do other clients need to implement those metrics with > >> the exact > >>>>>>>>>>>> same > >>>>>>>>>>>>>>> names? > >>>>>>>>>>>>>>> 114.2 What happens if some of those metrics are missing > from > >> a > >>>>>>>>>> client? > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 115. "KafkaConsumer: both the "heart beat" and application > >> threads": > >>>>>>>>>> We > >>>>>>>>>>>>>>> have an ongoing effort to refactor the consumer threading > >> model ( > >>>>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>> > >> > https://cwiki.apache.org/confluence/display/KAFKA/Consumer+threading+refactor+design > >>>>>>>>>>>> ). > >>>>>>>>>>>>>>> Once this is done, PRC requests will only be made from the > >> background > >>>>>>>>>>>>>>> thread. Should this KIP follow the new model only? > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 116. 'The metrics should contain the reason for the client > >>>>>>>>>> termination > >>>>>>>>>>>> by > >>>>>>>>>>>>>>> including the client.terminating metric with the label > >> “reason” ...'. > >>>>>>>>>>>> Hmm, > >>>>>>>>>>>>>>> are we introducing a new metric client.terminating? If so, > >> that needs > >>>>>>>>>>>> to be > >>>>>>>>>>>>>>> explicitly listed. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 117. "As the metrics plugin may need to add additional > >> metrics on top > >>>>>>>>>>>> of > >>>>>>>>>>>>>>> this the generic metrics receiver in the broker will not > add > >> these > >>>>>>>>>>>> labels > >>>>>>>>>>>>>>> but rely on the plugins to do so," The sentence doesn't > read > >> well. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 118. "it is possible for the client to send at most one > >> accepted > >>>>>>>>>>>>>>> out-of-profile per connection before the rate-limiter kicks > >> in": If > >>>>>>>>>> we > >>>>>>>>>>>> do > >>>>>>>>>>>>>>> this, do we still need the Terminating flag in > >>>>>>>>>> PushTelemetryRequestV0? > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 119. "If there is no client metrics receiver plugin > >> configured on the > >>>>>>>>>>>>>>> broker, it will respond to GetTelemetrySubscriptionsRequest > >> with > >>>>>>>>>>>>>>> RequestedMetrics set to Null and a -1 SubscriptionId. The > >> client > >>>>>>>>>> should > >>>>>>>>>>>>>>> send a new GetTelemetrySubscriptionsRequest after the > >> PushIntervalMs > >>>>>>>>>>>> has > >>>>>>>>>>>>>>> expired. This allows the metrics receiver to be enabled or > >> disabled > >>>>>>>>>>>> without > >>>>>>>>>>>>>>> having to restart the broker or reset the client > connection." > >>>>>>>>>>>>>>> "no client metrics receiver plugin configured" is defined > by > >> no > >>>>>>>>>> metric > >>>>>>>>>>>>>>> reporter implementing the ClientTelemetry interface, right? > >> In that > >>>>>>>>>>>> case, > >>>>>>>>>>>>>>> it would be useful to avoid the clients sending > >>>>>>>>>>>>>>> GetTelemetrySubscriptionsRequest periodically to preserve > >> the current > >>>>>>>>>>>>>>> behavior. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 120. GetTelemetrySubscriptionsResponseV0 and > >> PushTelemetryRequestV0: > >>>>>>>>>>>> Could > >>>>>>>>>>>>>>> we list error codes for each? > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 121. "ClientTelemetryReceiver.ClientTelemetryReceiver This > >> method may > >>>>>>>>>>>> be > >>>>>>>>>>>>>>> called from the request handling thread": Where else can > >> this method > >>>>>>>>>> be > >>>>>>>>>>>>>>> called? > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 122. DescribeConfigs/AlterConfigs already exist. Are we > >> changing the > >>>>>>>>>>>> ACL? > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Jun > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> On Mon, Jul 31, 2023 at 4:33 AM Andrew Schofield < > >>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Hi Milind, > >>>>>>>>>>>>>>>> Thanks for your question. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> On reflection, I agree that INVALID_RECORD is most likely > >> to be > >>>>>>>>>>>> caused by a > >>>>>>>>>>>>>>>> problem in the serialization in the client. I have changed > >> the > >>>>>>>>>> client > >>>>>>>>>>>>>>>> action in this case > >>>>>>>>>>>>>>>> to “Log an error and stop pushing metrics”. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> I have updated the KIP text accordingly. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>> Andrew > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> On 31 Jul 2023, at 12:09, Milind Luthra > >>>>>>>>>>>> <milut...@confluent.io.INVALID> > >>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Hi Andrew, > >>>>>>>>>>>>>>>>> Thanks for the clarifications. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> About 2b: > >>>>>>>>>>>>>>>>> In case a client has a bug while serializing, it might be > >> difficult > >>>>>>>>>>>> for > >>>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>> client to recover from that without code changes. In > that, > >> it might > >>>>>>>>>>>> be > >>>>>>>>>>>>>>>> good > >>>>>>>>>>>>>>>>> to just log the INVALID_RECORD as an error, and treat the > >> error as > >>>>>>>>>>>> fatal > >>>>>>>>>>>>>>>>> for the client (only fatal in terms of sending the > >> metrics, the > >>>>>>>>>>>> client > >>>>>>>>>>>>>>>> can > >>>>>>>>>>>>>>>>> keep functioning otherwise). What do you think? > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Thanks > >>>>>>>>>>>>>>>>> Milind > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> On Mon, Jul 24, 2023 at 8:18 PM Andrew Schofield < > >>>>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Hi Milind, > >>>>>>>>>>>>>>>>>> Thanks for your questions about the KIP. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> 1) I did some archaeology and looked at historical > >> versions of the > >>>>>>>>>>>> KIP. > >>>>>>>>>>>>>>>> I > >>>>>>>>>>>>>>>>>> think this is > >>>>>>>>>>>>>>>>>> just a mistake. 5 minutes is the default metric push > >> interval. 30 > >>>>>>>>>>>>>>>> minutes > >>>>>>>>>>>>>>>>>> is a mystery > >>>>>>>>>>>>>>>>>> to me. I’ve updated the KIP. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> 2) I think there are two situations in which > >> INVALID_RECORD might > >>>>>>>>>>>> occur. > >>>>>>>>>>>>>>>>>> a) The client might perhaps be using a content-type that > >> the > >>>>>>>>>> broker > >>>>>>>>>>>> does > >>>>>>>>>>>>>>>>>> not support. > >>>>>>>>>>>>>>>>>> The KIP mentions content-type as a future extension, but > >> there’s > >>>>>>>>>>>> only > >>>>>>>>>>>>>>>> one > >>>>>>>>>>>>>>>>>> supported > >>>>>>>>>>>>>>>>>> to start with. Until we have multiple content-types, > this > >> seems > >>>>>>>>>> out > >>>>>>>>>>>> of > >>>>>>>>>>>>>>>>>> scope. I think a > >>>>>>>>>>>>>>>>>> future KIP would add another error code for this. > >>>>>>>>>>>>>>>>>> b) The client might perhaps have a bug which means the > >> metrics > >>>>>>>>>>>> payload > >>>>>>>>>>>>>>>> is > >>>>>>>>>>>>>>>>>> malformed. > >>>>>>>>>>>>>>>>>> Logging a warning and attempting the next metrics push > on > >> the push > >>>>>>>>>>>>>>>>>> interval seems > >>>>>>>>>>>>>>>>>> appropriate. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> UNKNOWN_SUBSCRIPTION_ID would indeed be handled by > making > >> an > >>>>>>>>>>>> immediate > >>>>>>>>>>>>>>>>>> GetTelemetrySubscriptionsRequest. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> UNSUPPORTED_COMPRESSION_TYPE seems like either a client > >> bug or > >>>>>>>>>>>> perhaps > >>>>>>>>>>>>>>>>>> a situation in which a broker sends a compression type > in > >> a > >>>>>>>>>>>>>>>>>> GetTelemetrySubscriptionsResponse > >>>>>>>>>>>>>>>>>> which is subsequently not supported when its used with a > >>>>>>>>>>>>>>>>>> PushTelemetryRequest. > >>>>>>>>>>>>>>>>>> We do want the client to have the opportunity to get an > >> up-to-date > >>>>>>>>>>>> list > >>>>>>>>>>>>>>>> of > >>>>>>>>>>>>>>>>>> supported > >>>>>>>>>>>>>>>>>> compression types. I think an immediate > >>>>>>>>>>>> GetTelemetrySubscriptionsRequest > >>>>>>>>>>>>>>>>>> is appropriate. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> 3) If a client attempts a subsequent handshake with a > Null > >>>>>>>>>>>>>>>>>> ClientInstanceId, the > >>>>>>>>>>>>>>>>>> receiving broker may not already know the client's > >> existing > >>>>>>>>>>>>>>>>>> ClientInstanceId. If the > >>>>>>>>>>>>>>>>>> receiving broker knows the existing ClientInstanceId, it > >> simply > >>>>>>>>>>>> responds > >>>>>>>>>>>>>>>>>> the existing > >>>>>>>>>>>>>>>>>> value back to the client. If it does not know the > existing > >>>>>>>>>>>>>>>>>> ClientInstanceId, it will create > >>>>>>>>>>>>>>>>>> a new client instance ID and respond with that. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> I will update the KIP with these clarifications. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>>>> Andrew > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> On 17 Jul 2023, at 14:21, Milind Luthra > >>>>>>>>>>>> <milut...@confluent.io.INVALID > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Hi Andrew, thanks for this KIP. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> I had a few questions regarding the "Error handling" > >> section. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> 1. It mentions that "The 5 and 30 minute retries are to > >>>>>>>>>> eventually > >>>>>>>>>>>>>>>>>> trigger > >>>>>>>>>>>>>>>>>>> a retry and avoid having to restart clients if the > >> cluster > >>>>>>>>>> metrics > >>>>>>>>>>>>>>>>>>> configuration is disabled temporarily, e.g., by > operator > >> error, > >>>>>>>>>>>> rolling > >>>>>>>>>>>>>>>>>>> upgrades, etc." > >>>>>>>>>>>>>>>>>>> But this 30 min interval isn't mentioned anywhere else. > >> What is > >>>>>>>>>> it > >>>>>>>>>>>>>>>>>>> referring to? > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> 2. For the actual errors: > >>>>>>>>>>>>>>>>>>> INVALID_RECORD : The action required is to "Log a > >> warning to the > >>>>>>>>>>>>>>>>>>> application and schedule the next > >>>>>>>>>> GetTelemetrySubscriptionsRequest > >>>>>>>>>>>> to 5 > >>>>>>>>>>>>>>>>>>> minutes". Why is this 5 minutes, and not something like > >>>>>>>>>>>> PushIntervalMs? > >>>>>>>>>>>>>>>>>> And > >>>>>>>>>>>>>>>>>>> also, why are we scheduling a > >> GetTelemetrySubscriptionsRequest in > >>>>>>>>>>>> this > >>>>>>>>>>>>>>>>>>> case, if the serialization is broken? > >>>>>>>>>>>>>>>>>>> UNKNOWN_SUBSCRIPTION_ID , UNSUPPORTED_COMPRESSION_TYPE > : > >> just to > >>>>>>>>>>>>>>>> confirm, > >>>>>>>>>>>>>>>>>>> the GetTelemetrySubscriptionsRequest needs to be > >> scheduled > >>>>>>>>>>>> immediately > >>>>>>>>>>>>>>>>>>> after the PushTelemetry response, correct? > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> 3. For "Subsequent GetTelemetrySubscriptionsRequests > >> must include > >>>>>>>>>>>> the > >>>>>>>>>>>>>>>>>>> ClientInstanceId returned in the first response, > >> regardless of > >>>>>>>>>>>> broker": > >>>>>>>>>>>>>>>>>>> Will a broker error be returned in case some > >> implementation of > >>>>>>>>>>>> this KIP > >>>>>>>>>>>>>>>>>>> violates this accidentally and sends a request with > >>>>>>>>>>>> ClientInstanceId = > >>>>>>>>>>>>>>>>>> Null > >>>>>>>>>>>>>>>>>>> even when it's been obtained already? Or will a new > >>>>>>>>>>>> ClientInstanceId be > >>>>>>>>>>>>>>>>>>> returned without an error? > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Thanks! > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield < > >>>>>>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote: > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> Hi, > >>>>>>>>>>>>>>>>>>>> I would like to start a new discussion thread on > >> KIP-714: Client > >>>>>>>>>>>>>>>> metrics > >>>>>>>>>>>>>>>>>>>> and observability. > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>> > >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> I have edited the proposal significantly to reduce the > >> scope. > >>>>>>>>>> The > >>>>>>>>>>>>>>>>>> overall > >>>>>>>>>>>>>>>>>>>> mechanism for client metric subscriptions is > unchanged, > >> but the > >>>>>>>>>>>>>>>>>>>> KIP is now based on the existing client metrics, > rather > >> than > >>>>>>>>>>>>>>>> introducing > >>>>>>>>>>>>>>>>>>>> new metrics. The purpose remains helping cluster > >> operators > >>>>>>>>>>>>>>>>>>>> investigate performance problems experienced by > clients > >> without > >>>>>>>>>>>>>>>>>> requiring > >>>>>>>>>>>>>>>>>>>> changes to the client application code or > configuration. > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>>>>>> Andrew > >>> > >>> > >> > > >